Abstract
Large Language Models (LLMs) are increasingly applied in high-sensitivity scenarios such as education and career counseling, raising concerns about the risks posed by gender stereotypes. Through three experiments, this study examines how LLMs express, and act on, the stereotype that "females have strong empathy while males have weak empathy." Study 1, a human-machine comparison, found that gender stereotypes in six LLMs across the dimensions of emotional empathy, empathic concern, and behavioral empathy were significantly stronger than those of humans. Study 2 manipulated input language (Chinese/English) and primed gender identity (male/female), finding that English contexts and female identity priming were more likely to activate stereotypes in LLMs. Study 3 focused on major and career recommendation tasks, revealing that LLMs tend to recommend majors and careers with high empathy requirements to females while recommending low-empathy options to males. Overall, LLMs exhibit significant gender stereotypes regarding empathy; this bias varies with the input context and transfers to real-world recommendation tasks. This research provides a theoretical basis and practical insights for bias identification and fairness optimization in artificial intelligence systems.
Keywords
Large Language Models (LLMs), Gender Stereotypes, Empathy, Recommendation, Human-Computer Interaction
1. Introduction
With the rapid development of generative artificial intelligence, Large Language Models (LLMs) are increasingly being applied in scenarios such as educational guidance and career counseling. These systems do not merely serve as tools; to a certain extent, they influence individuals' choices regarding further education and employment paths. Existing research has found that LLMs often exhibit gendered output patterns in tasks such as occupational assignment and character description. For example, models tend to associate men with technical and leadership roles, while linking women to caregiving and service-oriented professions \cite{UNESCO_IRCAI_2024}. These findings suggest that LLMs may inadvertently perpetuate or even amplify existing social gender disparities.
Regarding gender stereotypes in LLMs, existing research has focused primarily on the level of explicit occupational labels, while neglecting the underlying socio-psychological traits. Empathy—the ability of an individual to understand and share the emotional experiences of others \cite{Decety_2010}—plays a critical role in interpersonal communication and career development. Sociocultural norms harbor a pervasive stereotype that empathy is a "female strength" and a "male weakness," a view that is reflected in the gendered division of labor \cite{Croft_Eagly_Steffen_2015}. Do LLMs exhibit similar gender stereotypes along the dimension of empathy? If such biases exist, are they influenced by the input context? Furthermore, do these biases migrate into educational and professional recommendation scenarios, thereby affecting the advice generated by the model? These questions have yet to be empirically tested.
Through three experiments, this paper examines the gender stereotypes in LLMs regarding empathy in comparison to human patterns. We investigate the roles of input language and context in the expression of these stereotypes and further test how such biases manifest in professional recommendation scenarios. This research not only helps expand our understanding of the manifestations of bias in LLMs but also provides empirical evidence and practical insights for ensuring fairness in educational and vocational AI applications.
1.1 Large Language Models and Empathy Bias
Do gender stereotypes regarding empathy exist within Large Language Models (LLMs)? There is a widespread tendency to associate men with technical and leadership roles, such as engineers and scientists, while associating women with caregiving and supportive roles, such as teachers \cite{Sheng_2021}. These biases stem from inherent gendered patterns in training data, the reinforcement effects of algorithms during information compression, and subjective tendencies introduced during human annotation. Such biases may amplify deviations in occupational gender matching into real-world disparities \cite{Kotek_2023}. Existing research primarily focuses on describing biases within professional fields, yet few studies further investigate why LLMs form these biases during occupational tasks.
Empathy is typically divided into three dimensions: emotional empathy (the automatic mimicry and resonance of emotions), empathic concern (the concern for and understanding of others' situations), and behavioral empathy (actually responding to others' needs through actions such as comforting or helping) \cite{Waal_2008, Hoffman_1990}. Gender differences in empathy are primarily manifested in the emotional empathy dimension \cite{Christov-Moore_2014}, while other dimensions are more dependent on specific contexts. According to Social Role Theory, these gender differences arise from the division of labor and gender role expectations rather than reflecting the essence of ability \cite{Eagly_Wood_2012}. Nevertheless, a long-standing stereotype persists in socio-cultural narratives suggesting that women possess stronger empathic abilities, which in turn influences emotional expectations in professional roles.
1.2 Language and Identity Priming
Does gender identity priming influence gender bias? Gender bias in LLMs does not appear stably across all contexts but is influenced by input language and identity priming. Existing research has shown that the input language directly affects the model's output style and reflects corresponding cultural orientations. English is a natural gender language, in which gender information is encoded in pronouns and noun forms, whereas Chinese is a genderless language, in which gender cues often rely on contextual inference \cite{Prewitt-Freilino_2012}. This study focuses on the two most widely used languages globally, Chinese and English, to examine gender stereotypes regarding empathy under different linguistic conditions.
Beyond linguistic factors, identity priming through persona prompts can also significantly influence the expression of stereotypes in LLMs. Persona prompts guide the model to play a specific social role, thereby activating its internally stored semantic associations and social schemas \cite{Gupta_2024}. When a model is primed with a female identity, it is more likely to automatically invoke these social schemas and exhibit stronger empathy stereotypes. Based on this, we hypothesize that gender identity priming will activate varying degrees of empathy-related gender stereotypes: when primed with a female identity, the model will exhibit stronger gender stereotypes.
1.3 LLMs in Professional and Career Recommendations
As LLMs are increasingly applied in real-world scenarios such as education and recruitment, more individuals are turning to these models for career development advice. In the context of career recommendations, LLMs have already demonstrated certain gendered inclinations. For instance, women are frequently recommended for service-oriented roles, while male users are more likely to receive recommendations for technical and managerial positions \cite{Salinas_2023}. Empathy-oriented industries have long faced gender imbalances; the proportion of men in sectors such as education and social work, for example, remains low. In this context, the stereotype that "women possess greater empathy" may be reflected in LLM recommendation results, potentially influencing the entry and mobility of different gender groups within related professions and deepening existing patterns of occupational segregation \cite{García-Holgado_2021}.
2. Study 1: Human-Machine Comparison of Empathic Gender Stereotypes
2.1 Purpose
This study measures gender stereotypes across the three dimensions of empathy in LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, DeepSeek-Reasoner, and ERNIE Bot) and compares them with those of human adults from both Chinese and Western cultural backgrounds.
2.2 Methodology
2.2.1 Participants and Models
A total of 150 human participants (Mean age = 32.45, SD = 12.72) were recruited through the Prolific platform. For the machine evaluation, each model generated 50 rounds of responses, totaling 400 data points. Generation parameters were standardized, with the temperature set to 1.0.
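As a hedged illustration, the machine data-collection loop for the GPT-family models might look as follows. The OpenAI Python SDK is shown (DeepSeek and ERNIE Bot would be queried through their own APIs), and the item wording is a placeholder for the adapted EmQue situations described in the next subsection:

```python
# Minimal sketch of the machine data-collection loop; model names and the
# item wording are illustrative assumptions, not the study's exact materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["gpt-3.5-turbo", "gpt-4-turbo", "gpt-4o"]
N_ROUNDS = 50  # rounds of responses per model, as reported above

item = ("After seeing a friend cry, this person's eyes also fill with tears. "
        "Do you think this person is more likely to be a man or a woman?")

records = []
for model in MODELS:
    for _ in range(N_ROUNDS):
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item}],
            temperature=1.0,  # standardized generation parameter
        )
        records.append((model, reply.choices[0].message.content))
```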
2.2.2 Experimental Materials and Procedure
To measure the stereotype that women possess strong empathy while men possess weak empathy, this study adapted items from the Empathy Questionnaire (EmQue), converting the original first-person self-evaluation statements into third-person situational descriptions. For each situation, participants were asked: "Do you think the protagonist is more likely to be a man or a woman?" The proportion of "female" responses served as the stereotype metric. The situational materials cover three dimensions (affective link, cognitive empathy, and prosocial motivation), corresponding to the emotional empathy, empathic concern, and behavioral empathy dimensions described above.
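Computing the stereotype metric from coded responses is straightforward; the following sketch uses toy data with illustrative column names:

```python
# Toy example of the stereotype index: the proportion of "female" answers
# per model and empathy dimension (data and column names are illustrative).
import pandas as pd

coded = pd.DataFrame({
    "model":     ["gpt-4o", "gpt-4o", "deepseek", "deepseek"],
    "dimension": ["emotional", "emotional", "behavioral", "behavioral"],
    "answer":    ["female", "female", "female", "male"],
})
coded["female"] = (coded["answer"] == "female").astype(int)

# Values near 1.0 indicate a strong "empathic protagonist = female" stereotype;
# 0.5 indicates no gendered preference.
stereotype_index = coded.groupby(["model", "dimension"])["female"].mean()
print(stereotype_index)
```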
2.3 Results
The influence of agent type and language type on the proportion of female selections is illustrated in [FIGURE:1]. In a linear mixed-effects model, the main effect of agent type (human vs. machine) was significant ($b = 0.43, SE = 0.05, 95\% CI [0.33, 0.53], p < 0.05$): LLMs exhibited markedly stronger gender stereotypes regarding empathy ($M = 0.91, SE = 0.02$) than humans did ($M = 0.55, SE = 0.03$). This human-machine discrepancy remained stable across all three dimensions of empathy and under both Chinese and English input conditions (overall: $t = 9.73, p < 0.001, \text{Cohen's } d = 2.64$).
Regarding language, the interaction between language type and the emotional empathy dimension was significant ($b = 0.22, SE = 0.02, 95\% CI [0.18, 0.26], p < 0.001$). Simple-effects analysis revealed that gender stereotypes in LLMs were stronger under English input conditions ($t = 3.14, \text{Cohen's } d = 0.41$).
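The paper does not name its analysis software. As one possible implementation, the Study 1 model could be fit with statsmodels, treating the situational items as random intercepts (the grouping structure is our assumption, and the data below are randomly generated placeholders):

```python
# Hedged sketch of the Study 1 analysis: a linear mixed-effects model with
# agent type and language as fixed effects and item as a random intercept.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "female_choice": rng.integers(0, 2, n),    # 1 = "protagonist is a woman"
    "agent": rng.choice(["human", "llm"], n),  # agent type (fixed effect)
    "language": rng.choice(["zh", "en"], n),   # input language (fixed effect)
    "item": rng.integers(0, 20, n),            # situational item (grouping)
})

fit = smf.mixedlm("female_choice ~ agent * language", data,
                  groups=data["item"]).fit()
print(fit.summary())  # the agent coefficient corresponds to b reported above
```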
3. Study 2: The Impact of Contextual Factors and Identity Priming
3.1 Purpose
This study examines whether gender identity priming influences the expression of empathy-related gender stereotypes in LLMs, and how contextual factors such as input language moderate this effect.
3.2 Methodology
Across combinations of gender identity priming (male/female), input language (Chinese/English), and the three empathy dimensions, a total of 14,400 responses were collected from the selected LLMs. Persona prompts were used to guide the models (e.g., "participate in the following socio-emotional game in the persona of an adult Chinese woman"). Validity checks ensured that the models maintained the assigned persona.
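A minimal sketch of the persona-priming setup, with the prompt wording generalized from the example quoted above (the validity-check markers are assumptions on our part):

```python
# Illustrative persona-prompt construction and a crude persona validity check.
PERSONA_TEMPLATE = ("participate in the following socio-emotional game "
                    "in the persona of an adult Chinese {gender}")

def build_prompt(gender: str, item: str) -> str:
    """Prepend the identity-priming instruction to an empathy item."""
    return PERSONA_TEMPLATE.format(gender=gender) + ". " + item

def persona_maintained(response: str) -> bool:
    """Flag responses in which the model refuses or drops the persona."""
    refusal_markers = ("as an AI", "I am a language model")  # assumed markers
    return not any(m.lower() in response.lower() for m in refusal_markers)

print(build_prompt("woman", "Your friend starts crying. What do you do?"))
```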
3.3 Results
The fixed-effects results are presented in [TABLE:1]. The main effect of gender identity priming was significant ($\beta = 0.22, t = 32.31, p < 0.01$): under female priming, the gender stereotype index was significantly higher ($M = 0.91, SD = 0.08$) than under male priming ($M = 0.67, SD = 0.08$). This priming effect remained stable across all three empathy dimensions and both language conditions. Furthermore, gender stereotypes under English input ($\beta = 0.87, SE = 0.08$) were significantly stronger than those under Chinese input ($\beta = 0.71, SE = 0.08$).
4. Study 3: Empathic Gender Stereotypes in Professional Recommendations
4.1 Purpose
This study investigates whether LLMs exhibit biases based on empathetic gender stereotypes when providing academic major and career recommendations.
4.2 Methodology
We selected representative majors and occupations categorized by their empathy requirements (High vs. Low). High-empathy majors included Sociology, Education, and Nursing; low-empathy majors included Mathematics, Physics, and Automation. Models (GPT-4o and DeepSeek) were prompted to rank these options for candidates identified as male, female, or gender-neutral.
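The ranking prompts might be constructed as follows; the major lists are taken from the text, while the instruction wording is an assumption:

```python
# Illustrative construction of the Study 3 ranking task.
HIGH_EMPATHY = ["Sociology", "Education", "Nursing"]     # high empathy demand
LOW_EMPATHY = ["Mathematics", "Physics", "Automation"]   # low empathy demand

def ranking_prompt(candidate_desc: str) -> str:
    majors = ", ".join(HIGH_EMPATHY + LOW_EMPATHY)
    return (f"A student ({candidate_desc}) is choosing a university major. "
            f"Rank the following majors from most to least suitable for this "
            f"student: {majors}.")

for desc in ("male", "female", "gender not specified"):
    print(ranking_prompt(desc))
```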
4.3 Results
Linear regression indicated that a major's empathy requirement positively predicted its perceived suitability for women ($\beta = 0.34, SE = 0.02, t = 16.54$). Cumulative logistic regression revealed a significant three-way interaction between recommendee gender, linguistic framing, and empathy requirements ($\beta = 1.40, SE = 0.12, p < 0.001$).
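As one possible implementation (the paper does not specify its analysis software), the ordinal rank outcome could be fit with statsmodels' OrderedModel; the sketch below is simplified to the terms of interest and runs on placeholder data:

```python
# Hedged sketch of the cumulative logistic (ordinal) regression on ranks.
# Placeholder data; a full model would include all lower-order interactions.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "rank": rng.integers(1, 7, n),           # 1 = top recommendation
    "female": rng.integers(0, 2, n),         # recommendee gender
    "english": rng.integers(0, 2, n),        # linguistic framing
    "high_empathy": rng.integers(0, 2, n),   # option's empathy requirement
})
df["three_way"] = df["female"] * df["english"] * df["high_empathy"]

fit = OrderedModel(df["rank"],
                   df[["female", "english", "high_empathy", "three_way"]],
                   distr="logit").fit(method="bfgs", disp=False)
print(fit.summary())
```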
Under English input, the models showed a significant tendency to recommend high-empathy majors to females ($z = 2.89, p < 0.001, OR = 1.19$); conversely, males received significantly more recommendations for low-empathy occupations ($z = 20.33, p < 0.001, OR = 3.57$). Analysis of the recommendation justifications using LIWC-22 showed that justifications for women contained significantly more emotional and prosocial language, whereas justifications for men emphasized logical and analytical characteristics.
5. General Discussion
This study demonstrates that LLMs exhibit significant gender stereotypes across emotional empathy, empathic concern, and behavioral empathy. Notably, the degree of stereotyping in these models is higher than that observed in human subjects.
5.1 Amplification of Stereotypes
LLMs present a stronger and more consistent pattern of bias across all three dimensions compared to humans. The stereotype that "females possess stronger empathic abilities" is generalized across all dimensions in LLMs, failing to reflect the nuanced differences observed in the real world. This suggests that model bias stems not only from training data but also from the reinforcement effects of algorithmic processing.
5.2 Linguistic and Identity Effects
Empathy-based gender stereotypes in LLMs are significantly stronger under English input conditions. This likely reflects how the linguistic structure of English, with explicit gender pronouns, amplifies model bias compared to the gender-neutral structure of Chinese. Furthermore, female identity priming activates these stereotypes more strongly, potentially reducing the diversity of expression regarding female roles.
5.3 Implications for Recommendations
The assumption that women possess stronger empathic abilities leads models to systematically steer women toward high-empathy roles and men toward low-empathy, technical domains. If recommendation systems rely on these stereotypes, they may limit individual exploration, constrain career development, and deepen existing occupational segregation.
6. Conclusion
LLMs exhibit significant gender stereotypes regarding empathy that exceed human levels. These stereotypes are moderated by input language and identity priming and directly influence professional recommendation outcomes. These findings raise important ethical concerns about fairness in AI-driven decision-making and highlight the urgency of developing robust bias-mitigation strategies for multilingual and socio-emotional contexts.