Emotional Capability Assessment of Multimodal Large Language Models in Dynamic Social Interaction Scenarios
Zhou Zisen, Huang Qi, Tan Zehong, Liu Rui, Cao Ziheng, Mu Fangman, Fan Yachun, Qin Shaozheng
Submitted 2025-09-10 | ChinaXiv: chinaxiv-202509.00064

Abstract

Multimodal Large Language Models (MLLMs) can process and integrate multimodal data such as images and text, providing a powerful tool for understanding human psychology and cognitive behavior. Drawing on classic paradigms from emotion psychology, this study disentangles the distinct roles of visual conversational features (images) and conversational content (text) in recognizing and inferring characters' emotions by comparing the performance of two mainstream MLLMs and human participants on emotion recognition and emotion inference in dynamic social interaction scenarios. The results indicate that MLLMs' emotion recognition and inference based on conversational images and content show moderate or weaker correlations with human performance. Although a noticeable gap remains, MLLMs have begun to demonstrate emotion recognition and inference abilities similar to those of human participants in dyadic interactions. Using human performance as a reference, the study further compared MLLMs' emotion recognition and inference under three conditions: conversational images only, conversational content only, and both combined. Visual conversational features constrained MLLMs' basic emotion recognition to some extent but effectively facilitated complex emotion recognition, while having no significant impact on emotion inference. Comparing two mainstream MLLMs and their different versions (GPT-4-vision/turbo vs. Claude-3-haiku) showed that innovation in technical frameworks matters more than simply scaling up training data for improving MLLMs' emotion recognition and inference abilities in social interactions. These findings hold important scientific value for understanding the psychological mechanisms of emotion recognition and inference in social interactions and for inspiring human-like affective computing and intelligent algorithms.

Full Text

Emotional Capability Assessment of Multimodal Large Language Models in Dynamic Social Interaction Scenarios

Zisen Zhou¹, Qi Huang¹, Zehong Tan², Rui Liu³, Ziheng Cao⁴, Fangman Mu⁵, Yachun Fan², Shaozheng Qin¹*

¹ State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing 100875, China
² School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
³ School of Business Administration, Inner Mongolia University of Finance and Economics, Hohhot 010070, China
⁴ Alibaba Group, Hangzhou 310020, China
⁵ School of Mathematics and Computer Science, Chuxiong Normal University, Chuxiong 675000, China

Abstract

Multimodal Large Language Models (MLLMs) can process and integrate multimodal data such as images and text, providing powerful tools for understanding human psychology and cognitive behavior. Combining classic paradigms from emotion psychology, this study compared the emotion recognition and inference performance of two mainstream MLLMs with human participants in dynamic social interaction scenarios, aiming to disentangle the distinct roles of visual conversational features (images) and conversational content (text) in recognizing and inferring characters' emotions. Results showed that MLLMs' emotion recognition and inference based on conversational images and content exhibited moderate or weaker correlations with human performance. Despite a noticeable gap, MLLMs have demonstrated preliminary capabilities similar to humans in emotion recognition and inference during dyadic interactions. Using human performance as a benchmark, we further compared MLLMs across three conditions: using only conversational images, only conversational content, or both combined. Visual conversational features constrained basic emotion recognition performance to some extent but effectively facilitated complex emotion recognition, while showing no significant impact on emotion inference. By comparing two mainstream MLLMs and their different versions (GPT-4-vision/turbo vs. Claude-3-haiku), we found that innovations in technical frameworks are more important than simply scaling training data for enhancing MLLMs' emotion recognition and inference capabilities in social interactions.

These findings hold significant scientific value for understanding the psychological mechanisms of emotion recognition and inference in social interactions and for inspiring human-like affective computing and intelligent algorithms.

Keywords: multimodal large language model, social interaction, emotion recognition, emotion inference
Classification Number: B842
Received: 2024-06-23
Funding: National Natural Science Foundation of China Key Project (32130045)
Corresponding Authors: Yachun Fan, fanyachun@bnu.edu.cn; Shaozheng Qin, szqin@bnu.edu.cn

Emotion plays a crucial role in individuals' adaptation to natural and social environments, coping with various stressors, and maintaining mental health. Therefore, a deep understanding of emotion generation mechanisms is essential for revealing human psychological functions. Emotion generation involves a series of complex physiological, psychological, and cognitive processes, such as evaluating and judging external emotional stimuli (Lazarus, 1991) and producing and adjusting emotional expressions like facial expressions and body movements (Ekman, 1993). Emotion is not merely an individual internal phenomenon but also emerges through interactions with others, regulated by social norms and goals, expressed in social contexts, and influences others (Van Kleef & Côté, 2022). Consequently, in social interactions, individuals require at least two emotional capabilities: first, emotion recognition ability—accurately identifying and judging others' internal emotional states based on their emotional expressions to efficiently evaluate external emotional stimuli; second, emotion inference ability—inferring and anticipating the impact of one's own emotional expressions on others to strategically regulate and manage one's expressions accordingly. The coordinated operation of emotion recognition and inference capabilities enables individuals to achieve more effective adaptation and regulation in complex and dynamic social interaction contexts.

Emotion recognition and inference depend on emotional expression, which arises from the organic coordination of nonverbal and verbal information. Nonverbal cues such as facial expressions and body movements can directly convey affective states, motivations, and intentions (Ekman, 1993; Mehrabian, 2017), while verbal information provides necessary supplementation and refined representation of complex emotional content and contextual details (Buck, 1985). Although traditional psychological experiments have revealed interactive effects at the perception and judgment levels by manipulating the consistency or conflict between these modalities (McGurk & MacDonald, 1976; De Gelder & Vroomen, 2000), such paradigms are constrained by static stimuli and highly controlled experimental settings, making it difficult to systematically simulate and predict complex cross-modal information integration processes.

To overcome these limitations, this study introduces Multimodal Large Language Models (MLLMs) to investigate the roles of visual conversational features and conversational content in emotional expression. MLLMs can simultaneously process multimodal data including images and text, providing a powerful computational framework for studying the interaction between nonverbal and verbal information (Zhang et al., 2024). By integrating information from different modalities, MLLMs can capture complex emotional expressions and social cues, thereby achieving a more comprehensive understanding of human affective cognition. Moreover, MLLMs provide researchers with a flexible tool to systematically manipulate and control different modal information without laboratory environment constraints.

Meanwhile, to address the limitation that traditional psychological experiments often rely on static images and text materials that fail to capture the complexity of real social interactions (Schilbach et al., 2013), this study constructed a dynamic social interaction scenario dataset that integrates visual conversational features (images) and conversational content (text), enabling investigation of different modal data's roles in emotional expression. Given the difficulties in collecting dynamic social interaction data—such as challenges in obtaining authentic natural social interaction scenes and privacy protection concerns (Vinciarelli et al., 2009)—this study utilized film and television materials to construct the evaluation dataset. Film and television works offer high ecological validity in terms of emotional expression richness and social interaction authenticity (Busso et al., 2008), providing reliable data sources for emotional capability evaluation. Based on theories of cultural specificity in emotional expression (Matsumoto et al., 2008), this study selected Chinese-language film and television materials to ensure the evaluation tool's applicability and effectiveness within a specific cultural context. Regarding scenario selection, this study focused on dyadic conversation clips from film and television materials. As the most basic unit of social interaction, this design preserves core interaction features such as turn-taking and nonverbal synchrony while avoiding information overload that might occur in multi-person interaction scenarios (Clark & Schaefer, 1989). Additionally, regarding emotion annotation, this study employed emotion soft labels to represent emotional states as probability distributions. Compared to single-category emotion labels, soft labels can more precisely capture subtle differences in emotional expression, reveal the multidimensionality and complexity of emotions in interactive contexts, and thereby enhance the ecological validity of the evaluation (Fayek et al., 2016; Sridhar et al., 2021).

Based on the above, this study focuses on two emotional capabilities closely related to social interactions: first, as a listener, the ability to recognize the speaker's emotions based on their emotional expressions (Li & Deng, 2020); second, as a speaker, the ability to infer the listener's emotions based on one's own emotional expressions (Zhao et al., 2021; Pollmann & Finkenauer, 2009). Using dyadic conversation clips from Chinese-language films, we constructed a dynamic social interaction scenario evaluation dataset that integrates visual conversational features and conversational content. Combined with cognitive-behavioral experimental design, we first compared the emotion recognition and inference performance of multiple MLLMs with human participants to explore whether MLLMs possess human-like emotion recognition and inference capabilities in dynamic social interactions. We then analyzed the performance of GPT-4-vision across different modalities to examine the influence of visual conversational features and conversational content on emotional expression. Furthermore, by comparing MLLMs with different technical principles (GPT-4-vision/turbo vs. Claude-3-haiku) and different training data scales (GPT-4-vision vs. GPT-4-turbo), we investigated the roles of technical architecture and data scale in emotional capability development. Methodologically, this study primarily compared different MLLMs and their modalities through zero-shot performance (Wang et al., 2019), then examined the stability of zero-shot comparison results through repeated measurements to achieve accurate evaluation of MLLMs' emotion recognition and inference performance in dynamic social interaction scenarios.

2.1 Evaluation Dataset

The dynamic social interaction scenario evaluation dataset constructed in this study consists of dyadic conversation clips selected from 15 Chinese-language films. Each conversation segment has a duration of no less than 30 seconds, with no fewer than 3 conversational turns and no fewer than 6 utterances (see Table 1 [TABLE:1]).

Table 1 Information on Dyadic Conversation Clips

Film Title | Duration | Turns* | Utterances | Location | Relationship | Characters
From Beijing with Love | 1 min 44 sec | | | | |
Goodbye Mr. Loser | | | | | |
Finding Mr. Right | | | | | |
Under the Hawthorn Tree | | | | | |
Love is Not Blind | | | | | |
Dying to Survive | | | | | |

Note: One turn refers to each participant speaking once.

Additionally, the evaluation dataset provides location, character images, and relationship information for each dyadic conversation. Considering that pretrained MLLMs might know characters' histories through their names, all character names in the dataset were replaced with character codes (see Table 2 [TABLE:2]).

Table 2 Dyadic Conversation Scenario Information

Film Title | Location | Characters (Code) | Relationship
From Beijing with Love | School parking shed | A: Ling Ling-qi, B: Li Xiang-qin | B appears to cooperate with A but secretly carries a mission to assassinate A
 | | | K wants to endure hardship, D lets K endure two months in the countryside
Goodbye Mr. Loser | | | A and F are strangers
Finding Mr. Right | B's home | A: Wen Jia-jia, B: Frank/Hao Zhi | B is a driver at a Seattle maternity center, A is a pregnant woman who flew from China
Under the Hawthorn Tree | B and M's home | | M is B's mother
Love is Not Blind | A's psychological counseling studio | |
 | A and B's temporary residence | |
 | A's dormitory | |
 | County magistrate's office | |
Dying to Survive | E's psychological counseling studio | |

For each utterance by characters in the dyadic conversation clips, the evaluation dataset includes: (1) the conversational content text with preceding context (from the beginning of the conversation clip to the current turn), and (2) three uniformly sampled frames from the video (character images containing facial expressions, body language, etc., sampled from the beginning, middle, and end). This information is used for emotion recognition and inference capability evaluation in both MLLMs and human participants.
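To make the structure of one evaluation item concrete, the following Python sketch shows how such a record could be represented; the class and field names are illustrative assumptions rather than the dataset's actual format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UtteranceItem:
    """One evaluation item: a single utterance from a dyadic conversation clip.

    Class and field names are illustrative; the paper does not specify a storage format.
    """
    film: str                 # source film title
    speaker_code: str         # anonymized code of the character speaking, e.g., "A"
    listener_code: str        # anonymized code of the character listening, e.g., "B"
    location: str             # scene location provided with the clip
    relationship: str         # relationship information between the two characters
    context_text: List[str] = field(default_factory=list)  # dialogue from the clip start to the current turn
    frame_paths: List[str] = field(default_factory=list)   # three uniformly sampled frames (start/middle/end)

# Hypothetical example item
item = UtteranceItem(
    film="Finding Mr. Right",
    speaker_code="A",
    listener_code="B",
    location="B's home",
    relationship="B is a driver at a Seattle maternity center; A is a pregnant woman who flew from China",
    context_text=["A: ...", "B: ...", "A: ..."],
    frame_paths=["clip07_utt03_start.jpg", "clip07_utt03_mid.jpg", "clip07_utt03_end.jpg"],
)
```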

2.2 Design and Measurement of Emotion Recognition and Inference

This study evaluates two emotional capabilities: emotion recognition and emotion inference. For each utterance in the dyadic conversation scenarios, both MLLMs and human participants can adopt different character perspectives to evaluate emotions: as a listener, recognizing the speaker's emotion; or as a speaker, inferring the listener's emotion (see Figure 1 [FIGURE:1]).

Figure 1 Design and Description of Emotional Capability Evaluation

During emotion recognition and inference tasks, both MLLMs and human participants are provided with 16 selectable emotion labels, including 4 basic emotions: Amusement, Anger, Sadness, Surprise, and 12 complex emotions: Awe, Concentration, Confusion, Contempt, Contentment, Desire, Disappointment, Doubt, Elation, Interest, Pain, Triumph. Although these 16 labels do not cover all possible emotions, they have facial movement patterns that can be effectively identified by DNNs (Cowen et al., 2021), are preserved across multiple cultures (Cordaro et al., 2018; Cowen et al., 2019; Cordaro et al., 2020), and can explain emotional dimensions such as valence, arousal, and avoidance (Cowen & Keltner, 2020; Cowen et al., 2019).

2.3 Evaluation Methods

This study used G*Power 3.1 to estimate sample size, calculating a required sample size of N = 23 (Effect size f = 0.25; α = 0.05, 1 - β = 0.80, single-factor two-level repeated-measures design).

2.3.1 Human Participant Evaluation Method

This study collected data from 36 human participants (21 females, 15 males; mean age 25.33 years, SD = 3.57 years) with valid responses. All participants completed the experiment through the Wenjuanxing platform, performing emotional capability evaluations from both character perspectives for all dyadic conversation clips in the dataset, and received corresponding compensation upon completion.

As shown in Figure 2 [FIGURE:2], before each conversation segment, participants were provided with character images, character codes, relationship information, and location details, and asked to select a character perspective for the subsequent experiment. During the experiment, the conversational text content and three corresponding uniformly sampled frames were presented sequentially. Previously shown text remained visible, while newly appearing text (i.e., the content to be judged) was highlighted in red. When the utterance to be judged was spoken by the participant's selected character, the instruction read: "What emotion do you think {other character's code} felt while listening to your last sentence?" When spoken by the other character, it read: "What emotion do you think {other character's code} felt when saying the last sentence?" Participants selected 1-3 emotion labels from the 16 options and ranked them by degree of match. After completing the entire conversation, participants were instructed to switch to the other character's perspective and repeat the same conversation segment. All questions had no correct or incorrect answers; participants were instructed to respond based on their understanding of the scenario and characters.

Figure 2 Example of Human Participant Evaluation Flowchart

2.3.2 MLLM Evaluation Method

To obtain MLLM emotion recognition and inference results in batches via scripts, this study called the models through their APIs. Calling a model's API is essentially equivalent to interacting with it through the user interface: both use prompts to obtain model outputs.

This study obtained MLLM zero-shot emotion recognition and inference results through a single API call for analyzing and comparing performance across different MLLMs and modalities. Subsequently, 25 repeated API calls were made to obtain representative repeated-measurement results for examining the stability of zero-shot performance.

The prompt for MLLM dual-modality zero-shot emotion recognition and inference evaluation included: character images, character codes, relationship information, location, conversational content with preceding context, and three uniformly sampled frames of the current utterance. The model was asked to adopt either the listener's perspective for emotion recognition or the speaker's perspective for emotion inference, and output probabilities for the 16 emotion labels. Example prompts for zero-shot emotion recognition and inference are shown in Figures S1 and S2, respectively.

Based on the dual-modality zero-shot prompt, the image-only modality prompt removed linguistic information containing preceding conversational content, while the text-only modality prompt removed image information including character images and the three sampled frames. For repeated-measurement evaluation, the output requirement was modified to selecting 1-3 emotion labels from the 16 options and ranking them by degree of match (Figures S3-S4).

To enable better affective analysis of characters, the prompt began with a "research assistant" system role setting and included keywords at the end such as "clear movie-related memories," "can make any comments about characters in the images," and "comments must be based on characters' emotions in the images."
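As an illustration of how such a dual-modality zero-shot call could be issued, the sketch below uses the OpenAI Python SDK with the UtteranceItem record sketched in Section 2.1; the model name, prompt wording, and temperature defaults are assumptions for illustration and do not reproduce the exact prompts shown in Figures S1-S2.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    """Base64-encode an image file so it can be embedded in the prompt."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def zero_shot_emotion_recognition(item, labels, model="gpt-4-vision-preview"):
    """Dual-modality zero-shot emotion recognition for one utterance (sketch)."""
    # Text part: scenario information plus the dialogue context up to the current turn.
    text_prompt = (
        f"Location: {item.location}\n"
        f"Relationship: {item.relationship}\n"
        "Dialogue so far:\n" + "\n".join(item.context_text) + "\n\n"
        f"As the listener ({item.listener_code}), estimate the probability that the "
        f"speaker ({item.speaker_code}) feels each of these emotions while saying the "
        f"last sentence: {', '.join(labels)}. Return one probability per label."
    )
    # Image part: the three uniformly sampled frames of the current utterance.
    image_parts = [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{encode_image(p)}"}}
        for p in item.frame_paths
    ]
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": ("You are a research assistant analyzing characters' emotions. "
                         "Clear any movie-related memories and base all comments on the "
                         "characters' emotions in the images.")},
            {"role": "user",
             "content": [{"type": "text", "text": text_prompt}] + image_parts},
        ],
    )
    return response.choices[0].message.content
```

In this sketch, the text-only condition would simply omit the image parts and the image-only condition would omit the dialogue text, mirroring the modality ablations described above; the repeated-measurement evaluation reissues the same request 25 times.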

To handle the few erroneous MLLM responses, zero-shot evaluation results were manually reviewed to ensure that emotion labels and probabilities were returned, although returned labels were not strictly required to belong to the 16 provided options. Even so, over 90% of all zero-shot responses returned correct emotion labels. For repeated-measurement results, automated scripts ensured that only responses containing 1-3 labels from the 16 options were accepted.
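A minimal sketch of the kind of automated check applied to repeated-measurement responses (accepting only 1-3 labels drawn from the 16 options) is shown below; the response format assumed by the parser is illustrative.

```python
EMOTION_LABELS = [
    "Amusement", "Anger", "Sadness", "Surprise", "Awe", "Concentration",
    "Confusion", "Contempt", "Contentment", "Desire", "Disappointment",
    "Doubt", "Elation", "Interest", "Pain", "Triumph",
]

def validate_ranked_labels(raw_response: str):
    """Accept a repeated-measurement response only if it lists 1-3 of the 16 labels.

    Assumes labels are separated by commas or newlines; returns the ranked
    labels (best match first) or None so the request can be reissued.
    """
    tokens = [t.strip() for t in raw_response.replace("\n", ",").split(",") if t.strip()]
    if not (1 <= len(tokens) <= 3) or any(t not in EMOTION_LABELS for t in tokens):
        return None
    return tokens
```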

2.4 Statistical Analysis Methods

2.4.1 Analysis Based on Emotion Label-Dyadic Conversation Probability Distribution Matrix

For MLLMs returning probability distributions across 16 emotion labels, we aggregated all dyadic conversation scenarios' emotion label probability distributions to generate emotion label-dyadic conversation probability distribution matrices for MLLM emotion recognition and inference performance (16×149, see Figure S5).

For human participants and MLLMs returning 1-3 ranked emotion labels, we considered both the number of selected labels and their ranking using the following weight accumulation rules: (1) If 1 label was selected, its weight +6; (2) If 2 labels were selected, the first label weight +4, second label weight +2; (3) If 3 labels were selected, the first label weight +3, second label weight +2, third label weight +1.

Based on these rules, we calculated the probability distribution of 16 emotion labels recognized or inferred by human participants and MLLMs in each dyadic conversation scenario (current scenario emotion label weight / total weight for that scenario), thereby obtaining the emotion label-dyadic conversation probability distribution matrices for all scenarios (16×149, see Figure 3 [FIGURE:3]A).
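The weighting rule and the construction of the resulting 16×149 probability matrix can be sketched as follows, reusing the EMOTION_LABELS list above; variable names are illustrative.

```python
import numpy as np

# Weight accumulation rules for 1-3 ranked labels (Section 2.4.1).
RANK_WEIGHTS = {1: [6], 2: [4, 2], 3: [3, 2, 1]}

def soft_label_matrix(selections_per_scene):
    """Build the 16 x n_scenes emotion label-dyadic conversation probability matrix.

    `selections_per_scene` is a list over scenarios; each element is a list of
    responses, and each response is a list of 1-3 ranked labels (best match first).
    """
    index = {label: i for i, label in enumerate(EMOTION_LABELS)}
    matrix = np.zeros((len(EMOTION_LABELS), len(selections_per_scene)))
    for s, responses in enumerate(selections_per_scene):
        for ranked in responses:
            for label, weight in zip(ranked, RANK_WEIGHTS[len(ranked)]):
                matrix[index[label], s] += weight
        total = matrix[:, s].sum()
        if total > 0:
            matrix[:, s] /= total  # scenario weights -> probability distribution
    return matrix
```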

We then performed Spearman correlation analysis between MLLM and human participant matrices to obtain Spearman correlation coefficients, enabling comparison of different MLLMs and modalities via Fisher's Z tests.
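The sketch below illustrates this analysis with SciPy's spearmanr and the standard Fisher r-to-z test for comparing two correlations; it is a simplified approximation of the reported procedure (the test treats the two correlations as independent), with sample sizes passed in as parameters.

```python
import numpy as np
from scipy import stats

def spearman_with_human(model_matrix, human_matrix):
    """Spearman correlation between the flattened model and human matrices."""
    rho, p = stats.spearmanr(model_matrix.ravel(), human_matrix.ravel())
    return rho, p

def compare_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two correlations."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    z = (z1 - z2) / np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    p = 2 * stats.norm.sf(abs(z))  # two-tailed p-value
    return z, p
```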

2.4.2 Analysis Based on Mean Emotion Label Probability Distribution

Each row of the emotion label-dyadic conversation probability distribution matrix represents the probability distribution of a specific emotion label across 149 dyadic conversation scenarios as recognized or inferred by human participants or MLLMs. The mean probability for each emotion label indicates the likelihood of the entire evaluation dataset being recognized or inferred as that label. Higher mean probability suggests a stronger tendency to identify or infer that emotion across all scenarios.

Independent samples t-tests compared MLLM and human participant mean probability distributions for the 16 emotion labels. Analyzing differences in emotion label preferences between MLLMs and humans can measure evaluation performance across different models and modalities.
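A sketch of this per-label comparison is given below; the specific multiple-comparison correction (Benjamini-Hochberg in the sketch) is an assumption, as the text states only that p-values were corrected.

```python
from scipy import stats
from statsmodels.stats.multitest import multipletests

def compare_label_means(model_matrix, human_matrix, method="fdr_bh"):
    """Independent-samples t-test per emotion label across the 149 scenarios.

    Each matrix row holds one emotion label's probabilities over scenarios.
    Returns raw p-values, corrected p-values, and rejection decisions.
    """
    raw_p = [stats.ttest_ind(m_row, h_row).pvalue
             for m_row, h_row in zip(model_matrix, human_matrix)]
    reject, corrected_p, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    return raw_p, corrected_p, reject
```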

3 Results

By analyzing questionnaire data collected using the human participant method (see 2.3.1), we calculated human participants' emotion label-dyadic conversation probability distribution matrices and mean probability distributions for both emotion recognition and inference, shown in Figures 3 and 4.

Figure 3 Human Participant Emotion Recognition: Probability Distribution Matrix and Mean Distribution

Figure 4 Human Participant Emotion Inference: Probability Distribution Matrix and Mean Distribution

To assess questionnaire reliability, we calculated internal consistency (Cronbach's α) of participant performance. Results showed high reliability for both emotion recognition and inference (α = 0.98 for both).
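For reference, Cronbach's alpha can be computed from a participants × items score matrix with the standard formula sketched below; this is a generic implementation, not the authors' exact scoring pipeline.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (participants x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    sum_item_var = scores.var(axis=0, ddof=1).sum()   # variance of each item, summed
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of participants' total scores
    return n_items / (n_items - 1) * (1 - sum_item_var / total_var)
```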

Meanwhile, we collected MLLM emotion recognition and inference results for GPT-4-vision image modality, GPT-4-vision text modality, GPT-4-vision dual modality, GPT-4-turbo dual modality, and Claude-3-haiku dual modality using the MLLM evaluation method (see 2.3.2), obtaining zero-shot emotion label-dyadic conversation probability distribution matrices (Figures S5-S14).

3.1 Spearman Correlation Analysis Between MLLM Zero-Shot Performance and Human Participants

First, we compared overall similarity between MLLM zero-shot emotion recognition/inference matrices and human participants using the method described in 2.4.1. Emotion recognition results are shown in Figure 5 [FIGURE:5]; emotion inference results in Figure 6 [FIGURE:6].

Comparing Spearman correlation coefficients between GPT-4-vision's three modalities (image/text/dual) and human participants revealed:

For basic emotion recognition, GPT-4-vision dual modality's correlation with humans (Spearman's rho: 0.48, 95% CI [0.41, 0.55], Fisher's Z = 0.52, p < 0.001) was significantly greater than image modality's correlation (Spearman's rho: 0.26, 95% CI [0.19, 0.34], Fisher's Z = 0.27, p < 0.001) (z = 4.32, p < 0.001). Text modality's correlation (Spearman's rho: 0.42, 95% CI [0.35, 0.49], Fisher's Z = 0.45, p < 0.001) was also significantly greater than image modality's (z = 3.04, p = 0.002). No significant difference existed between dual and text modalities (z = 1.28, p = 0.201).

For complex emotion recognition, dual modality's correlation (Spearman's rho: 0.48, 95% CI [0.45, 0.52], Fisher's Z = 0.53, p < 0.001) was significantly greater than image modality's (Spearman's rho: 0.35, 95% CI [0.30, 0.39], Fisher's Z = 0.36, p < 0.001) (z = 4.99, p < 0.001). Text modality's correlation (Spearman's rho: 0.41, 95% CI [0.37, 0.45], Fisher's Z = 0.44, p < 0.001) was also significantly greater than image modality's (z = 2.34, p = 0.019). Dual modality's correlation was significantly greater than text modality's (z = 2.64, p = 0.008).

For basic emotion inference, dual modality's correlation (Spearman's rho: 0.41, 95% CI [0.34, 0.48], Fisher's Z = 0.44, p < 0.001) was significantly greater than image modality's (Spearman's rho: 0.21, 95% CI [0.13, 0.28], Fisher's Z = 0.21, p < 0.001) (z = 3.98, p < 0.001). Text modality's correlation (Spearman's rho: 0.45, 95% CI [0.39, 0.52], Fisher's Z = 0.49, p < 0.001) was also significantly greater than image modality's (z = 4.80, p = 0.019). No significant difference existed between dual and text modalities (z = -0.82, p = 0.410).

For complex emotion inference, no significant differences were found between dual modality (Spearman's rho: 0.47, 95% CI [0.43, 0.50], Fisher's Z = 0.51, p < 0.001) and image modality (Spearman's rho: 0.42, 95% CI [0.38, 0.46], Fisher's Z = 0.45, p < 0.001) (z = 1.84, p = 0.066), between text modality (Spearman's rho: 0.43, 95% CI [0.39, 0.47], Fisher's Z = 0.46, p < 0.001) and image modality (z = 0.34, p = 0.737), or between dual and text modalities (z = 1.50, p = 0.133).

Comparison across three MLLMs (GPT-4-vision/GPT-4-turbo/Claude-3-haiku) showed:

For basic emotion recognition, GPT-4-vision dual modality's correlation was significantly greater than Claude-3-haiku dual modality's (Spearman's rho: 0.29, 95% CI [0.21, 0.37], Fisher's Z = 0.30, p < 0.001) (z = 3.82, p < 0.001). GPT-4-turbo dual modality's correlation (Spearman's rho: 0.44, 95% CI [0.36, 0.50], Fisher's Z = 0.47, p < 0.001) was also significantly greater than Claude-3-haiku's (z = 2.91, p = 0.004). No significant difference existed between GPT-4-vision and GPT-4-turbo dual modalities (z = 0.92, p = 0.360).

For complex emotion recognition, GPT-4-vision dual modality's correlation was significantly greater than Claude-3-haiku's (Spearman's rho: 0.23, 95% CI [0.18, 0.27], Fisher's Z = 0.23, p < 0.001) (z = 8.92, p < 0.001). GPT-4-turbo dual modality's correlation (Spearman's rho: 0.42, 95% CI [0.38, 0.46], Fisher's Z = 0.45, p < 0.001) was also significantly greater than Claude-3-haiku's (z = 6.63, p < 0.001). GPT-4-vision dual modality's correlation was significantly greater than GPT-4-turbo's (z = 2.30, p = 0.022).

For basic emotion inference, GPT-4-vision dual modality's correlation was significantly greater than Claude-3-haiku's (Spearman's rho: 0.12, 95% CI [0.05, 0.20], Fisher's Z = 0.12, p = 0.003) (z = 5.48, p < 0.001). GPT-4-turbo dual modality's correlation (Spearman's rho: 0.43, 95% CI [0.36, 0.49], Fisher's Z = 0.46, p < 0.001) was also significantly greater than Claude-3-haiku's (z = 5.78, p < 0.001). No significant difference existed between GPT-4-vision and GPT-4-turbo dual modalities (z = -0.30, p = 0.764).

For complex emotion inference, GPT-4-vision dual modality's correlation was significantly greater than Claude-3-haiku's (Spearman's rho: 0.29, 95% CI [0.24, 0.33], Fisher's Z = 0.30, p < 0.001) (z = 6.34, p < 0.001). GPT-4-turbo dual modality's correlation (Spearman's rho: 0.43, 95% CI [0.39, 0.47], Fisher's Z = 0.46, p < 0.001) was also significantly greater than Claude-3-haiku's (z = 4.84, p < 0.001). No significant difference existed between GPT-4-vision and GPT-4-turbo dual modalities (z = 1.50, p = 0.134).

Figure 5 Spearman Correlation Analysis and Comparison of MLLM Zero-Shot Emotion Recognition

Figure 6 Spearman Correlation Analysis and Comparison of MLLM Zero-Shot Emotion Inference

To test the stability of zero-shot correlation results, we selected GPT-4-vision dual modality and GPT-4-turbo dual modality—which showed significant differences only in complex emotion recognition—and repeated the evaluation 25 times (see 2.3.2). As shown in Figures S15 and S16, repeated-measurement correlation results remained consistent with zero-shot results.

3.2 Independent Samples t-Tests Between MLLM Zero-Shot Performance and Human Participants

We further compared differences in mean probability distributions for recognizing and inferring 4 basic emotions and 12 complex emotions between MLLMs and human participants using the method in 2.4.2. Emotion recognition results are shown in Figure 7; emotion inference results in Figure 8. Emotion labels without significant differences from human performance are boxed (all p-values corrected for multiple comparisons).

GPT-4-vision image modality showed no significant differences from humans in recognizing 2 basic emotions (Sadness/Surprise) and 4 complex emotions (Desire/Disappointment/Interest/Pain), and in inferring 1 basic emotion (Sadness) and 4 complex emotions (Disappointment/Elation/Pain/Triumph) (see Tables S7, S8).

GPT-4-vision text modality showed no significant differences from humans in recognizing all 4 basic emotions (Amusement/Anger/Sadness/Surprise) and 6 complex emotions (Contempt/Desire/Disappointment/Elation/Interest/Pain), and in inferring 2 basic emotions (Amusement/Anger) and 6 complex emotions (Contempt/Contentment/Disappointment/Elation/Interest/Triumph) (see Tables S9, S10).

GPT-4-vision dual modality showed no significant differences from humans in recognizing 2 basic emotions (Amusement/Surprise) and 7 complex emotions (Concentration/Contempt/Desire/Disappointment/Elation/Interest/Pain), and in inferring 2 basic emotions (Anger/Sadness) and 7 complex emotions (Concentration/Contempt/Disappointment/Elation/Interest/Pain/Triumph) (see Tables S3, S4).

GPT-4-turbo dual modality showed no significant differences from humans in recognizing all 4 basic emotions (Amusement/Anger/Sadness/Surprise) and 7 complex emotions (Concentration/Contempt/Desire/Disappointment/Elation/Interest/Pain), and in inferring 2 basic emotions (Anger/Sadness) and 6 complex emotions (Contempt/Contentment/Disappointment/Interest/Pain/Triumph) (see Tables S5, S6).

Claude-3-haiku dual modality showed no significant differences from humans in recognizing 0 basic emotions and 4 complex emotions (Concentration/Contentment/Desire/Doubt), and in inferring 0 basic emotions and 3 complex emotions (Concentration/Disappointment/Doubt) (see Tables S1, S2).

Figure 7 Independent Samples t-Tests for MLLM Zero-Shot Emotion Recognition

Figure 8 Independent Samples t-Tests for MLLM Zero-Shot Emotion Inference

Integrating overall similarity and mean probability distribution consistency between MLLM and human performance revealed:

All MLLMs showed moderate or weaker correlations with human participants in emotion recognition and inference, with over half of the emotion labels showing different probability distributions from humans.

Across modalities, GPT-4-vision dual modality outperformed image modality but underperformed text modality in basic emotion recognition; outperformed both image and text modalities in complex emotion recognition; and outperformed image modality while showing no difference from text modality in both basic and complex emotion inference.

Comparing different MLLMs, GPT-4-vision dual modality outperformed Claude-3-haiku dual modality in emotion recognition and inference. Comparing different training data scales, GPT-4-vision dual modality underperformed GPT-4-turbo dual modality in basic emotion recognition but outperformed it in complex emotion recognition, while showing no difference in basic and complex emotion inference.

4 Discussion

This study used dyadic conversation clips from Chinese-language films to evaluate GPT-4-vision image modality, GPT-4-vision text modality, GPT-4-vision dual modality, GPT-4-turbo dual modality, Claude-3-haiku dual modality, and human participants. Participants adopted character perspectives for emotion recognition or inference. We first compared different MLLMs with humans in dynamic social interaction scenarios to explore whether MLLMs possess human-like emotion recognition and inference capabilities. Then, using human soft labels in dynamic social interactions as a benchmark, we compared GPT-4-vision across modalities to examine the respective roles of visual conversational features and conversational content in emotional expression.

Comparing different MLLMs revealed that GPT-4-vision, GPT-4-turbo, and Claude-3-haiku using both visual features and conversational content showed moderate or weaker similarity with human participants in dynamic social interaction scenarios. This indicates that MLLMs have begun to demonstrate preliminary human-like emotion recognition and inference capabilities. This performance primarily stems from MLLMs' ability to comprehensively process multiple sensory channels and understand contexts. MLLMs can simultaneously consider emotional vocabulary in text, facial expressions and body language in images, and other information sources to achieve more comprehensive emotion understanding and recognition. Previous research has shown that emotional vocabulary, facial expressions, and body language play important roles in emotion recognition (Ekman & Friesen, 1978; Mehrabian, 2017). Beyond integrating multiple information types, MLLMs combine contextual understanding of these information sources to comprehend the background and motivation of emotional expressions, enabling more accurate emotion recognition and inference in complex social interactions (Lazarus, 1991; Strack & Deutsch, 2004).

Furthermore, comparative analysis of GPT-4-vision using visual features only, conversational content only, and both combined in dynamic social interaction scenarios demonstrated that linguistic information plays a crucial role in emotional expression. Ekman (1992) noted that facial expressions and body language can reflect preliminary emotional signals, but these nonverbal cues are often ambiguous and uncertain without conversational content support: a smiling face might convey pleasant feelings, but if the smile occurs in a sarcastic context, the expressed emotion could be completely different. Linguistic information helps clarify the true intention of emotional expression by providing emotional sources, event descriptions, and verbal tones, offering necessary context that makes emotional information interpretation more explicit and precise. Therefore, understanding emotional expression depends not only on the emotion itself but also on the nature and context of linguistic content.

Moreover, the different roles of visual conversational features in emotion recognition versus inference revealed distinct conclusions. On one hand, visual features interfered with basic emotion recognition but facilitated complex emotion recognition, suggesting that complex emotional expression relies more on visual features than basic emotional expression does. Due to their directness and universality, basic emotions can rely on intuitive vocabulary and sentence structures without extensive cognitive processing—expressions like "I am happy" or "I am angry" directly reflect emotional states and are easily understood by others (Lindquist et al., 2006). When visual features and linguistic information express inconsistent basic emotions, cognitive conflict typically arises, affecting basic emotion interpretation. Complex emotions, due to their complexity and diversity, often cannot be accurately conveyed by vocabulary and phrases alone. Besides requiring more refined language, they depend more on the emotional meanings carried by visual features for accurate expression (Russell, 2003). On the other hand, visual features showed limited impact on both basic and complex emotion inference, suggesting that linguistic information plays a stronger role in regulating others' emotions. Clear linguistic expression can both help others accurately understand emotional content and intentions (Ekman & Friesen, 2003) and provide new perspectives or frameworks for interpreting emotion-eliciting events (Gross, 2015). For example, when a friend feels frustrated over exam failure, a hug might be interpreted as friendly comfort or as a perfunctory gesture and may not trigger positive cognitive transformation. In contrast, statements like "I understand how you feel" or "This is just a small setback on your learning path; you'll do better next time" can both explicitly express support and empathy and guide the friend to re-examine the event's importance, reducing negative emotional impact.

Additionally, comparing GPT-4-vision with Claude-3-haiku (different technical principles) and GPT-4-turbo (different training data scale) in dynamic social interaction scenarios revealed that technical principle innovations can enhance MLLMs' emotion recognition and inference performance. For instance, the Transformer framework with self-attention mechanisms can effectively capture dependencies between different positions in input sequences, enabling models to attend to distant relevant information rather than just neighboring information, thus more effectively capturing and processing complex emotional signals and contextual information (Vaswani et al., 2017). However, training dataset scale expansion only improved basic emotion recognition performance, showing no impact on basic and complex emotion inference, and even reducing complex emotion recognition performance. Introducing more diverse and larger-scale data allows models to encounter more varied emotional expression patterns, learning richer emotional features and modes for accurate basic emotion recognition across broader contexts (Goodfellow et al., 2016; Poria et al., 2017). Complex emotions involve mixing multiple basic emotions and sophisticated contextual understanding, with diverse expressions that are difficult to standardize (Plutchik, 1980; Barrett, 2006). Although dataset expansion provides more emotional samples, without deep modeling of contexts and complex emotional relationships, models still struggle to accurately understand complex emotions (Barrett et al., 2011; Kosti et al., 2017). Compared to emotion recognition, emotion inference involves more complex cognitive and affective processes, including empathy mechanisms that require combining training datasets, pretrained models, contextual modeling, perspective-taking, reinforcement learning, and other techniques (Su et al., 2016; Ghosal et al., 2019). Merely updating training datasets has limited effect on improving emotion inference.

To better understand human intelligence, future psychological and cognitive neuroscience research should increasingly integrate MLLMs. MLLMs' ability to process and integrate multimodal data like images and text provides a more precise and comprehensive perspective for psychological research. Psychological research, particularly in cognition, emotion, social interaction, and individual differences, faces challenges of data diversity and complexity. MLLMs can effectively fuse and process these diverse information types, revealing multidimensional features of human psychological processes. Moreover, using MLLMs enables researchers to deeply analyze interactions between different modalities, revealing how the brain integrates multimodal information and providing a new research framework for investigating emotion-cognition interactions. Correspondingly, psychological research can provide theoretical support for model development in cognitive capabilities, emotional intelligence, social interaction, and personalized services, while cognitive neuroscience can offer important guidance on multimodal information integration, attention mechanisms, learning and memory, and decision-making mechanisms. The synergistic development of psychology, cognitive neuroscience, and artificial intelligence will not only advance AI intelligence but also provide powerful tools for understanding human behavior and brain mechanisms.

This study has several limitations. First, in the test dataset, uniformly sampled frames with equal temporal intervals sometimes included not only the current speaker's content and images but also the next speaker's, potentially affecting emotion recognition and inference. Second, although human participants selected a character perspective, they had to immediately switch to the other perspective after completing each conversation segment. Knowing the content and plot in advance, and potential difficulty adapting to sudden perspective shifts, may have caused emotional judgment biases. Third, although GPT-4 and Claude-3 differ in technical principles and training datasets, both rely heavily on massive internet public text data with highly overlapping corpora, making it difficult to attribute performance differences solely to technical principles. Finally, during zero-shot evaluation, we did not strictly screen out responses where models failed to adopt the specified perspective when providing emotion label probability distributions or rankings.

In summary, this study constructed a dynamic social interaction scenario evaluation dataset integrating visual conversational features and conversational content using dyadic conversation clips from Chinese-language films. By comparing emotion recognition and inference performance between two mainstream MLLMs and human participants, we found that MLLMs have begun to demonstrate human-like emotion recognition and inference capabilities and revealed the roles of visual features and conversational content in emotional expression. Moreover, technical principle innovation is more critical than training data scale expansion for enhancing MLLMs' emotional capabilities in dynamic social interaction scenarios. Future psychological and cognitive neuroscience research should increasingly integrate MLLMs, providing strong support for deeply analyzing human behavior and brain mechanisms while further advancing artificial intelligence development.

References

Barrett, L. F. (2006). Are emotions natural kinds?. Perspectives on Psychological Science, 1(1), 28−58.

Barrett, L. F., Mesquita, B., & Gendron, M. (2011). Context in emotion perception. Current Directions in Psychological Science, 20(5), 286−290.

Buck, R. (1985). Prime theory: An integrated view of motivation and emotion. Psychological Review, 92(3), 389.

Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A., Mower, E., Kim, S., ... & Narayanan, S. S. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42, 335−359.

Clark, H. H., & Schaefer, E. F. (1989). Contributing to discourse. Cognitive Science, 13(2), 259−294.

Cordaro, D. T., Sun, R., Kamble, S., Hodder, N., Monroy, M., Cowen, A., ... & Keltner, D. (2020). The recognition of 18 facial-bodily expressions across nine cultures. Emotion, 20(7), 1292.

Cordaro, D. T., Sun, R., Keltner, D., Kamble, S., Huddar, N., & McNeil, G. (2018). Universals and cultural variations in 22 emotional expressions across five cultures. Emotion, 18(1), 75.

Cowen, A. S., & Keltner, D. (2020). What the face displays: Mapping 28 emotions conveyed by naturalistic expression. American Psychologist, 75(3), 349.

Cowen, A. S., Elfenbein, H. A., Laukka, P., & Keltner, D. (2019). Mapping 24 emotions conveyed by brief human vocalization. American Psychologist, 74(6), 698.

Cowen, A. S., Keltner, D., Schroff, F., Jou, B., Adam, H., & Prasad, G. (2021). Sixteen facial expressions occur in similar contexts worldwide. Nature, 589(7841), 251−257.

Cowen, A. S., Laukka, P., Elfenbein, H. A., Liu, R., & Keltner, D. (2019). The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nature Human Behaviour, 3(4), 369−382.

De Gelder, B., & Vroomen, J. (2000). The perception of emotions by ear and by eye. Cognition & Emotion, 14(3), 289−311.

Ekman, P. (1992). An argument for basic emotions. Cognition & Emotion, 6(3−4), 169−200.

Ekman, P. (1993). Facial expression and emotion. American Psychologist, 48(4), 384.

Ekman, P., & Friesen, W. V. (1978). Facial Action Coding System (FACS) [Database record]. APA PsycTests.

Ekman, P., & Friesen, W. V. (2003). Unmasking the face: A guide to recognizing emotions from facial clues (Vol. 10). Ishk.

Fayek, H. M., Lech, M., & Cavedon, L. (2016, July). Modeling subjectiveness in emotion recognition with deep neural networks: Ensembles vs soft labels. In 2016 International Joint Conference on Neural Networks (IJCNN) (pp. 566−570). IEEE.

Ghosal, D., Majumder, N., Poria, S., Chhaya, N., & Gelbukh, A. (2019). Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation. arXiv preprint arXiv:1908.11540.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.

Gross, J. J. (2015). Emotion regulation: Current status and future prospects. Psychological Inquiry, 26(1), 1−26.

Kosti, R., Alvarez, J. M., Recasens, A., & Lapedriza, A. (2017). Emotion recognition in context. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1667−1675).

Lazarus, R. S. (1991). Emotion and adaptation. Oxford University Press.

Li, S., & Deng, W. (2020). Deep facial expression recognition: A survey. IEEE transactions on affective computing, 13(3), 1195−1215.

Lian, Z., Sun, L., Sun, H., Chen, K., Wen, Z., Gu, H., ... & Tao, J. (2024). GPT-4V with emotion: A zero-shot benchmark for Generalized Emotion Recognition. Information Fusion, 108, 102367.

Lindquist, K. A., Barrett, L. F., Bliss-Moreau, E., & Russell, J. A. (2006). Language and the perception of emotion. Emotion, 6(1), 125.

Matsumoto, D., Yoo, S. H., & Nakagawa, S. (2008). Culture, emotion regulation, and adjustment. Journal of Personality and Social Psychology, 94(6), 925.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746−748.

Mehrabian, A. (2017). Nonverbal communication. Routledge.

Plutchik, R. (1980). A general psychoevolutionary theory of emotion. In Theories of emotion (pp. 3−33). Academic press.

Pollmann, M. M., & Finkenauer, C. (2009). Empathic forecasting: How do we predict other people's feelings?. Cognition and Emotion, 23(5), 978−1001.

Poria, S., Cambria, E., Bajpai, R., & Hussain, A. (2017). A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 37, 98−125.

Russell, J. A. (2003). Core affect and the psychological construction of emotion. Psychological Review, 110(1), 145.

Schilbach, L., Timmermans, B., Reddy, V., Costall, A., Bente, G., Schlicht, T., & Vogeley, K. (2013). Toward a second-person neuroscience. Behavioral and Brain Sciences, 36(4), 393−414.

Sridhar, K., Lin, W. C., & Busso, C. (2021, September). Generative approach using soft-labels to learn uncertainty in predicting emotional attributes. In 2021 9th International Conference on Affective Computing and Intelligent Interaction (ACII) (pp. 1−8). IEEE.

Strack, F., & Deutsch, R. (2004). Reflective and impulsive determinants of social behavior. Personality and Social Psychology Review, 8(3), 220−247.

Su, P. H., Gasic, M., Mrksic, N., Rojas-Barahona, L., Ultes, S., Vandyke, D., ... & Young, S. (2016). On-line active reward learning for policy optimisation in spoken dialogue systems. arXiv preprint arXiv:1605.07669.

Van Kleef, G. A., & Côté, S. (2022). The Social Effects of Emotions. Annual review of psychology, 73, 629–658.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Vinciarelli, A., Pantic, M., & Bourlard, H. (2009). Social signal processing: Survey of an emerging domain. Image and Vision Computing, 27(12), 1743−1759.

Wang, W., Zheng, V. W., Yu, H., & Miao, C. (2019). A survey of zero-shot learning: Settings, methods, and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2), 1−37.

Zhang, D., Yu, Y., Dong, J., Li, C., Su, D., Chu, C., & Yu, D. (2024). Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601.

Zhao, S., Yao, X., Yang, J., Jia, G., Ding, G., Chua, T. S., ... & Keutzer, K. (2021). Affective image content analysis: Two decades review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10), 6729−6751.

Table S1 Claude-3-haiku Zero-Shot vs. Human Participants: Independent Samples t-Test Results for Emotion Recognition

Emotion Label | Cohen's d | 95% CI
Amusement | 0.07±0.08 | [0.03, 0.08]
Anger | 0.13±0.15 | [0.01, 0.09]
Sadness | 0.11±0.17 | [-0.05, -0.02]
Surprise | 0.16±0.17 | [-0.06, -0.03]
Awe | 0.05±0.07 | [-0.04, -0.02]
Concentration | 0.01±0.03 | [-0.05, -0.03]
Confusion | 0.05±0.09 | [-0.04, -0.01]
Contempt | 0.00±0.02 | [-0.03, 0.00]
Contentment | 0.06±0.07 | [0.00, 0.05]
Desire | 0.12±0.12 | [-0.02, 0.02]
Disappointment | 0.13±0.12 | [-0.02, 0.00]
Doubt | 0.09±0.11 | [-0.01, 0.03]
Elation | 0.19±0.12 | [-0.02, 0.01]
Interest | 0.09±0.11 | [-0.03, 0.00]
Pain | 0.04±0.07 | [-0.01, 0.00]
Triumph | 0.07±0.08 | [-0.02, 0.00]

Note: All p-values corrected for multiple comparisons; same below

Table S2 Claude-3-haiku Zero-Shot vs. Human Participants: Independent Samples t-Test Results for Emotion Inference

Emotion Label | Cohen's d | 95% CI
Amusement | 0.07±0.07 | [0.01, 0.06]
Anger | 0.10±0.13 | [0.05, 0.11]
Sadness | 0.09±0.13 | [-0.05, -0.02]
Surprise | 0.17±0.15 | [-0.14, -0.11]
Awe | 0.05±0.07 | [-0.02, 0.04]
Concentration | 0.02±0.04 | [0.02, 0.04]
Confusion | 0.13±0.10 | [-0.03, -0.01]
Contempt | 0.01±0.03 | [0.00, 0.03]
Contentment | 0.05±0.06 | [0.02, 0.04]
Desire | 0.10±0.09 | [-0.03, 0.00]
Disappointment | 0.13±0.11 | [0.05, 0.08]
Doubt | 0.09±0.06 | [-0.02, 0.01]
Elation | 0.20±0.11 | [-0.03, 0.00]
Interest | 0.06±0.07 | [-0.03, 0.00]
Pain | 0.04±0.06 | [-0.01, 0.01]
Triumph | 0.03±0.05 | [-0.02, 0.00]

Table S3 GPT-4-vision Zero-Shot vs. Human Participants: Independent Samples t-Test Results for Emotion Recognition

Emotion Label | Cohen's d | 95% CI
Amusement | 0.07±0.08 | [-0.03, 0.01]
Anger | 0.06±0.08 | [-0.07, -0.01]
Sadness | 0.11±0.17 | [-0.03, 0.00]
Surprise | 0.07±0.11 | [-0.03, 0.01]
Awe | 0.05±0.07 | [0.01, 0.02]
Concentration | 0.03±0.05 | [-0.08, -0.03]
Confusion | 0.05±0.09 | [0.01, 0.06]
Contempt | 0.04±0.04 | [-0.04, 0.00]
Contentment | 0.01±0.02 | [-0.03, 0.01]
Desire | 0.03±0.03 | [0.03, 0.06]
Disappointment | 0.12±0.12 | [-0.04, 0.01]
Doubt | 0.09±0.11 | [0.02, 0.07]
Elation | 0.09±0.11 | [-0.04, 0.01]
Interest | 0.03±0.05 | [0.03, 0.06]
Pain | 0.07±0.08 | [-0.02, 0.01]
Triumph | 0.06±0.07 | [-0.03, 0.00]

Table S4 GPT-4-vision Zero-Shot vs. Human Participants: Independent Samples t-Test Results for Emotion Inference

Emotion Label | Cohen's d | 95% CI
Amusement | 0.07±0.07 | [-0.04, -0.02]
Anger | 0.05±0.06 | [-0.07, -0.02]
Sadness | 0.09±0.13 | [-0.01, 0.01]
Surprise | 0.06±0.09 | [-0.11, -0.07]
Awe | 0.05±0.07 | [0.01, 0.02]
Concentration | 0.04±0.04 | [0.02, 0.06]
Confusion | 0.13±0.10 | [-0.03, 0.01]
Contempt | 0.05±0.05 | [0.04, 0.07]
Contentment | 0.02±0.03 | [-0.03, -0.01]
Desire | 0.10±0.09 | [0.00, 0.03]
Disappointment | 0.09±0.06 | [0.05, 0.08]
Doubt | 0.06±0.07 | [-0.02, 0.01]
Elation | 0.04±0.06 | [0.01, 0.04]
Interest | 0.02±0.02 | [0.02, 0.04]
Pain | 0.08±0.10 | [-0.02, 0.02]
Triumph | 0.07±0.05 | [-0.03, 0.00]

Table S5 GPT-4-turbo Zero-Shot vs. Human Participants: Independent Samples t-Test Results for Emotion Recognition

Emotion Label | Cohen's d | 95% CI
Amusement | 0.07±0.08 | [-0.04, 0.00]
Anger | 0.05±0.08 | [-0.06, 0.01]
Sadness | 0.11±0.17 | [-0.03, 0.00]
Surprise | 0.08±0.14 | [-0.02, 0.02]
Awe | 0.05±0.07 | [0.02, 0.04]
Concentration | 0.03±0.05 | [-0.05, 0.00]
Confusion | 0.05±0.09 | [0.03, 0.08]
Contempt | 0.05±0.07 | [-0.04, 0.01]
Contentment | 0.01±0.02 | [0.01, 0.04]
Desire | 0.12±0.12 | [-0.05, 0.00]
Disappointment | 0.09±0.11 | [0.03, 0.07]
Doubt | 0.09±0.11 | [-0.03, 0.01]
Elation | 0.07±0.09 | [-0.01, 0.03]
Interest | 0.03±0.05 | [-0.03, 0.00]
Pain | 0.06±0.08 | [-0.03, 0.00]
Triumph | 0.11±0.10 | [-0.04, -0.02]

Table S6 GPT-4-turbo Zero-Shot vs. Human Participants: Independent Samples t-Test Results for Emotion Inference

Emotion Label | Cohen's d | 95% CI
Amusement | 0.07±0.07 | [-0.04, -0.01]
Anger | 0.04±0.05 | [-0.04, 0.01]
Sadness | 0.09±0.13 | [-0.02, 0.01]
Surprise | 0.07±0.09 | [-0.09, -0.06]
Awe | 0.05±0.07 | [0.01, 0.02]
Concentration | 0.05±0.05 | [-0.04, 0.00]
Confusion | 0.13±0.10 | [0.05, 0.08]
Contempt | 0.06±0.06 | [-0.01, 0.02]
Contentment | 0.02±0.03 | [0.00, 0.03]
Desire | 0.07±0.06 | [0.04, 0.07]
Disappointment | 0.07±0.05 | [-0.02, 0.00]
Doubt | 0.12±0.07 | [-0.03, 0.00]
Elation | 0.03±0.05 | [-0.03, 0.00]
Interest | 0.02±0.04 | [-0.01, 0.01]
Pain | 0.09±0.08 | [-0.01, 0.00]
Triumph | 0.07±0.07 | [-0.01, 0.00]

Table S7 GPT-4-vision-image Zero-Shot vs. Human Participants: Independent Samples t-Test Results for Emotion Recognition

Emotion Label | Cohen's d | 95% CI
Amusement | 0.07±0.08 | [-0.05, -0.01]
Anger | 0.04±0.07 | [-0.10, -0.04]
Sadness | 0.11±0.17 | [-0.03, 0.00]
Surprise | 0.04±0.07 | [-0.03, 0.01]
Awe | 0.05±0.07 | [-0.03, 0.00]
Concentration | 0.04±0.06 | [0.01, 0.02]
Confusion | 0.05±0.09 | [-0.03, 0.01]
Contempt | 0.03±0.05 | [0.02, 0.08]
Contentment | 0.01±0.02 | [0.02, 0.07]
Desire | 0.12±0.12 | [-0.06, -0.02]
Disappointment | 0.09±0.11 | [0.03, 0.06]
Doubt | 0.09±0.11 | [-0.04, 0.01]
Elation | 0.07±0.08 | [0.03, 0.07]
Interest | 0.06±0.07 | [-0.05, 0.00]
Pain | 0.06±0.07 | [-0.02, 0.02]
Triumph | 0.11±0.08 | [-0.02, 0.00]

Table S8 GPT-4-vision-image Zero-Shot vs. Human Participants: Independent Samples t-Test Results for Emotion Inference

Emotion Label | Cohen's d | 95% CI
Amusement | 0.07±0.07 | [-0.04, -0.02]
Anger | 0.03±0.05 | [-0.07, -0.02]
Sadness | 0.09±0.13 | [-0.01, 0.01]
Surprise | 0.04±0.05 | [-0.11, -0.07]
Awe | 0.05±0.07 | [0.01, 0.02]
Concentration | 0.05±0.05 | [0.02, 0.06]
Confusion | 0.13±0.10 | [-0.03, 0.01]
Contempt | 0.04±0.04 | [0.04, 0.07]
Contentment | 0.02±0.03 | [-0.03, -0.01]
Desire | 0.10±0.09 | [0.00, 0.03]
Disappointment | 0.09±0.06 | [0.05, 0.08]
Doubt | 0.06±0.07 | [-0.02, 0.01]
Elation | 0.04±0.06 | [0.02, 0.04]
Interest | 0.02±0.02 | [-0.03, 0.01]
Pain | 0.08±0.10 | [-0.02, 0.00]
Triumph | 0.07±0.05 | [-0.01, 0.00]

Table S9 GPT-4-vision-text Zero-Shot vs. Human Participants: Independent Samples t-Test Results for Emotion Recognition

Emotion Label | Cohen's d | 95% CI
Amusement | 0.07±0.08 | [-0.03, 0.01]
Anger | 0.06±0.08 | [-0.06, 0.01]
Sadness | 0.11±0.17 | [-0.03, 0.00]
Surprise | 0.08±0.12 | [-0.01, 0.03]
Awe | 0.05±0.07 | [0.02, 0.03]
Concentration | 0.03±0.05 | [-0.08, -0.03]
Confusion | 0.05±0.09 | [0.01, 0.06]
Contempt | 0.06±0.08 | [-0.04, 0.00]
Contentment | 0.01±0.02 | [-0.03, 0.01]
Desire | 0.12±0.12 | [-0.08, -0.03]
Disappointment | 0.06±0.08 | [0.01, 0.06]
Doubt | 0.09±0.11 | [-0.04, 0.01]
Elation | 0.07±0.07 | [0.03, 0.07]
Interest | 0.03±0.05 | [-0.03, 0.01]
Pain | 0.06±0.07 | [-0.02, 0.02]
Triumph | 0.11±0.07 | [-0.03, -0.01]

Table S10 GPT-4-vision-text Zero-Shot vs. Human Participants: Independent Samples t-Test Results for Emotion Inference

Emotion Label | Cohen's d | 95% CI
Amusement | 0.07±0.07 | [-0.02, 0.01]
Anger | 0.06±0.07 | [-0.04, 0.01]
Sadness | 0.09±0.13 | [-0.03, -0.01]
Surprise | 0.07±0.10 | [-0.10, -0.06]
Awe | 0.05±0.07 | [0.02, 0.03]
Concentration | 0.03±0.04 | [-0.06, -0.03]
Confusion | 0.13±0.10 | [0.02, 0.06]
Contempt | 0.05±0.06 | [-0.01, 0.02]
Contentment | 0.02±0.03 | [0.00, 0.03]
Desire | 0.09±0.06 | [0.05, 0.08]
Disappointment | 0.07±0.05 | [-0.03, 0.00]
Doubt | 0.12±0.07 | [-0.03, 0.00]
Elation | 0.03±0.05 | [-0.03, 0.00]
Interest | 0.02±0.04 | [-0.01, 0.01]
Pain | 0.09±0.08 | [-0.02, 0.00]
Triumph | 0.07±0.07 | [-0.01, 0.00]

Figure S1 Example of MLLM Zero-Shot Emotion Recognition Prompt

Figure S2 Example of MLLM Zero-Shot Emotion Inference Prompt

Figure S3 Example of MLLM Repeated-Measurement Emotion Recognition Prompt

Figure S4 Example of MLLM Repeated-Measurement Emotion Inference Prompt

Figure S5 Claude-3-haiku Zero-Shot Emotion Recognition: Emotion Label-Dyadic Conversation Probability Distribution Matrix

Figure S6 Claude-3-haiku Zero-Shot Emotion Inference: Emotion Label-Dyadic Conversation Probability Distribution Matrix

Figure S7 GPT-4-vision Zero-Shot Emotion Recognition: Emotion Label-Dyadic Conversation Probability Distribution Matrix

Figure S8 GPT-4-vision Zero-Shot Emotion Inference: Emotion Label-Dyadic Conversation Probability Distribution Matrix

Figure S9 GPT-4-turbo Zero-Shot Emotion Recognition: Emotion Label-Dyadic Conversation Probability Distribution Matrix

Figure S10 GPT-4-turbo Zero-Shot Emotion Inference: Emotion Label-Dyadic Conversation Probability Distribution Matrix

Figure S11 GPT-4-vision-image Zero-Shot Emotion Recognition: Emotion Label-Dyadic Conversation Probability Distribution Matrix

Figure S12 GPT-4-vision-image Zero-Shot Emotion Inference: Emotion Label-Dyadic Conversation Probability Distribution Matrix

Figure S13 GPT-4-vision-text Zero-Shot Emotion Recognition: Emotion Label-Dyadic Conversation Probability Distribution Matrix

Figure S14 GPT-4-vision-text Zero-Shot Emotion Inference: Emotion Label-Dyadic Conversation Probability Distribution Matrix

Figure S15 Repeated-Measurement Spearman Correlation Analysis and Comparison for MLLM Emotion Recognition

Figure S16 Repeated-Measurement Spearman Correlation Analysis and Comparison for MLLM Emotion Inference
