Generative Large Language Models Empowering Psychometrics: Advantages, Challenges, and Applications
Xuetao Tian, Wenjie Zhou, Luo Fang, Zhihong Qiao, Yi Feng
Submitted 2025-10-22 | ChinaXiv: chinaxiv-202510.00113 | Mixed source text

Abstract

Generative Large Language Models (LLMs) are artificial intelligence models pre-trained on large-scale corpora, bringing unprecedented opportunities and challenges to the field of psychometrics. By reviewing the developmental trajectory of interdisciplinary research between artificial intelligence and psychology, this paper summarizes the significant advantages of LLMs in empowering psychometrics, identifies the critical challenges of LLMs in psychological applications, and proposes future research directions for psychometrics based on LLMs.

Specifically, LLMs can generate coherent natural language text based on context, and thus have the potential to transform traditional test interaction methods. LLMs overcome earlier limits on processing ultra-long text and multimodal data, and their powerful content understanding capabilities enable the comprehensive acquisition and analysis of examinees' psychological information. Furthermore, LLMs facilitate real-time analysis and personalized feedback, promoting a shift from outcome evaluation to process evaluation. Although the practical application of LLMs faces challenges such as stability, creativity, and scalability, they demonstrate broad application prospects and research value in fields such as situational judgment test generation, collaborative problem-solving ability assessment, intelligent diagnosis and treatment of mental health, and item quality analysis.

Full Text

Preamble

Generative Large Language Models Empowering Psychometrics: Advantages, Challenges, and Applications

Affiliations:
Faculty of Psychology, Beijing Normal University, Beijing;
School of Education, University of California, Berkeley;
Psychological Counseling Center, Central University of Finance and Economics, Beijing.

Abstract

The rapid development of Generative Large Language Models (GLLMs) is driving a profound transformation across various scientific disciplines. In the field of psychometrics, GLLMs offer innovative solutions for traditional challenges while introducing new methodological paradigms. This paper systematically explores how GLLMs empower psychometrics, focusing on three core dimensions: theoretical advantages, practical challenges, and diverse application scenarios. We discuss how these models enhance the efficiency of item generation, improve the accuracy of automated scoring, and enable the simulation of complex human behaviors. Simultaneously, we address critical concerns regarding validity, reliability, and ethical considerations in the era of AI-driven assessment. Finally, we outline future research directions for integrating GLLMs into psychometric frameworks to foster more robust and adaptive psychological measurement.

Introduction

Psychometrics, as the science of psychological measurement, has long relied on rigorous statistical methods and standardized instruments to quantify latent human traits. From Classical Test Theory (CTT) to Item Response Theory (IRT), the field has evolved to provide increasingly precise estimates of ability, personality, and mental health. However, traditional psychometric methods often face bottlenecks, such as the high cost of manual item development, the limitations of static assessments, and the difficulty of analyzing unstructured qualitative data at scale.

The emergence of Generative Large Language Models (GLLMs), such as the GPT series, has introduced a powerful new toolkit for addressing these limitations. By leveraging vast amounts of data and sophisticated neural architectures, GLLMs demonstrate remarkable capabilities in natural language understanding, reasoning, and content generation. This paper examines the intersection of GLLMs and psychometrics, evaluating how these technologies can be harnessed to advance the science of mental measurement.

Advantages of GLLMs in Psychometrics

Automated Item Generation and Optimization

One of the most immediate benefits of GLLMs is Automated Item Generation (AIG). Traditionally, creating high-quality test items requires significant subject-matter expertise and extensive pilot testing. GLLMs can generate a large volume of items based on specific constructs, difficulty levels, and cognitive demands. By providing the model with detailed prompts or "blueprints," researchers can produce diverse item banks that maintain content validity while reducing the time and labor costs associated with traditional item development.
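To make the blueprint idea concrete, here is a minimal Python sketch of turning a content blueprint into a generation prompt. The blueprint fields, the prompt wording, and the conscientiousness example are illustrative assumptions, not a prescribed protocol.

```python
# A minimal sketch of blueprint-driven item generation. All fields and
# wording below are hypothetical; the provider call is left abstract.

BLUEPRINT = {
    "construct": "conscientiousness",
    "facet": "orderliness",
    "format": "5-point Likert statement",
    "difficulty": "moderate",
    "n_items": 5,
}

def build_item_prompt(bp: dict) -> str:
    """Turn a content blueprint into a generation prompt for an LLM."""
    return (
        f"You are a psychometrician. Write {bp['n_items']} distinct "
        f"{bp['format']} items measuring the {bp['facet']} facet of "
        f"{bp['construct']}. Target {bp['difficulty']} difficulty, avoid "
        "double-barreled wording, and keep each item under 15 words."
    )

if __name__ == "__main__":
    print(build_item_prompt(BLUEPRINT))
    # The prompt would then be sent to any chat-completion endpoint;
    # the call is omitted to keep the sketch self-contained.
```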



Generative Large Language Models in Psychometrics: Artificial Intelligence and Automated Assessment in Interactive Testing

Abstract

The rapid advancement of Generative Large Language Models (LLMs) is catalyzing a paradigm shift in the field of psychometrics. This paper explores the integration of artificial intelligence into automated assessment frameworks, specifically focusing on the transition from static evaluations to dynamic, interactive testing environments. By leveraging the natural language processing capabilities of LLMs, psychometricians can now develop sophisticated scoring algorithms and item generation techniques that maintain high levels of validity and reliability while significantly reducing human labor. We discuss the theoretical foundations of AI-driven assessment, the technical implementation of interactive testing interfaces, and the ethical considerations surrounding algorithmic bias and data privacy in psychological measurement.

Introduction

Psychometrics, the science of measuring mental capacities and processes, has traditionally relied on standardized instruments such as multiple-choice questions and Likert scales. While these methods offer high reliability, they often struggle to capture the complexity of human cognition and behavior in real-world scenarios. The emergence of generative artificial intelligence, particularly Large Language Models (LLMs) like GPT-4, provides a transformative opportunity to bridge this gap. By enabling automated assessment and interactive testing, these models allow for a more nuanced evaluation of constructs such as critical thinking, creativity, and social intelligence.

Theoretical Framework for AI-Driven Assessment

The integration of machine learning into psychometrics necessitates a re-evaluation of classical test theory (CTT) and item response theory (IRT). In an AI-driven context, the "item" is no longer a static prompt but a dynamic interaction.

  1. Automated Item Generation (AIG): LLMs can be prompted to generate vast pools of test items based on specific cognitive blueprints. This ensures content validity while preventing item exposure and cheating.
  2. Automated Scoring: Beyond simple keyword matching, deep learning models can evaluate open-ended responses by understanding semantic meaning, coherence, and latent psychological traits.
  3. Interactive Testing: Unlike traditional Computerized Adaptive Testing (CAT), interactive testing involves a back-and-forth dialogue between the examinee and the AI agent, allowing the assessment to probe deeper into the respondent's reasoning process.

Methodology: Implementing Interactive Psychometric Models

To implement an interactive assessment, we utilize a multi-layered architecture where the LLM serves as both the administrator and the evaluator.

[FIGURE:1]

As illustrated in [FIGURE:1], the system architecture consists of a prompt-construction layer that frames each exchange, an LLM agent that administers the dialogue, and an evaluation module that scores the examinee's responses.
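As a rough illustration of this dual-role design, the following Python sketch alternates between an administering call and an evaluating call. `ask_llm`, the prompts, and the 0-4 rubric are hypothetical stand-ins, not the API of any particular system.

```python
# A minimal sketch of the dual-role loop: one LLM call administers the next
# probe, another evaluates the response. `ask_llm` is a placeholder stub.

def ask_llm(system: str, user: str) -> str:
    """Stand-in for any chat-completion client; replace with a real call."""
    return "2" if "rater" in system else "What made that situation difficult?"

def run_interactive_assessment(construct: str, max_turns: int = 3) -> list:
    transcript = []
    probe = f"Describe a recent situation that tested your {construct}."
    for _ in range(max_turns):
        answer = input(f"{probe}\n> ")            # examinee's free-text reply
        score = ask_llm(
            system="You are a strict psychometric rater; return a 0-4 rubric score.",
            user=f"Construct: {construct}\nResponse: {answer}",
        )
        transcript.append({"probe": probe, "answer": answer, "score": score})
        probe = ask_llm(                          # administrator picks follow-up
            system="You administer an adaptive interview; ask one follow-up question.",
            user=f"Transcript so far: {transcript}",
        )
    return transcript

# Usage (interactive): transcript = run_interactive_assessment("resilience")
```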

1 Introduction

Psychometrics, as one of the foundational fields of psychological research, is dedicated to developing and refining tools and methods for assessing individual psychological traits. With societal progress and technological advancements, psychometrics faces new challenges, specifically regarding how to achieve a comprehensive improvement in the speed, precision, and ecological validity of psychological trait assessment \cite{2021}. To meet these modern requirements, artificial intelligence (AI) technology has emerged as a vital force in advancing the field.

For instance, researchers have introduced automated item generation based on machine learning and natural language processing into the test construction process to enhance efficiency \cite{Götz 2023; Hommel 2022; Laverghetta & Licato 2023}. In the context of item response theory (IRT), these technologies are being utilized to optimize parameter estimation and improve the accuracy of latent trait modeling. By leveraging these advanced computational techniques, the field is moving toward more dynamic and responsive assessment frameworks that can better capture the complexities of human behavior and mental processes in real-world settings.

Xuetao Tian and Wenjie Zhou contributed equally to this work and are co-first authors.

Recent research has integrated machine learning methods into psychometric measurement models, such as Item Response Theory (IRT) and Cognitive Diagnostic Models (CDM), to improve the precision of individual trait identification \cite{Bergner2012, Martínez-Plumed2016, Pliakos2019, 2023}. However, the primary limitation of existing assessment methods that incorporate artificial intelligence lies in their heavy reliance on large volumes of high-quality labeled data. Whether used to guide test development or to train scoring models, the data acquisition process is both time-consuming and labor-intensive \cite{Ersozlu2024}. Furthermore, due to constraints in data volume, the generalization capabilities of these models are often poor; they frequently perform well on one specific test but poorly on others, failing to adapt to new task requirements \cite{Janiesch2021}.

The rapid development of Generative Large Language Models (Generative LLMs, or LLMs) presents new opportunities and transformative potential for psychometrics. LLMs are an artificial intelligence technology pre-trained on massive corpora, capable of capturing complex contextual semantic information and supporting fine-tuning for specific scenarios \cite{Zhao2023}. Psychometric research built on LLMs enables intelligent support across data acquisition, analysis, and feedback, making psychological assessment more efficient and precise. Additionally, LLMs can generate highly flexible and diverse natural language, opening considerable room for the evolution of psychometrics. Whether LLMs can replace test developers, proctors, scorers, feedback providers, or even test-takers has sparked extensive exploration and debate within the field \cite{Buongiorno2024, Goretzko2022, Pellert2024}.

By reviewing the developmental history of psychometrics and its interdisciplinary research with artificial intelligence, this paper identifies the significant advantages of LLMs in empowering psychometric research and application, particularly regarding interaction modes, content understanding, and scoring methods. At the same time, it notes that LLMs still face technical challenges in terms of stability, creativity, and scalability.

Furthermore, by simulating trait-behavioral contexts, constructing standardized agent interactions, implementing dynamic emotional recognition dialogues, and simulating expert reasoning, the integration of LLMs with psychometrics shows promising application prospects. Key breakthrough areas include the generation of Situational Judgment Tests (SJTs), the assessment of Collaborative Problem Solving (CPS) skills, intelligent diagnosis and treatment in mental health, and the quality analysis of test items.

Generative Large Language Models (LLMs) are a class of artificial intelligence models pre-trained on large-scale text corpora. Their core capability lies in understanding context and generating coherent, natural language text.

In the initial stages of development, vast quantities of data without task-specific labels are utilized. This approach leverages the power of self-supervised learning to extract fundamental patterns and representations from the raw data. By processing these massive datasets, the model can develop a broad understanding of the underlying structures—whether in natural language, images, or other modalities—before being fine-tuned for specific downstream applications. This foundational phase is critical for building robust models that can generalize across a wide variety of tasks with minimal additional supervision.

Pre-Training

In the field of machine learning, pre-training refers to the process of training a model on a large-scale dataset to learn general feature representations before fine-tuning it for a specific downstream task. This approach has become a cornerstone of modern deep learning, particularly in natural language processing (NLP) and computer vision (CV). By leveraging vast amounts of unlabeled or broadly labeled data, pre-training allows models to capture fundamental patterns, structures, and semantics that are transferable across different domains.

The Mechanism of Pre-Training

The core philosophy behind pre-training is transfer learning. Instead of initializing a model with random weights—which often requires an enormous amount of task-specific labeled data to converge—pre-training provides a "warm start." During this phase, the model is typically trained using self-supervised learning objectives. For instance, in NLP, models like BERT or GPT are trained to predict masked words or the next token in a sequence. Through these tasks, the model develops a sophisticated understanding of syntax, grammar, and even world knowledge.
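The next-token objective can be illustrated with a toy bigram model: every position in raw text supplies its own training label, which is what makes the setup self-supervised. The counting "model" below is a deliberately simple stand-in for a neural network.

```python
# Toy illustration of the self-supervised next-token objective used in
# GPT-style pre-training: labels come from the raw text itself.

import math
from collections import Counter, defaultdict

corpus = "the model learns language by predicting the next token in the text"
tokens = corpus.split()

# "Training": estimate P(next | current) by counting bigrams in the corpus.
bigrams = defaultdict(Counter)
for cur, nxt in zip(tokens, tokens[1:]):
    bigrams[cur][nxt] += 1

def next_token_prob(cur: str, nxt: str) -> float:
    total = sum(bigrams[cur].values())
    return bigrams[cur][nxt] / total if total else 0.0

# The pre-training loss is the average negative log-likelihood of each
# actual next token under the model's predictions.
nll = [-math.log(max(next_token_prob(c, n), 1e-9))
       for c, n in zip(tokens, tokens[1:])]
print(f"avg next-token NLL: {sum(nll) / len(nll):.3f}")
```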

Advantages of Pre-Training

Pre-training offers several critical advantages in academic and industrial applications:

  • Reduced Data Requirements: Fine-tuning a pre-trained model requires significantly fewer labeled examples than training a model from scratch. This is particularly vital for specialized fields where high-quality labeled data is scarce or expensive to obtain.
  • Improved Generalization: Models that have seen a diverse range of data during pre-training tend to generalize better to unseen data and are more robust against noise.
  • Computational Efficiency: Although the initial pre-training phase is computationally intensive, the resulting "foundation model" can be adapted to numerous downstream tasks with relatively low additional computational cost.

Applications and Evolution

The evolution of pre-training has led to the emergence of "Foundation Models." In computer vision, models pre-trained on datasets like ImageNet have long been used as feature extractors for object detection and segmentation. In natural language processing, the shift from word embeddings (like Word2Vec) to contextualized representations (like RoBERTa and T5) has revolutionized the state-of-the-art across nearly all benchmarks. Recently, multi-modal pre-training—which involves training on both images and text simultaneously—has enabled models to perform complex tasks such as image captioning and visual question answering with unprecedented accuracy.

The model is trained using extensive text data. This process is designed to enable the model to learn universal linguistic patterns, factual knowledge, and semantic relationships. Subsequently, a smaller dataset specifically related to a target task is used for fine-tuning.

Fine-Tuning

Fine-tuning is a critical technique in deep learning where a model pre-trained on a large-scale dataset is further trained on a smaller, domain-specific dataset. This process allows the model to leverage the general features learned during the initial pre-training phase while adapting its parameters to the nuances of a specific target task. By starting from a pre-trained state rather than random initialization, fine-tuning significantly reduces the computational resources and time required for convergence, while often achieving superior performance on tasks with limited labeled data.

In practice, fine-tuning typically involves replacing the final output layer of the pre-trained network with a new layer tailored to the number of classes or the specific output format of the target task. During the training process, the learning rate is usually set to a much smaller value than that used during pre-training to prevent the destruction of the previously learned representations. Depending on the size of the target dataset, one may choose to freeze the weights of the earlier layers (feature extraction) and only update the later layers, or update all parameters across the entire network (full fine-tuning). This flexibility makes fine-tuning an essential strategy for deploying state-of-the-art models in specialized fields such as medical imaging, natural language processing, and autonomous driving.
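A compact PyTorch sketch of this head-replacement and freezing pattern follows, assuming torchvision is available; the four-class head and the learning rate are arbitrary illustrative choices, not a recommended recipe.

```python
# Freezing a pre-trained backbone and training only a new task head.

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained backbone

for param in model.parameters():                   # freeze earlier layers
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 4)      # new task-specific head;
                                                   # its fresh params do train

# A small learning rate protects the pre-trained representations if more
# layers are later unfrozen for full fine-tuning.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```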

The dataset is used to further train the model, enabling it to better adapt to and complete specific tasks. A foundational architecture in deep learning models is the Transformer, which serves as the technical cornerstone of generative large language models. The Transformer architecture utilizes the Attention Mechanism, allowing the model to weigh the importance of different words when processing text. This capability facilitates a profound understanding of long-distance dependencies and complex contextual relationships.
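The attention mechanism referred to above reduces to a small computation, sketched here in NumPy: each position scores every other position by query-key similarity and returns a weighted mixture of the values. The toy sizes are arbitrary.

```python
# Scaled dot-product attention, the core Transformer operation.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # weighted value mixture

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                                # toy dimensions
Q, K, V = (rng.normal(size=(seq_len, d_model)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)     # (6, 8)
```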

The model is capable of simultaneously processing and understanding multiple diverse data types (Multimodal), such as text, images, audio, and video. Multimodal learning enables the deep integration and fusion of information derived from these disparate sources.

Users only need to provide a small number of task examples or clear instructions within the prompt to achieve the desired output.

In-Context Learning

In-Context Learning (ICL) has emerged as a core paradigm for utilizing Large Language Models (LLMs). Unlike traditional supervised learning, which requires explicit parameter updates through backpropagation, ICL allows models to perform new tasks by simply providing a few examples or instructions within the input prompt. This capability demonstrates the remarkable adaptability of transformer-based architectures when scaled to billions of parameters.

1. Definition and Mechanism

At its core, In-Context Learning refers to the ability of a pre-trained model to learn from the "context" provided in the prompt. A typical ICL prompt consists of a task description, several input-output pairs (demonstrations), and a new query. The model then predicts the completion for the query by identifying patterns within the provided examples.

Mathematically, given a set of demonstrations $C = \{(x_1, y_1), (x_2, y_2), \dots, (x_k, y_k)\}$ and a new input $x_{test}$, the model aims to predict:
$$P(y_{test} | C, x_{test})$$
where the model parameters $\theta$ remain fixed. This process bypasses the need for task-specific fine-tuning, making it highly efficient for rapid prototyping and deployment across diverse domains.
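A prompt of exactly this form can be assembled mechanically, as in the following sketch; the affect-labeling task and the demonstrations are invented for illustration.

```python
# Assembling an ICL prompt: task description, k demonstrations, then the query.

demos = [
    ("I can't stop worrying about exams.", "anxiety"),
    ("Nothing feels enjoyable anymore.", "low mood"),
]

def build_icl_prompt(demonstrations, x_test: str) -> str:
    lines = ["Label each statement with the dominant affective theme.", ""]
    for x, y in demonstrations:
        lines.append(f"Statement: {x}\nLabel: {y}\n")
    lines.append(f"Statement: {x_test}\nLabel:")
    return "\n".join(lines)

print(build_icl_prompt(demos, "I keep replaying that argument in my head."))
# The model completes the final "Label:" with fixed parameters — no
# gradient update occurs, matching P(y_test | C, x_test) above.
```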

2. Key Components of ICL

The performance of In-Context Learning is highly sensitive to the configuration of the prompt. Several factors play a critical role in its effectiveness:

  • Demonstration Selection: The choice of examples $(x_i, y_i)$ significantly impacts accuracy. Selecting examples that are semantically similar to the test query often yields better results.
  • Example Ordering: Research has shown that the order in which demonstrations are presented can lead to drastic variations in performance, a phenomenon sometimes referred to as "prompt instability."
  • Formatting and Templates: The way instructions are phrased and the delimiters used to separate examples can influence the model's ability to parse the task structure.

3. Theoretical Perspectives

Despite its empirical success, the underlying mechanics of why ICL works remain a subject of active research. Several prominent theories have been proposed:

  1. Implicit Fine-tuning: Some researchers suggest that the forward pass of a Transformer can be viewed as an implicit gradient descent process, where the attention mechanism updates internal representations similarly to how weights are updated during training.
  2. Bayesian Inference: Other researchers frame ICL as implicit Bayesian inference, in which the demonstrations help the model infer a latent task concept shared by the examples and condition its prediction for the new query on that inferred concept.

By providing specific instructions, the model can adapt to and solve new, similar tasks without requiring any permanent modifications to the model itself. Chain-of-Thought (CoT) is a technique used to facilitate more complex reasoning. By requiring the model to "think step-by-step" within the prompt and demonstrate its reasoning process, its accuracy in logic, mathematics, and complex reasoning tasks can be significantly improved.
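A minimal example of such a prompt is shown below; the arithmetic word problem and wording are illustrative.

```python
# A Chain-of-Thought prompt: the step-by-step instruction is appended
# before the answer is requested.

question = (
    "A test has 40 items. A student answers 70% of the first 20 and "
    "50% of the rest correctly. How many items are correct in total?"
)

cot_prompt = (
    f"{question}\n"
    "Let's think step by step, showing each intermediate calculation, "
    "then state the final answer on its own line."
)
print(cot_prompt)
# Expected reasoning: 0.7 * 20 = 14, 0.5 * 20 = 10, total 24.
```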

Unstructured data is mapped into a high-dimensional mathematical space through the use of specific encoders. Within this Embedding Space, concepts from different modalities that share semantic relevance can be represented by similar numerical vectors. For example, the word "apple" and an image of an apple will be positioned very closely to one another within this space.
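The geometry can be illustrated with cosine similarity over toy vectors. Real encoders emit hundreds of dimensions, but the computation is identical; the 4-dimensional embeddings below are invented for the example.

```python
# Semantic proximity in a shared embedding space, via cosine similarity.

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

emb = {
    "apple (word)":   np.array([0.9, 0.1, 0.8, 0.0]),
    "apple (image)":  np.array([0.8, 0.2, 0.7, 0.1]),
    "sadness (word)": np.array([-0.5, 0.9, -0.3, 0.6]),
}

print(cosine(emb["apple (word)"], emb["apple (image)"]))   # high: same concept
print(cosine(emb["apple (word)"], emb["sadness (word)"]))  # low: unrelated
```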

2 The Advantages of Generative LLMs in Empowering Psychometrics

The integration of advanced computational techniques, particularly machine learning and deep learning, into the field of psychometrics has catalyzed a significant paradigm shift. By leveraging these technologies, researchers can overcome the traditional limitations of classical test theory and item response theory, leading to more precise, efficient, and scalable psychological assessments. The following sections outline the core advantages of empowering psychometrics with modern computational approaches.

1. Enhanced Precision and Validity

Traditional psychometric models often rely on linear assumptions and a limited number of variables. In contrast, machine learning algorithms can capture complex, non-linear relationships within high-dimensional data. This capability allows for the identification of subtle patterns in behavioral data that traditional methods might overlook. By incorporating diverse data sources—such as response latencies, process data, and even physiological signals—computational psychometrics enhances the ecological validity of assessments, ensuring that the constructs being measured more accurately reflect real-world psychological states.

2. Adaptive and Personalized Assessment

One of the most significant advantages is the advancement of Computerized Adaptive Testing (CAT). While CAT has existed for decades, modern algorithms allow for more sophisticated item selection strategies. By utilizing reinforcement learning and deep generative models, assessments can dynamically adapt to a test-taker's ability level in real-time with unprecedented efficiency. This reduces the "test-taker burden" by minimizing the number of items required to achieve a specific level of measurement precision, thereby preventing fatigue and maintaining high levels of engagement.

3. Automated Content Generation and Scoring

The application of Natural Language Processing (NLP) has revolutionized the way psychological instruments are developed and evaluated. Large Language Models (LLMs) can assist in the automated generation of test items, ensuring they meet specific semantic and difficulty criteria. Furthermore, automated scoring systems for open-ended responses, essays, and clinical interviews provide a level of consistency and scalability that human raters cannot match. This reduces subjective bias and allows for the rapid processing of large-scale assessments in educational and organizational settings.

4. Real-time Monitoring and Longitudinal Analysis

Empowered psychometrics facilitates "passive sensing" and continuous monitoring through mobile and wearable devices. Unlike traditional "snapshot" assessments that capture a single point in time, these technologies allow for the collection of longitudinal data. This enables researchers to track fluctuations in mental health, cognitive performance, or personality states over time. Advanced time-series analysis and recurrent neural networks can then be used to predict future psychological outcomes, providing opportunities for early intervention.

2.1 Generative LLMs Transform the Interaction Methods of Psychological Testing

With the advancement of technology, the interaction formats of psychological testing have undergone significant transformations, evolving from paper-and-pencil tests to computerized assessments, and now to modern conversational intelligence technologies that allow individuals to interact with computers through dialogue. This section briefly introduces the historical development and characteristics of testing formats, exploring how new interaction methods driven by Large Language Models (LLMs) are advancing the psychometric paradigm.

1) Paper-and-Pencil Psychological Testing

Paper-and-pencil testing has a long history and represents the earliest form of psychological assessment, utilizing paper and pen as the primary media for interaction. It is widely recognized in the field of psychology that Binet and Simon constructed the world's first modern intelligence test in 1905 \cite{Boake2002, Matarazzo1992}. However, records indicate that paper-based ability tests were used extensively in China long before this \cite{Yan2020}. Paper-and-pencil tests typically follow a fixed format where examinees answer a pre-set sequence of items. These item formats include objective types such as multiple-choice, true/false, matching, and fill-in-the-blank, as well as open-ended subjective questions requiring written descriptions \cite{Berry2008}.

2) Computerized Psychological Testing

With the proliferation of computers and the internet, computerized testing has made interaction methods more flexible. Beyond the conventional items found in paper-and-pencil tests, computerized assessments allow for more complex items that better align with real-world scenarios. For example, the Programme for International Student Assessment (PISA) added a problem-solving test in 2012, utilizing word problems that mirror life scenarios to provide more possibilities for assessing higher-order abilities \cite{OECD2013}. In 2015, it employed fixed human-computer interaction items to achieve gamified assessment of collaborative problem-solving skills \cite{OECD2017}. Gamified assessment and game-based assessment (GBA) are also significant products of the evolution of computerized testing. By integrating the fun and engagement of games, these methods can effectively reduce test anxiety \cite{DeRosier2019, Mavridis2017}. Process data and scoring data generated within games can be used to evaluate a subject's personality traits and cognitive abilities, demonstrating substantial potential for assessment applications \cite{Haizel2021, Landers2016, Lumsden2018}. Furthermore, Computerized Adaptive Testing (CAT), which combines computer technology with Item Response Theory (IRT), can instantaneously select items suited to an examinee's ability level based on their responses. This reduces the number of unnecessary items, shortens testing time, improves efficiency, and enables equating across different items and time points.

CAT is currently widely used in large-scale international examinations such as the GRE (Graduate Record Examinations), GMAT (Graduate Management Admission Test), and Duolingo English Test.
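The CAT selection rule can be sketched under a two-parameter logistic (2PL) IRT model: administer the unused item whose Fisher information is highest at the current ability estimate. The item parameters below are invented for illustration.

```python
# Maximum-information item selection under a 2PL IRT model.

import math

ITEM_BANK = [  # (item_id, discrimination a, difficulty b)
    ("i1", 1.2, -1.0), ("i2", 0.8, 0.0), ("i3", 1.5, 0.3), ("i4", 1.0, 1.2),
]

def p_correct(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta: float, a: float, b: float) -> float:
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)          # 2PL Fisher information

def next_item(theta: float, administered: set):
    candidates = [it for it in ITEM_BANK if it[0] not in administered]
    return max(candidates, key=lambda it: information(theta, it[1], it[2]))

print(next_item(theta=0.2, administered={"i1"}))  # most informative item
```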

3) Psychological Testing Empowered by Generative Large Language Models

Human-computer interaction driven by LLMs will profoundly influence psychometric applications. In previous computerized psychological tests, computers primarily received human information through command-based interaction modes. For instance, in game-based assessments, examinees convey their decisions to the computer via a mouse or keyboard.

While this interaction method is effective, its limited operational space makes it difficult for the resulting data to comprehensively reflect the subject's psychological traits and decision-making processes. With LLMs, the interaction mode of psychological testing will become more natural and spontaneous. A significant advantage of LLMs lies in their ability to conduct natural language interaction, thereby greatly expanding the computer's capacity to capture human psychological information. Through dialogue with the subject, LLMs can simultaneously extract various linguistic features such as tone, semantics, and sentence structure \cite{Kjell2024, Li2023}. This not only helps identify the subject's emotional state, cognitive load, and motivation levels but also captures more nuanced psychological changes through multi-turn dialogues. For example, when dealing with stress, LLMs can evaluate a subject's linguistic performance under different stress levels through continuous conversation, thereby more accurately measuring their coping strategies and psychological resilience.

By leveraging robotic agents, LLMs can also simulate diverse scenarios and dynamically adapt to different testing requirements. This new interaction model significantly enhances the flexibility of psychological test administration.

LLM-driven dialogues and robotic agents can engage in deep conversation with subjects by playing various social roles—varying by age, profession, and educational background. This proactively stimulates the subject's psychological reactions, thereby obtaining richer psychological information and achieving more precise and personalized psychological assessment \cite{Cui2024, Kharitonova2024}. This shift is driving psychometrics to evolve from traditional testing models toward more intelligent and ecological directions.

2.2 Generative LLMs' Breakthroughs in Content Understanding for Psychological Testing

As psychometrics continues to evolve, researchers are faced not only with the processing of structured data but also with the necessity of effectively managing the complexities of unstructured data, such as interview transcripts, counseling dialogues, and audiovisual materials. These unstructured data sources typically contain rich semantic information and emotional expressions, which are of significant importance for the in-depth analysis of an individual's psychological state, behavioral patterns, and emotional shifts. Consequently, the efficient understanding and processing of such unstructured data have become a key challenge in technological development. This section will briefly review the developmental trajectory of artificial intelligence in content understanding and its applications in psychometrics, while further elucidating the breakthroughs of generative models in processing ultra-long texts and understanding multimodal data.

Breakthroughs in Long-Text Understanding

Textual data represents one of the most common and easily collected data types in psychological research.

In psychometric research, text analysis has emerged as a vital methodology. Mining information from textual data enables the analysis of linguistic features associated with different mental health states \cite{Eichstaedt 2018; 2022}, supports personality prediction \cite{Majumder 2017; Rahman 2019; 2021}, and facilitates the assessment of social-emotional constructs.


Understanding individual differences through text analysis is built upon the foundation of text representation (Antypas 2023; Vosoughi 2018). Its technical development has progressed through four distinct stages: word-based, topic-based, word vector-based, and pre-trained language models. Traditional text representation techniques, such as Bag-of-Words and TF-IDF (Term Frequency-Inverse Document Frequency), were among the earliest technologies applied to text data processing. These models perform basic statistical analysis and feature extraction by representing text as an unordered collection of words. Although widely used, these methods have significant limitations in their ability to handle long texts and complex semantics; they fail to capture contextual relationships between words, making it difficult to understand polysemy, synonyms, and context-dependent text (Asudani 2023; Ludwig 2021; Zhang 2010). To overcome these deficiencies, distributed models were introduced. These models represent words as high-dimensional vectors to capture semantic relationships, marking a significant advancement in text processing. Latent Semantic Analysis (LSA) is a representative example of these distributed models.

LSA maps high-dimensional text data into a low-dimensional latent semantic space by performing Singular Value Decomposition (SVD) on a word-frequency matrix, thereby revealing the underlying semantic relationships between texts and words (Deerwester 1990). To address issues such as the neglect of word order, logic, or morphology (Landauer 1998), models like Word2Vec and GloVe were proposed, which greatly enhanced the capacity for semantic understanding. These advancements allowed researchers to more accurately analyze latent semantic differences in open-ended responses (Dipietro 2008; Jatnika 2019; Zhang 2015; Pennington 2014; Uymaz & Metin 2022), consequently improving the accuracy of psychological assessment results (Foltz 2023; Sonabend 2020). However, as the complexity of data scenarios increased, these methods still struggled to meet the demands of processing ultra-long text data, particularly when deep contextual understanding and the integration of information from multiple data sources were required.
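The LSA pipeline just described can be reproduced in a few lines with scikit-learn: a TF-IDF document-term matrix is reduced by truncated SVD into a low-dimensional semantic space. The three toy responses are illustrative.

```python
# LSA in miniature: TF-IDF term weights reduced by SVD to a latent space.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

responses = [
    "I feel hopeful about the future",
    "The future seems bright and full of hope",
    "I am anxious and cannot sleep at night",
]

tfidf = TfidfVectorizer().fit_transform(responses)   # document-term matrix
lsa = TruncatedSVD(n_components=2, random_state=0)   # SVD -> latent space
coords = lsa.fit_transform(tfidf)

print(coords.round(2))  # the two hope-themed responses land close together
```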

The introduction of Pre-trained Language Models (PLMs) marked a new phase in natural language processing. Building on distributed models, PLMs can dynamically adjust word representations based on context, enabling the capture of complex semantic relationships. ELMo (Embeddings from Language Models) is a representative example that utilizes Long Short-Term Memory (LSTM) networks to capture the dynamic changes of words within a context, ensuring that each word's representation depends not only on itself but also on its surrounding text (Peters 2018). While this dynamic embedding method significantly improved performance across various NLP tasks, the sequential processing nature of LSTMs is inefficient for long texts and difficult to parallelize. BERT (Bidirectional Encoder Representations from Transformers) achieved deep contextual understanding by adopting the Transformer architecture (Vaswani 2017) and a bidirectional encoding mechanism. Unlike traditional models, BERT employs Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks during pre-training, allowing it to excel in a wide range of NLP tasks (Devlin 2019). In contrast, the GPT (Generative Pre-trained Transformer) models focus on generating coherent subsequent text from a given input. GPT uses a unidirectional Transformer to generate text through sequence prediction tasks (Radford 2018). Although this unidirectional structure has certain limitations when handling bidirectional contextual dependencies, its generative capabilities make it outstanding for many generative tasks.

Context-aware models, represented by GPT-1, learn fundamental language abilities through large-scale unlabeled text data during the pre-training phase. In the fine-tuning phase, these models utilize labeled data to adapt to specific tasks, better addressing downstream NLP tasks such as cloze tests or summarization. Subsequently, a series of pre-trained language models were developed; however, prior to the advent of Large Language Models (LLMs), these models were consistently constrained by input text length, making them difficult to apply in ultra-long contextual environments.

LLMs such as GPT-3/4 and LLaMA have provided new avenues for text analysis in psychological research, particularly for long-form data (Acheampong 2021). Unlike early pre-trained models, LLMs no longer rely on fine-tuning with downstream task data to solve specific problems. Instead, they utilize In-Context Learning (ICL) to solve new NLP tasks using a few examples provided during the interaction process (Brown 2020). The leap in long-text understanding is primarily due to synergistic progress in algorithms, encoding, and hardware. At the algorithmic level, researchers developed more efficient attention mechanisms, such as FlashAttention (2022), to reduce computational complexity and enable the processing of longer text data. Regarding encoding, techniques like Relative Position Embedding and Attention with Linear Biases (ALiBi) were proposed to enhance the model's extrapolation capabilities for unseen long texts. At the hardware level, the development of large-capacity GPU memory, tensor parallel computing, and memory optimization strategies provided the physical foundation for running large models with extended context windows. This series of technical advancements has driven an exponential growth in context length; for instance, OpenAI's context window has expanded rapidly from 2,048 tokens in GPT-3 to 8k/32k in GPT-4, and further to 128k in GPT-4o, enabling the model to process an entire book of 100,000 words in a single pass.

LLMs demonstrate powerful capabilities in ultra-long text understanding. Through massive pre-training corpora and complex network architectures, they maintain semantic consistency over long sequences and capture subtle semantic shifts in multi-turn dialogues or long narratives, thereby maintaining superior contextual coherence. Furthermore, by pre-training on ultra-large-scale text data, LLMs have accumulated rich world knowledge. Compared to traditional machine learning models, these LLMs not only understand broad and diverse contexts but also exhibit a profound grasp of world knowledge during task execution. According to the CompassRank leaderboard, the average scores of GPT-4o on the New Curriculum Standard Gaokao (China's National College Entrance Examination) exceeded the undergraduate admission line for Guangdong Province, with the Chinese language subject reaching an elite level. This demonstrates that the latest generation of LLMs possesses both a rich knowledge base and general-purpose problem-solving capabilities. Even without optimization for specific downstream tasks, they still exhibit performance far exceeding that of traditional models. This versatility allows LLMs to gradually replace specific solutions for many traditional tasks in NLP, such as machine translation and text retrieval, while simultaneously opening new research paradigms for psychology.

LLMs have shown advantages in identifying mental health status and political orientation (Acheampong 2021; Brady 2021; 2022; 2024; 2023). When facing complex tasks, LLMs also exhibit strong reasoning capabilities, solving problems involving intricate knowledge relationships and mathematical reasoning (2024; Huang & Chang 2023). Their ability to understand long texts allows them to be applied in psychometric scenarios to capture the true intentions of subjects expressed in natural language and to execute complex task instructions (Rathje 2024), providing technical support for high-ecology measurement tasks.

Breakthroughs in Multimodal Data Understanding

As technology evolves, psychometrics is gradually expanding to collect and analyze various complex and diverse multimodal data. This includes text data from interview transcripts and open-ended questions; process data such as movement paths, click behavior, and response times from virtual reality or gamified tests; physiological data like heart rate, galvanic skin response, EEG, and respiration rate collected by biosensors and wearables; voice data including pitch, speed, volume, pauses, and emotional expression; and visual information from video recordings such as facial expressions, body posture, and eye-tracking. In psychometrics, multimodal data provides richer individual information, which is of significant value especially when assessing complex psychological traits (2020; Obrenovic & Starcevic 2004; Palumbo 2020; Sharma & Giannakos 2020). However, effectively integrating and analyzing these different modalities has remained a challenge in psychology. With the widespread application of speech analysis and computer vision, the capacity to understand multimodal data in psychometrics has been greatly enhanced. For example, speech analysis technology can identify and analyze pronunciation, intonation, and tone, and is already widely used in language tests such as the TOEFL and Mandarin Proficiency Test (Huawei & Aryadoust 2023; Palanivinayagam 2023). Computer vision has also played a vital role in automated interview scoring and movement assessment based on audio and video (Debnath 2022; Haizel 2021; Silva 2021; Zhang 2024). These applications not only improve the intelligence of psychological testing but also support the diversification of measurement methods and the precision of results. However, current multimodal understanding is largely limited to the feature extraction level, making it difficult to achieve true multimodal interaction between humans and computers.

The emergence of Multimodal Large Language Models (MLLMs) provides new solutions for the understanding and integration of multimodal data. Existing multimodal research generally adopts feature fusion methods—for instance, simultaneously extracting non-verbal information such as facial expressions, body movements, and vocal tones alongside text data to form a psychological profile of a subject (Wang 2024). However, this approach struggles to capture the intricate relationships between different modalities. By combining specialized multimodal encoders with a shared Transformer backbone, MLLMs map various inputs like images and audio into the same semantic embedding space, achieving deep fusion and collaborative reasoning across modalities (Wang 2025). Models such as GPT-4 and PaLM2 can already process text and visual information simultaneously in a conversational format, performing excellently across numerous multimodal tasks and successfully extracting meaningful psychological trait features (2024; 2023). As psychometrics continues to evolve, researchers will face more challenges in processing unstructured data that contains rich semantic information and emotional expression, which are crucial for a deep understanding of individual psychological states and behavioral patterns.

Breakthroughs in multimodal data understanding provide a new path for addressing these challenges.

2.3 Generative LLMs Broaden Psychometric Scoring Methods

With the advancement of psychometrics, scoring and evaluation methods have undergone significant transformations. First, the primary agent of evaluation has shifted from a reliance on human graders toward automated computer scoring. This transition has improved scoring efficiency and consistency while reducing misjudgment and unfairness. More importantly, the focus of evaluation has gradually shifted from a traditional result-oriented approach to a process-oriented one. Process-oriented evaluation enables real-time assessment and adjustment of test items at the measurement level, while simultaneously collecting and analyzing process data to provide a more comprehensive measurement of an individual's psychological state. From a temporal perspective, process evaluation integrates information from multiple observations to establish a dynamic profile of an individual's psychological development, which better facilitates personal growth. This section focuses on typical application scenarios of psychometrics in educational evaluation, tracing the development of educational test scoring technologies to explore the critical role played by current psychometric evaluation methods.

1) Technical Development from Subjective Human Scoring to Automated Scoring

The scoring methods of educational tests have undergone a significant transformation from a dependence on human scoring to automated scoring. In early paper-and-pencil tests, scoring primarily relied on the manual review of examinees' answers by teachers or evaluators. Whether for objective or subjective items, graders assigned scores based on preset standards. This approach had significant limitations, including subjective bias, issues with inter-rater consistency, and heavy workloads \cite{2014}. To address these issues, computer technology was gradually introduced into the field of psychometric scoring.

Initial automated scoring technologies were mainly applied to objective items, relying on Optical Character Recognition (OCR) to quickly identify and score large volumes of multiple-choice and true-false questions \cite{Alomran & Chai, 2018; McKenna, 2019; Memon, 2020}. While this method greatly improved scoring efficiency, it remained limited to objective assessments. With progress in Natural Language Processing (NLP) and machine learning, automated scoring gradually expanded to subjective items. Early text similarity measurement methods utilized features such as Bag-of-Words (BoW) and TF-IDF to achieve basic automated scoring by calculating the similarity between an examinee's response and a standard answer \cite{Ramnarain-Seetohul, 2022; Wang, 2022}. However, these methods struggled when processing complex semantics \cite{Dai, 2024}. With the introduction of supervised learning algorithms—such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and sequence regression—automated scoring technology advanced further. By training on large amounts of labeled data, these models can score more complex subjective items, significantly improving accuracy and consistency \cite{Chen & Zhou, 2019; Liang, 2018}. Nevertheless, these models depend on vast quantities of high-quality training data and still show limited performance when dealing with diverse item types and data \cite{Devine, 2023}. Large Language Models (LLMs) can not only process complex natural language text but also possess excellent semantic understanding capabilities, demonstrating great potential for scoring subjective items.
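The early similarity-scoring approach can be sketched as follows, assuming scikit-learn is available; the standard answer, the response, and the use of raw cosine similarity as a score are illustrative. The example also exposes the paraphrase weakness noted above.

```python
# Similarity-based scoring: TF-IDF vectors for the examinee response and
# the standard answer, with cosine similarity as a raw score.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

standard_answer = "Reinforcement increases the future probability of a behavior."
response = "A behavior becomes more likely when it is reinforced."

X = TfidfVectorizer().fit_transform([standard_answer, response])
similarity = cosine_similarity(X[0], X[1])[0, 0]

print(f"similarity score: {similarity:.2f}")
# The known weakness: a paraphrase with little word overlap scores low
# even when it is semantically equivalent to the standard answer.
```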

Models such as GPT-3/4, pre-trained on massive corpora, can accurately discern subtle semantic differences in answers, thereby achieving more precise test scoring \cite{Fernandez, 2022; 2024; Ludwig, 2021; 2024; Takano & Ichikawa, 2022; Yancey, 2023; 2022}. In educational testing empowered by LLMs, the system does not merely receive information passively; it can also autonomously adjust the assessment process based on examinee feedback. It can analyze responses in real-time and dynamically adjust the difficulty, content, or context of questions \cite{Chiu, 2024; Velthoven, 2018; Zhang, 2024}. This adaptive assessment process is expected to capture examinees' latent traits more effectively. In cognitive ability tests, LLMs can adjust item complexity based on an examinee's immediate performance, ensuring that assessment results truly reflect their ability level \cite{2024}. Such adaptive testing not only improves efficiency but also reduces examinee fatigue and anxiety, leading to more reliable data. Furthermore, LLMs can promote student progress through automated feedback. These models can generate detailed feedback to help examinees understand their deficiencies and provide suggestions for improvement, thereby enhancing learning outcomes \cite{Alomran & Chai, 2018; Gabbay & Cohen, 2024; Shaik, 2022}. Feedback systems powered by LLMs have shown significant performance in improving individual positive emotions and creativity \cite{Dong, 2024; Meyer, 2024; Sharma, 2023; Stamper, 2024}.

2) Technical Development from Result Evaluation to Process Evaluation

Traditional educational testing typically focuses on result evaluation, which emphasizes the assessment of the examinee's final answer. Whether for multiple-choice, true-false, or subjective open-ended questions, the grader's task is to evaluate the response based on a preset correct answer or scoring rubric \cite{Berry, 2008; Landauer, 1998}. This result-oriented evaluation often ignores the thinking processes, emotional reactions, and behavioral patterns demonstrated by the examinee during the task \cite{Schulte-Mecklenbeck, 2011}. Consequently, while result evaluation can provide a score for final performance, it fails to comprehensively reflect the examinee's psychological state and cognitive processes during the test. This limitation is particularly evident in assessments requiring the evaluation of complex psychological constructs, such as problem-solving strategies. With the popularization of computer technology and advancements in psychometric models, process evaluation has gradually become an essential component of testing. Process evaluation focuses not only on the final answer but also integrates various data generated during the testing process, such as response times, mouse clicks, and eye-tracking trajectories. These process data provide detailed information regarding the examinee's response behavior, such as their thinking paths and cognitive load \cite{2020; Jiao, 2023; 2019; 2022}. By analyzing process data, researchers can more comprehensively assess the examinee's psychological state and behavioral patterns. In problem-solving tests, process evaluation considers not only whether the correct answer was given but also analyzes the steps, response times, and sequence of operations to infer problem-solving strategies and cognitive complexity \cite{Chen, 2020; 2019; 2021}. With the emergence of Multi-modal Large Language Models (MLLMs), which can simultaneously process and understand data in various forms such as text, images, and audio, this cross-modal processing capability allows psychological assessment to evaluate an individual's psychological state more comprehensively and accurately \cite{Dong, 2024}. MLLMs can synthesize process-based multimodal information—such as written responses, vocal tone, and facial expressions—to comprehensively evaluate emotional states and mental health conditions \cite{2023; 2024; Zhang, 2024}. In summary, the introduction of LLMs has greatly advanced the development of psychometric scoring methods, and MLLM technology is currently driving the evolution of testing.

3 The Challenges of Generative LLMs in Empowering Psychometrics

The advantages of Generative Large Language Models (LLMs) in interaction modalities, content understanding, and evaluation methodologies can play a significant role in the field of psychometrics, bringing numerous innovations to psychometric research and application. However, it must be recognized that current technological developments still face several limitations. This section will primarily discuss the challenges existing in areas such as stability, creativity, scalability, ethics, data security, and operational costs. Furthermore, it will summarize current research approaches to addressing these issues to better evaluate the possibilities of LLM-empowered psychometrics.

3.1 Stability Issues of Generative LLMs

Generative Large Language Models (LLMs) have demonstrated immense potential for application; however, stability issues remain one of the greatest obstacles to their widespread adoption. These issues encompass inconsistent content output, occasional loss of context, factual or common-sense errors, and cultural or linguistic biases. First, the output of LLMs is often sensitive to minor variations in input, leading to inconsistent results. Huang et al. (2024) examined the consistency retention rate of responses before and after receiving perturbed inputs, finding that even for high-performing tasks, output accuracy decreases; the accuracy of PaLM2, for instance, declines under such perturbations, suggesting that subtle nuances in input can degrade model performance. Similarly, \cite{2024a} found that when input content is paraphrased while maintaining the same semantics, the consistency rate of LLM outputs is only 20%. This inconsistency is particularly problematic for psychological assessments that require high standardization. For example, in automated scoring tasks, an LLM might assign different scores to two identical or highly similar answers, which challenges the fairness and reliability of assessments. Simply changing the order of answers within an evaluation template can distort the LLM's evaluation: research indicates that models tend to score answers higher when they appear in specific positions, are longer, or resemble the model's own generated content \cite{2023, Zheng 2023}.

To address output consistency, \cite{2024b} proposed instruction-augmented supervised fine-tuning (SFT) and consistency alignment training. In the SFT stage, the model first generates multiple different expressions of the original instruction; these rewritten instructions are then paired with the original training data to form new augmented training samples. Through this method, the model not only generalizes better across diverse instructional expressions but also accurately understands their core semantics. In the consistency alignment stage, the model is optimized to generate consistent, expected responses by scoring multiple generated answers, further improving both the diversity and the consistency of the output. \cite{2023} proposed Balanced Position Calibration and Multiple Evidence Calibration.

The former calculates an average score by repeatedly changing the order of answers, while the latter generates multiple evaluation results for the same set of answers and integrates these results to obtain a more stable and accurate final score.
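A schematic of these two calibration ideas follows, with a simulated judge standing in for the LLM: each answer pair is scored in both presentation orders and over several rounds, and averaging the evidence cancels the simulated position bias. All numbers are invented for the demonstration.

```python
# Balanced position calibration + multiple evidence calibration, schematically.

import random
import statistics

def judge(first: str, second: str) -> tuple:
    """Placeholder LLM judge returning (score_first, score_second) on 0-10.
    The +0.5 bias toward the first slot mimics position bias."""
    bias = 0.5
    return (7 + bias + random.random(), 7 - bias + random.random())

def calibrated_scores(ans_a: str, ans_b: str, n_rounds: int = 4):
    a_scores, b_scores = [], []
    for _ in range(n_rounds):                    # multiple evidence calibration
        s_a, s_b = judge(ans_a, ans_b)           # order A, B
        a_scores.append(s_a); b_scores.append(s_b)
        s_b2, s_a2 = judge(ans_b, ans_a)         # order B, A (position swap)
        a_scores.append(s_a2); b_scores.append(s_b2)
    return statistics.mean(a_scores), statistics.mean(b_scores)

print(calibrated_scores("answer A", "answer B"))  # slot bias averages out
```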

Context loss is a stability issue frequently faced by LLMs in long dialogues or complex tasks. As a conversation progresses, the model may forget previous contextual information, leading to outputs that lack coherence and logic. This stems from the insufficient long-term memory capabilities of LLMs, making it difficult to maintain consistent memory over long durations or across sessions. This problem is particularly prominent in human-machine psychological counseling, where forgetting previous context affects the accuracy of individual psychological assessments and poses significant challenges to multi-session evaluations and dynamically changing psychometric tasks. When relevant information in the input text is located in the middle of the context, the accuracy of many LLMs drops significantly. This occurs because the attention mechanisms of models, during pre-training and fine-tuning, tend to focus more on content at the beginning and end while ignoring the middle. Specifically, as textual distance increases, the decay of attention makes it difficult for the model to focus on middle sections far from the current position, leading to information loss or misinterpretation \cite{2023}. To resolve this, researchers proposed Position-Agnostic Multi-step question decomposition training. By requiring the model to search for information at different positions and extract relevant content, this method balances the attention distribution. This approach enables the model to better handle long-text inputs, significantly improving performance across various benchmarks, especially in multi-document question-answering tasks \cite{2024}. Additionally, factual errors are another urgent issue. Although LLMs can generate plausible-sounding answers, these answers may sometimes be inaccurate or entirely false—a phenomenon commonly referred to as "hallucination" \cite{Huang 2025}. Since models cannot verify the truthfulness of generated content, this is a major flaw in psychometric tasks requiring accurate facts. LLMs rely on large-scale natural language data and often output common internet or popular book content that may not meet the standards of psychological theory or evidence. Furthermore, they may inherit and amplify existing biases in the training data; their outputs may reflect cultural and social biases, rendering LLMs potentially unreliable for producing language helpful for mental health \cite{Demszky 2023}. Current research attempts to enhance the validity of LLM output and reduce cultural and linguistic biases through methods such as training on more diverse data, introducing fact-checking and self-reflection mechanisms, connecting to external databases, and optimizing prompts.

These methods aim to mitigate the generation of hallucinations \cite{Demszky 2023, 2024; Huang 2024; 2023; Rawte 2023; Stade 2024; 2024}. Measurement invariance is another critical issue to consider when applying LLMs. As the technology iterates rapidly, the performance and output of a measurement tool built on a closed-source model (such as Gemini) may drift as the underlying model is updated \cite{Chen 2023; 2024}. This uncertainty severely undermines the stability of measurements and may alter the results of longitudinal studies. For closed-source models, regular monitoring mechanisms should be established that re-score a fixed dataset of typical responses to verify that model output remains consistent. For open-source models such as DeepSeek, pinning a fixed model version and deploying it locally are essential strategies for ensuring measurement invariance.
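As a concrete illustration of such a monitoring mechanism, the sketch below re-scores a fixed anchor set of typical responses after every model update and flags drift against stored baseline scores. Here `score_response` is a placeholder for whatever scoring prompt the instrument uses, and the tolerance threshold is an assumption to be set per instrument.

```python
import json
import statistics

def score_response(text: str) -> float:
    """Placeholder: send the instrument's fixed scoring prompt plus `text`
    to the (possibly updated) LLM and parse a numeric score."""
    raise NotImplementedError

def check_invariance(anchor_path: str, tolerance: float = 0.5) -> bool:
    """anchor_path points to a JSON list of {"text": ..., "baseline_score": ...}
    records collected when the instrument was first validated."""
    with open(anchor_path) as f:
        anchors = json.load(f)
    drifts = [abs(score_response(a["text"]) - a["baseline_score"]) for a in anchors]
    print(f"mean drift: {statistics.mean(drifts):.3f}, max drift: {max(drifts):.3f}")
    return max(drifts) <= tolerance  # fail the check if any anchor item drifts
```

Running such a check on a schedule, and before any longitudinal data collection wave, gives an auditable record of whether the "instrument" has silently changed underneath the study.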

3.2 Creativity Issues of LLMs

The application of generative Large Language Models (LLMs) in psychometrics faces significant challenges regarding creativity. Although LLMs have demonstrated immense potential in generating linguistic content—with the latest generation performing comparably to, or even outperforming, human subjects in certain dimensions of creativity tests \cite{Bellemare-Pepin 2024, Guzik 2023}—several critical issues persist. First, LLMs lack true originality; they generate content by recombining existing patterns found within their training data. Despite the vast scale of this data, their output is not based on fundamentally new ideas or genuine innovation. While their average performance on creativity tests may exceed the human mean, the dispersion of their scores is far narrower than that of human subjects, making it difficult for them to achieve exceptionally high "breakthrough" scores \cite{Hubert 2024, Bellemare-Pepin 2024}. In the context of psychometrics, this can result in assessment tools or items that lack novelty and fail to transcend existing measurement paradigms. Due to their reliance on learned structures and common patterns, LLMs may produce formulaic content when generating creative materials, such as situational judgment items or open-ended questions, often remaining at the surface level of textual concepts. This limitation is particularly evident in psychometric tasks that require diversity and complexity to capture individual differences. Furthermore, LLMs currently lack the capacity to create new constructs. Psychological research often necessitates the proposal of new psychological constructs to evaluate traits that are not yet fully understood; however, LLMs can only generate content based on established linguistic patterns and cannot transcend their training data. Whether LLMs possess true creativity remains an open research question \cite{Zhao, 2024b}, which limits their application in developing entirely new assessment tools. To address these issues, researchers have proposed using multi-agent systems.

By employing multiple agents to play different roles in discussion, researchers can simulate the way humans form collective creativity to enhance the innovative capabilities of LLMs \cite{2024}. Alternatively, associative thinking strategies can be utilized to help LLMs improve their ability to integrate disparate concepts \cite{Mehrotra 2024}. These approaches may eventually empower new conceptual frameworks within the field of psychometrics.
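A minimal sketch of such a multi-agent discussion loop is given below; the roles, prompts, and `ask_llm` helper are illustrative assumptions rather than the cited authors' implementation.

```python
def ask_llm(system: str, prompt: str) -> str:
    """Placeholder single-turn chat call to any LLM API."""
    raise NotImplementedError

ROLES = ["pragmatic psychometrician", "contrarian theorist", "field practitioner"]

def brainstorm(task: str, rounds: int = 2) -> list[str]:
    transcript: list[str] = []
    for _ in range(rounds):
        for role in ROLES:
            recent = "\n".join(transcript[-len(ROLES):])  # show only the last round
            idea = ask_llm(
                system=f"You are a {role}. Challenge or extend the prior ideas.",
                prompt=f"Task: {task}\nDiscussion so far:\n{recent}",
            )
            transcript.append(f"[{role}] {idea}")
    return transcript  # a human (or a further LLM pass) then selects and merges ideas
```

The design choice here mirrors the collective-creativity argument: forcing distinct personas to react to one another widens the spread of generated ideas beyond what a single sampling run typically produces.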

3.3 Scalability Issues of LLMs

The scalability of generative Large Language Models (LLMs) remains one of the primary constraints on their practical application. Although LLMs have demonstrated powerful capabilities in text generation and natural language processing, extending them to psychometrics still faces significant challenges. Specifically, there are limitations in adapting to new constructs. Psychometric constructs continuously evolve alongside new psychological theories and empirical research, whereas LLMs typically rely on pre-existing datasets for training. This reliance makes them inflexible when processing or adapting to data outside the original sample distribution, and it is difficult for these models to effectively integrate new psychometric constructs, restricting their application in the dynamically evolving field of psychology. Although some studies have validated the reliability and validity of scales developed with LLMs around new constructs \cite{Hoffmann2024, Ouedraogo2024}, researchers must still possess strong critical thinking skills and professional expertise to provide appropriate guidance at every step of scale construction. Current research lacks a comprehensive evaluation of applications such as automated item generation and subject simulation across different testing domains, and their psychometric properties and content reliability have not yet been extensively verified \cite{Circi2023, 2022}.

Scalability issues are also reflected in the integration of multimodal data. While LLMs can analyze and output multimodal data, their understanding is often shallow: they fail to fully capture the complex relationships between different data modalities, which still require manual pre-alignment. In other words, the current application of multimodal LLMs in psychometrics is insufficient to fully mine and exploit the potential of multimodal data, which limits their use in complex psychometric tools. Furthermore, adaptability across cultural and linguistic contexts poses a challenge. Psychometric tools often require cross-cultural application and must therefore handle linguistic and psychological constructs from diverse cultural backgrounds. However, the training data for LLMs may be biased toward specific cultures or languages, leading to a marked deficiency in their scalability and adaptability in cross-cultural settings. This limitation may introduce bias into psychometric tools in different cultural environments, reducing their validity and universality \cite{Huang2024, 2023}. Additionally, psychometrics covers multiple domains, such as cognitive assessment, emotional measurement, and social evaluation.

When extending to these diverse domains, LLMs may fail to maintain consistent performance and accuracy, as their generalization capabilities remain limited \cite{2024}.

This requires researchers to fine-tune models according to the specific needs of a given field. At the same time, the mathematical calculation and reasoning abilities of current LLMs remain suboptimal: when LLMs are used to assist in generating items that measure higher-order cognitive functions, the models may exhibit logical confusion and internal contradictions. Research indicates that causal knowledge graphs and Chain-of-Thought (CoT) techniques can help LLMs execute tasks with more coherent logic and discover latent associations between concepts \cite{Tong2024}. OpenAI's o1 model represents a preliminary exploration in this direction; by applying CoT techniques, it has significantly improved logical reasoning and computational capabilities, although its computational requirements are so high that it remains difficult to deploy widely. In summary, the scalability issues of LLMs present multifaceted challenges for their application in psychometrics. To achieve truly widespread application, further progress is needed in model flexibility, multimodal data understanding, multi-domain extensibility, and cultural adaptability.
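To illustrate the CoT-style item drafting mentioned above, the sketch below forces the model to externalize intermediate reasoning steps before emitting an item. The prompt wording and the `ask_llm` helper are illustrative assumptions, not a prescription from the cited work.

```python
def ask_llm(system: str, prompt: str) -> str:
    raise NotImplementedError  # any chat-completion client

COT_ITEM_PROMPT = """Target construct: {construct}
Step 1: List the reasoning sub-skills the item must elicit.
Step 2: Draft a scenario that requires all sub-skills jointly.
Step 3: Write the item stem and four options, exactly one of which is correct.
Step 4: Check, step by step, that each distractor fails one specific sub-skill.
Only after completing all four steps, output the final item."""

def draft_reasoning_item(construct: str) -> str:
    return ask_llm(system="You are an assessment developer. Think step by step.",
                   prompt=COT_ITEM_PROMPT.format(construct=construct))
```

Making the model verify each distractor against a named sub-skill is one cheap way to surface the logical contradictions that otherwise slip into higher-order items.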

3.4 Ethical Considerations, Data Security, and Cost Issues

The application of Large Language Models (LLMs) in the field of psychometrics requires a cautious approach toward the resulting ethical, security, and cost-related challenges. First, because LLMs are derived from pre-training corpora, they inevitably inherit human cultural and racial biases \cite{Chen2024, Taubenfeld2024}. If these models are applied to psychological assessment without rigorous auditing, their outputs may produce systematic injustices against specific groups. For example, a model trained primarily on Western-centric corpora may produce inaccurate or even erroneous judgments when evaluating the psychological states of individuals from non-Western cultural backgrounds \cite{Sakai2025}. Consequently, it is crucial to perform bias detection and calibration before application and to ensure the diversity and representativeness of the training data.

Second, data privacy and security remain paramount concerns. Psychometric assessment, particularly in the context of mental health diagnosis and treatment, involves a vast amount of highly sensitive personal privacy data. When utilizing cloud-based LLM services, the risk of data leakage during transmission and storage cannot be ignored \cite{Lawrence2024}. Furthermore, data used for fine-tuning models may be vulnerable to extraction or misuse. Ensuring anonymization throughout the entire data pipeline and developing open-source models that can be deployed locally are critical pathways for safeguarding participant privacy.

Additionally, high costs represent a practical bottleneck to the widespread application of LLMs in psychometrics. Whether invoking the APIs of top-tier closed-source models for large-scale data processing or fine-tuning and deploying open-source models, substantial computational resources and financial investment are required. Currently, open-source models built on more efficient architectures such as Mixture-of-Experts (MoE), including DeepSeek and GPT-oss, achieve performance comparable to large closed-source models at significantly lower training and deployment costs. These advancements provide new possibilities for reducing the barriers to entry for psychometric applications.

4 Future Prospects for Generative AI-Enabled Psychometrics

Building upon the comprehensive discussion of the significant advantages and technical challenges of generative AI, this section proposes several key potential applications for empowering psychometrics. These include the generation of Situational Judgment Tests (SJTs), the assessment of Collaborative Problem Solving (CPS) skills, intelligent diagnosis and treatment for mental health, and the analysis of item quality. These applications aim to provide new insights and directions for the future development of psychometric research and practice.

Situational Judgment Test Generation

Generative AI can significantly enhance the development of Situational Judgment Tests (SJTs) by automating the creation of complex, realistic scenarios. Traditional SJT development is often labor-intensive and time-consuming; however, large language models can be leveraged to generate diverse social and professional contexts tailored to specific psychological constructs. By ensuring that these scenarios maintain high ecological validity while adhering to psychometric standards, researchers can more efficiently measure practical intelligence and behavioral tendencies across various domains.

Assessment of Collaborative Problem Solving (CPS)

The assessment of Collaborative Problem Solving (CPS) represents a critical frontier in modern psychometrics. Generative models can serve as standardized "virtual partners" or "agents" within a collaborative environment, allowing for a controlled yet dynamic evaluation of an individual's ability to communicate, manage conflict, and synchronize efforts with others. This approach addresses the inherent difficulty of standardizing human-to-human interactions in testing environments, providing a more reliable and scalable framework for measuring interpersonal and cognitive competencies simultaneously.

Intelligent Diagnosis and Treatment in Mental Health

In the realm of mental health, generative AI offers transformative potential for both diagnosis and therapeutic intervention. By analyzing natural language patterns and behavioral data, these systems can assist clinicians in identifying subtle markers of psychological distress or cognitive decline. Furthermore, generative models can facilitate "smart" therapeutic interactions, such as personalized cognitive-behavioral interventions or supportive dialogue systems. These tools can provide continuous monitoring and real-time support, bridging the gap between traditional clinical visits and the daily needs of patients.

Item Quality Analysis and Optimization

Finally, generative AI can be applied to the rigorous analysis and optimization of test items. Beyond simple automated scoring, these models can be used to predict item difficulty, identify potential biases, and detect linguistic ambiguities that might interfere with measurement accuracy. By simulating how different demographic groups or ability levels might respond to a given item, researchers can refine assessments during the development phase. This proactive approach to item quality analysis helps ensure that psychometric instruments are both robust and equitable.

4.1 Situational Judgment Tests: LLM-Based Item Generation

LLMs, by integrating vast amounts of textual data, possess a deep understanding of the psychological traits found across broad populations, enabling the creation of items that align with specific personality theory frameworks. For example, when designing Big Five personality tests, new items can be created around the existing personality dimensions of Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism, ensuring that these items accurately reflect subjects' personality traits while maintaining high reliability and validity \cite{Gotz 2023}. Beyond self-report personality tests, the generative capabilities of these models are even more valuable in the development of Situational Judgment Tests (SJTs). SJTs are assessment tools commonly used in personnel selection and are well suited to predicting an individual's job performance. They evaluate decision-making abilities and behavioral tendencies by presenting candidates with a series of simulated work-related scenarios and requiring them to select the most appropriate response strategy or behavior \cite{Burrus 2012}. Currently, SJTs in personnel selection face problems of non-reusability due to item exposure, and the formulation of SJT items is heavily dependent on domain experts and rigorous development processes.

By learning from massive amounts of textual data, models have acquired the ability to simulate and role-play the behavioral manifestations associated with different personality traits, a capability validated in previous research \cite{Hewitt 2024; 2025; 2025}. The ability to simulate how individuals with different internal traits behave across situations can, in turn, assist in generating reliable situational judgment items. In the generation of cognitive and personality test items, this approach has been shown to yield reliability and validity comparable to, or even exceeding, those of items developed by humans \cite{Laverghetta Licato, 2023; 2025}. In practice, SJT development generally involves several steps: scenario selection and item formulation, development of typical behavioral response options, and the design of scoring rules for those options \cite{McDaniel 2007}. For scenario selection and item formulation, psychologists or other domain professionals typically determine scenarios based on experience and literature reviews, and experts then write items for the selected scenarios. This process can require substantial manual input, including reading relevant materials, intensive discussion, and repeated revision. The latent knowledge within generative models can support the generation of diverse content for each of these steps.

By generating simulated scenarios that closely mirror actual work environments, LLMs allow experts to act as evaluators of the generated content rather than its authors, reducing their workload and improving efficiency. In the development and scoring of typical behavioral response options, each scenario may require multiple choices, and test developers must design and draft these precisely so that the options reflect different behavioral tendencies; the scoring of response options, moreover, relies heavily on domain experts' understanding and judgment of the scenarios. Leveraging its grasp of the internal differences between traits and its capacity for diverse generation, the model can produce a large number of eligible response options against specified scoring criteria. Combined with expert experience, it can then select combinations of options that appear equally attractive across different scoring levels, enabling the efficient completion of a preliminary SJT. It is worth noting that tests generated in this manner still require data collected through actual administration to complete the validation of test quality. A schematic of this drafting workflow follows below.
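The sketch below summarizes the drafting pipeline under stated assumptions: `llm` stands for any chat-completion call, the prompts are illustrative, and every generated scenario and option set still goes to human experts and pilot administration afterwards.

```python
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    raise NotImplementedError  # plug in any chat-completion client

@dataclass
class SJTItem:
    scenario: str
    options: dict[str, list[str]] = field(default_factory=dict)  # scoring level -> options

def draft_sjt(construct: str, n_scenarios: int = 5) -> list[SJTItem]:
    items = []
    for i in range(n_scenarios):
        scenario = llm(
            f"Write a realistic workplace scenario (no. {i + 1}) that elicits "
            f"individual differences in {construct}."
        )
        item = SJTItem(scenario=scenario)
        for level in ("high", "medium", "low"):
            raw = llm(
                f"Scenario: {scenario}\n"
                f"Write two response options a {level}-{construct} employee would "
                "plausibly choose. Options across levels must look equally attractive."
            )
            item.options[level] = [line for line in raw.splitlines() if line.strip()]
        items.append(item)
    return items  # expert review and pilot data collection follow
```

Generating options per scoring level, rather than all at once, makes it easier to enforce the constraint that options at different levels are equally socially desirable.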

4.2 Ability Testing: LLM-Based Assessment of Collaborative Problem Solving

Traditional competency assessment focuses on evaluating an individual's cognitive abilities and professional skills. Assessment tools typically include formats such as multiple-choice questions, situational simulations, and computer-based interactive tests \cite{2021}. Although these methods demonstrate a certain degree of validity in assessing individual capabilities, they have significant limitations in simulating real-world collaborative task scenarios. Even as artificial intelligence has made interactive functions in competency testing increasingly powerful, examinees' interaction partners remain primarily machine agents governed by fixed logic. While this facilitates the standardization of the assessment process, it struggles to effectively engage an individual's communication and collaboration skills.

The emergence of Large Language Models (LLMs) helps to address this situation, as they can effectively assume roles defined by researchers and engage in free-form dialogue with examinees \cite{Jandaghi2023}. Specifically, researchers have already designed prompt frameworks to simulate human brainstorming processes, thereby participating in creative problem-solving \cite{Chang2024}. With the continuous improvement of problem-solving capabilities, LLM-based agents no longer merely serve as human tools; instead, they can act as standardized partners participating in defined problem-solving processes. This allows for the externalization and better measurement of an examinee's collaboration-related competencies, such as communication skills and collaborative problem-solving. In traditional human-to-human collaborative assessment paradigms, the behavioral variability of human assessors introduces additional measurement error \cite{Biswas2010, Stadler2020}. LLM agents, through pre-established task strategies, can exhibit more stable and controllable behavioral patterns during the collaboration process, thereby reducing systematic error stemming from the assessment partner to a certain extent. Furthermore, compared to fixed-logic human-machine cooperation paradigms, this approach more closely aligns with real-world collaborative scenarios and possesses higher ecological validity.

Implementing assessment within LLM-based collaborative problem-solving requires considering two aspects: first, how to design specific situational assessment tasks, and second, how to achieve automated competency evaluation. To ensure the validity of the assessment, the design of situational tasks must not be overly simplistic; they should not be easily completed by an individual working independently. Task design must account for openness and ambiguity, allowing for multiple potential solutions to comprehensively examine the examinee's decision-making ability and adaptability under uncertain conditions. This approach fully mobilizes the examinee's willingness to collaborate and requires them to repeatedly judge, select, or revise the LLM's output. Through free interaction with the LLM, examinees can become deeply immersed in solving the set problem, thereby producing behavioral responses that reflect their true underlying competencies. Automated competency evaluation requires a comprehensive consideration of the relationship between the multimodal data interaction formats and the multidimensional competencies activated during the collaborative problem-solving process. The data interaction process may involve various modalities such as text, images, and audio, while the competencies should encompass two major dimensions: problem-solving and communication/collaboration. By synthesizing these multimodal data, a comprehensive competency profile can be constructed, breaking through the limitations of previous assessments to provide more accurate and detailed results for practical applications such as personnel selection.
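The sketch below illustrates how an LLM can serve as such a standardized partner: a fixed system prompt pins down its collaboration strategy so that every examinee interacts with the same agent, and each examinee turn is logged for subsequent competency scoring. The function names and prompt are assumptions for illustration only.

```python
PARTNER_SYSTEM = (
    "You are a teammate in a joint planning task. Follow a fixed strategy: "
    "share at most one piece of information per turn, ask one clarifying "
    "question, and never propose the final solution yourself."
)

def chat(system: str, history: list[dict]) -> str:
    raise NotImplementedError  # any chat-completion call

def run_cps_session(task: str, max_turns: int = 10) -> list[dict]:
    history = [{"role": "user", "content": f"Task: {task}"}]
    log = []  # examinee turns are kept for later multidimensional scoring
    for _ in range(max_turns):
        partner_turn = chat(PARTNER_SYSTEM, history)
        history.append({"role": "assistant", "content": partner_turn})
        examinee_turn = input("> ")  # in practice, captured by the test platform
        history.append({"role": "user", "content": examinee_turn})
        log.append({"partner": partner_turn, "examinee": examinee_turn})
    return log
```

Because the partner's strategy is fixed in the prompt rather than emerging from another human, variance in the interaction can be attributed more cleanly to the examinee, which is precisely the standardization benefit discussed above.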

4.3 Mental Health Testing: LLM-Based Intelligent Diagnosis and Treatment

Mental health testing is a critical application within psychometrics, used to evaluate aspects of an individual's psychological well-being such as depression, anxiety, and stress. Traditional methods, such as self-report questionnaires, suffer from significant subjectivity; in an era of rapid information flow in particular, intentional faking driven by abnormal motivations poses substantial challenges to accurate mental health diagnosis. Clinical interviews and behavioral observations offer higher validity but are difficult to implement in large-scale screenings \cite{2022}. LLMs' ability to process complex textual inputs, pose targeted questions and feedback, and identify implicit emotional states has created new opportunities for mental health testing.

The transition of mental health testing toward intelligent diagnosis and treatment can be advanced through two primary avenues: interview-based testing formats and psychological state assessments based on continuous expression. Specifically, current mental health testing in China is primarily utilized for population screening, where individuals identified as "at-risk" are further evaluated by professional psychological counselors \cite{2018}. To ensure that no at-risk individuals are overlooked, this process typically requires the participation of a large number of practitioners. However, the supply of professional counselors often falls short of the demand for follow-up services after large-scale screenings. In this context, these systems can serve as auxiliary tools by simulating the role of a counselor, engaging in natural dialogue with subjects, and guiding them to express their inner thoughts \cite{Chen2023}. They can be trained to apply specific counseling techniques, such as cognitive restructuring, or other therapeutic interventions.

By generating highly empathetic responses, these systems help clients overcome communication barriers caused by shame or distrust \cite{Xiao2024}, thereby creating a safe environment for expression. This automated interview approach not only alleviates the workload of professionals but also enables the collection of vast amounts of valuable real-time data without disrupting the flow of conversation. Furthermore, regular online follow-ups hold the potential for the timely detection of psychological changes in subjects. Regarding psychological state assessment, subjects no longer provide feedback and scores in a standardized format as they do with traditional self-report scales; instead, they reveal their true states and underlying triggers through personalized Q&A processes. Analyzing the continuous expressions of individuals during dialogues with large language models will yield more precise and detailed mental health diagnostic results, while providing personalized recommendations and support for both subjects and interventionists.

Currently, mental health assessment systems have been developed that enable models to acquire the interview frameworks and diagnostic criteria used in psychiatric evaluations. By mimicking the way clinicians conduct psychiatric interview assessments on patients, these systems have demonstrated a high level of consistency between their evaluative results and those of professional psychiatrists \cite{2025}.

4.4 Item Evaluation: LLM-Based Item Quality Analysis

Item evaluation is a critical component of psychometrics, aimed at ensuring the validity and reliability of assessment tools. Traditional methods for evaluating test items typically rely on expert review and statistical analysis; however, these approaches are often time-consuming and prone to subjective bias \cite{2013}. Large Language Models (LLMs) have been employed for role-playing in numerous applications with promising results \cite{2024; 2024}. By simulating domain experts or individual examinees with varying ability levels, LLMs can facilitate automated item quality analysis, thereby enhancing both the efficiency and accuracy of the evaluation process.

LLMs simulating domain experts can automatically assess the content quality of test items, including linguistic clarity, logical consistency, and difficulty levels. By utilizing multiple general-purpose or domain-specifically trained LLMs, researchers can calculate inter-rater reliability in a manner similar to traditional expert reviews. Selecting high-scoring items through this process reduces the heavy reliance on human experts during item evaluation. Furthermore, LLMs can simulate individual examinees from diverse cultural backgrounds and ability levels through role-playing. Theoretically, by generating responses that match corresponding representative populations and applying methods such as Item Response Theory (IRT), it is possible to preliminarily estimate parameters like item difficulty and discrimination. This provides a rapid, low-cost pre-testing means for item quality analysis \cite{Wang, 2024}. Regarding the generation of simulated subjects, several challenges remain. First, in terms of cultural background, biases in the training corpora of mainstream LLMs make it difficult for them to grasp the deep-seated values and expression habits of specific cultures, potentially leading to biased generated responses. Second, regarding age, models often lack sufficient modeling of the linguistic styles and cognitive characteristics of children and adolescents, making it difficult to authentically replicate their answering patterns. Finally, in terms of ability, models tend to generate logically coherent responses, while their capacity to simulate the specific knowledge gaps and error patterns typical of low-ability examinees may be insufficient.
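As a rough sketch of the IRT-based pre-testing idea, the code below fits a per-item logistic curve to responses from LLM-simulated examinees, using the standardized total score as a crude ability proxy. This is a 2PL-style surrogate for illustration (the `simulate_responses` step, where the LLM role-plays respondents, is left abstract), not a substitute for proper IRT calibration on real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def simulate_responses(n_subjects: int, n_items: int) -> np.ndarray:
    """Placeholder: prompt an LLM to role-play respondents of varying ability
    and return a binary response matrix of shape (n_subjects, n_items)."""
    raise NotImplementedError

def twopl_proxy(responses: np.ndarray) -> list[tuple[float, float]]:
    totals = responses.sum(axis=1).astype(float)
    theta = (totals - totals.mean()) / totals.std()           # crude ability proxy
    params = []
    for j in range(responses.shape[1]):
        clf = LogisticRegression().fit(theta.reshape(-1, 1), responses[:, j])
        a = clf.coef_[0, 0]                                    # discrimination
        b = -clf.intercept_[0] / a                             # difficulty
        params.append((a, b))
    return params
```

Reparameterizing the fitted logistic as P(correct) = sigmoid(a(theta - b)) recovers rough discrimination and difficulty values, enough to flag items that are degenerate before any human pilot testing.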

The use of simulated subjects as a tool for item analysis is currently still in an exploratory stage. While their results cannot replace data from real human participants, they serve as a beneficial supplement to expert review and as a tool for early-stage item analysis. Empowering item quality analysis through LLMs helps ensure that test items are suitable for a broad range of examinees, thereby improving the fairness and representativeness of assessment instruments.

5 Conclusion

This paper explores the opportunities and challenges that generative Large Language Models (LLMs) bring to the field of psychometrics. It emphasizes their significant advantages in transforming test interaction methods, enhancing multimodal data processing capabilities, and broadening scoring techniques. While technical challenges such as stability, reliability, and generalizability remain, LLMs have already demonstrated broad prospects in areas such as the generation of situational judgment tests, the assessment of collaborative problem-solving, intelligent diagnosis and treatment in mental health, and the analysis of test item quality. With continuous technical optimization—such as improving model performance through instruction fine-tuning and consistency calibration training—LLMs will continue to drive psychometrics toward greater intelligence, personalization, and efficiency.

Figure: The core framework of LLMs empowering psychometrics.

References

(2018). Development of the Mental Health Screening Scale for Chinese college students. Studies of Psychology and Behavior.
(2022). New approaches to AI-assisted mental health assessment. Advances in Psychological Science, (01).
(2025). The influence of good and evil personality roles on the moral judgment of large language models. 57(6).
(2021). New trends in educational evaluation: A review of research on intelligent assessment. Modern Distance Education Research, (05).
(2025). The validity of large language models in simulating regional psychological structures: Personality and well-being. 48(4).
(2013). Application of item response theory in evaluating the quality of items in large-scale selective examinations. (01).
(2018). Predicting students' reasoning ability and mathematical achievement via log-file analysis: Applications of machine learning.
(2020). Development of psychological and educational measurement in China. Bilingual Journal of Educational Measurement and Evaluation.
Acheampong, Nunoo-Mensah, & Chen (2021). Transformer models for text-based emotion detection: A review of BERT-based approaches. Artificial Intelligence Review.
Ahn, Verma, Lou, Liu, Zhang, & Yin (2024). Large language models for mathematical reasoning: Progresses and challenges. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop. Association for Computational Linguistics.
Alomran, & Chai (2018). Automated scoring system for multiple-choice questions with quick feedback. International Journal of Information and Education Technology.
Antypas, Preece, & Camacho-Collados (2023). Negativity spreads faster: A large-scale multilingual Twitter analysis of sentiment in political communication. Online Social Networks and Media.
Asudani, Nagwani, & Singh (2023). Impact of word embedding models on text analytics in deep learning environment: A review. Artificial Intelligence Review.
Bai, Voelkel, Muldowney, Eichstaedt, & Willer (2025). LLM-generated messages can persuade humans on policy issues. Nature Communications.
Bellemare-Pepin, Lespinasse, Harel, Mathewson, Olson, Bengio, & Jerbi (2024). Divergent creativity in humans and large language models. arXiv preprint arXiv:2405.13012.
Bergner, Dröschler, Kortemeyer, Rayyan, Seaton, & Pritchard (2012). Model-based collaborative filtering analysis of student response data: Machine-learning item response theory. Proceedings of the International Conference on Educational Data Mining. International Educational Data Mining Society.
Berry (2008). Traditional assessment: Paper-and-pencil tests. In Assessment for Learning. Hong Kong University Press.

Chen, Wang, Xiao, Zhang, Huang, Chen, Peng, Feng, & Huang (2025). MAGI: Multi-agent guided interview for psychiatric assessment. arXiv preprint.
Biswas, Jeong, Kinnebrew, Sulcer, & Roscoe (2010). Measuring self-regulated learning skills through social interactions in a teachable agent environment. Research and Practice in Technology Enhanced Learning.
Boake (2002). From the Binet-Simon to the Wechsler-Bellevue: Tracing the history of intelligence testing. Journal of Clinical and Experimental Neuropsychology.
Brady, McLoughlin, Doan, & Crockett (2021). How social learning amplifies moral outrage expression in online social networks. Science Advances, 7(33), eabe5641.
Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Winter, ... Amodei (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems.
Buongiorno, Klinkert, Chawla, Zhuang, & Clark (2024). PANGeA: Procedural artificial narrative using generative AI for turn-based video games. arXiv preprint.
Chan, Park, Sham, Chong, & Qian (2025). Automatic item generation for various subjects using large language model prompting. Computers and Education: Artificial Intelligence.
Chang, H.-F. (2024). A framework for collaborating with a large language model in brainstorming for triggering creative thoughts. arXiv preprint.
Chen, Chen, Jiang, & Wang (2024). Humans or LLMs as the judge? A study on judgement biases. arXiv preprint.
Chen, Zaharia, & Zou (2023). How is ChatGPT's behavior changing over time? arXiv preprint.
Chen (2020). A continuous-time dynamic choice measurement model for problem-solving process data. Psychometrika.
Chen, & Ying (2019). Statistical analysis of complex problem-solving process data: An event history analysis approach. Frontiers in Psychology.
Chen, & Zhou (2019). Research on automatic essay scoring of composition. International Conference on Artificial Intelligence and Big Data (ICAIBD).
Chiu, Sharma, & Althoff (2024). A computational framework for behavioral assessment of LLM therapists. arXiv preprint arXiv:2401.00820.
Cho, S.-J., Brown-Schmidt, De Boeck, & Shen (2020). Modeling intensive polytomous time-series eye-tracking data: A dynamic tree-based item response model. Psychometrika.
Circi, Hicks, & Sikali (2023). Automatic item generation: Foundations and machine learning-based approaches for assessments. Frontiers in Education.
Zhao, & Hong (2024). The application and effect of bag-of-words models and TF-IDF in AI-based tasks. Journal of Artificial Intelligence General Science (JAIGS).
Pang, & Dong (2024). Unfairness in information retrieval systems: New challenges in the LLM era. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
Dao, Fu, Ermon, Rudra, & Ré (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems.

Debnath, O'Brien, Yamaguchi, & Behera (2022). A review of computer vision-based approaches for physical rehabilitation and assessment. Multimedia Systems.
Deerwester, Dumais, Landauer, Furnas, & Harshman (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science.
Demszky, Hill, Jurafsky, & Piech (2024). Can automated feedback improve teachers' uptake of student ideas? Evidence from a randomized controlled trial in a large-scale online course. Educational Evaluation and Policy Analysis.
Demszky, Yang, Yeager, Bryan, Clapper, Chandhok, Eichstaedt, Hecht, Jamieson, Johnson, et al. (2023). Using large language models in psychology. Nature Reviews Psychology, (11).
DeRosier, & Thomas (2019). Hall of Heroes: A digital game for social skills training with young adolescents. International Journal of Computer Games Technology.
Devine, Kovatchev, Grumley Traynor, & Smith (2023). Machine learning and deep learning systems for automated measurement of advanced theory of mind: Reliability and validity in children and adolescents. Psychological Assessment.
Dipietro, Sabatini, & Dario (2008). A survey of glove-based systems and their applications. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).
Dong, Wang, Jiang, & Zhang (2024). EmoAda: A multimodal emotion interaction and psychological adaptation system. In Rudinac, Hanjalic, Liem, Worring, Jónsson, & Yamakata (Eds.), MultiMedia Modeling. Springer Nature Switzerland.
Zheng, Nakamura, & Chen (2025). Privacy in fine-tuning large language models: Attacks, defenses, and future directions. In Spiliopoulou, Wang, & Kumar (Eds.), Advances in Knowledge Discovery and Data Mining. Springer Nature.
Eichstaedt, Smith, Merchant, Ungar, Crutchley, Preoţiuc-Pietro, Asch, & Schwartz (2018). Facebook language predicts depression in medical records. Proceedings of the National Academy of Sciences, 115(44).
Ersozlu, Taheri, & Koch (2024). A review of machine learning methods for educational data. Education and Information Technologies.
Fernandez, Ghosh, Wang, Choffin, & Baraniuk (2022). Automated scoring for reading comprehension via in-context BERT tuning. In Rodrigo, Matsuda, Cristea, & Dimitrova (Eds.), Artificial Intelligence in Education (Vol. 13355). Springer International Publishing.
Foltz, Chandler, Diaz-Asper, Cohen, Rodriguez, Holmlund, & Elvevåg (2023). Reflections on the nature of measurement in language-based automated assessments of patients' mental state and cognitive function. Schizophrenia Research.
Gabbay, & Cohen (2024). Combining LLM-generated and test-based feedback in programming education. Proceedings of the Eleventh ACM Conference on Learning @ Scale.
Goretzko, & Bühner (2022). Machine learning modeling and optimization techniques in psychological assessment. Psychological Assessment.
Götz, Maertens, Loomba, & van der Linden (2023). Let the algorithm speak: How to use neural networks for automatic item generation in psychological scale development. Psychological Methods.

Guzik, Byrge, & Gilde (2023). The originality of machines: AI takes the Torrance Test. Journal of Creativity.
Haizel, Vernanda, Wawolangi, & Hanafiah (2021). Personality assessment video interview based on the five-factor model. Procedia Computer Science.
He, Dong, Song, Liang, Wang, Zhang, & Zhang (2024). Never lost in the middle: Mastering long-context question answering with position-agnostic decompositional training. arXiv preprint.
He, Borgonovi, & Paccagnella (2021). Leveraging process data to assess adults' problem-solving skills: Using sequence mining to identify behavioral patterns across digital tasks. Computers & Education.
Hewitt, Ashokkumar, Ghezae, & Willer (2024). Predicting results of social science experiments using large language models. Preprint.
Hoffmann, Lasarov, & Dwivedi (2024). AI-empowered scale development: Testing the potential of ChatGPT. Technological Forecasting and Social Change.
Hommel, Wollang, Kotova, Zacher, & Schmukle (2022). Transformer-based deep neural language modeling for construct-specific automatic item generation. Psychometrika.
(2024). Developing an AI-based psychometric system for assessing learning difficulties and an adaptive system to overcome them: A qualitative conceptual framework. arXiv preprint.
Huang, & Chang (2023). Towards reasoning in large language models: A survey. arXiv preprint.
Huang, Yu, Ma, Zhong, Feng, Wang, Chen, Peng, Feng, Qin, & Liu (2025). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems.
Huang, Wang, Zhang, Huang, Zhang, Wang, Zhang, Vidgen, Kailkhura, Xiong, Xiao, Zhao, et al. (2024). TrustLLM: Trustworthiness in large language models. arXiv preprint.
Huawei, & Aryadoust (2023). A systematic review of automated writing evaluation systems. Education and Information Technologies.
Hubert, Awa, & Zabelina (2024). The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks. Scientific Reports.
Janiesch, Zschech, & Heinrich (2021). Machine learning and deep learning. Electronic Markets.
Jatnika, Bijaksana, & Suryani (2019). Word2Vec model analysis for semantic similarities in English words. Procedia Computer Science.
Ji, Yu, Xu, Ishii, & Fung (2023). Towards mitigating LLM hallucination via self reflection. In Bouamor, Pino, & Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics.
Jose, Matero, Sherman, Curtis, Giorgi, Schwartz, & Ungar (2022). Using Facebook language to predict and describe excessive alcohol use. Alcoholism: Clinical and Experimental Research.
Luchini, Linell, Reiter-Palmon, & Beaty (2024). The creative psychometric item generator: A framework for item generation and validation using large language models. arXiv preprint.

Ke, Tong, Cheng, & Peng (2025). Exploring the frontiers of LLMs in psychological applications: A comprehensive review. Artificial Intelligence Review, (10).
Kharitonova, Pérez-Fernández, Gutiérrez-Hernando, Gutiérrez-Fandiño, Callejas, & Griol (2024). Incorporating evidence into mental health applications: A novel method for generative language models with validated clinical content extraction. Behaviour & Information Technology.
Kim, Almond, & Shute (2016). Applying evidence-centered design for the development of game-based assessments in Physics Playground. International Journal of Testing.
Landauer, Foltz, & Laham (1998). An introduction to latent semantic analysis. Discourse Processes.
Laverghetta, & Licato (2023). Generating better items for cognitive assessments using large language models. In Kochmar, Burstein, Horbach, Laarmann-Quante, Madnani, Tack, Yaneva, Yuan, & Zesch (Eds.), Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023). Association for Computational Linguistics.
Lawrence, Schneider, Rubin, Matarić, McDuff, & Bell (2024). The opportunities and risks of large language models in mental health. JMIR Mental Health, e59479.
Lazos, Poovendran, & Čapkun (2005). ROPE: Robust position estimation in wireless sensor networks. Fourth International Symposium on Information Processing in Sensor Networks.
(2022). Developing an AI-based chatbot for practicing responsive teaching in mathematics. Computers & Education.
Lee, G.-G., Latif, & Zhai (2024). Applying large language models and chain-of-thought for automatic scoring. Computers and Education: Artificial Intelligence.
C.-J., Zhang, & Tang (2025). Automatic generation of personality situational judgment tests with large language models. arXiv preprint.
Zhang, Y.-C., Kraut, & Mohr (2023). Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. npj Digital Medicine.
Liang, On, Jeong, Kim, & Choi (2018). Automated essay scoring: A Siamese bidirectional LSTM neural network architecture. Symmetry, (12).
Liao, & Jiao (2023). Modelling multiple problem-solving strategies and strategy shift in cognitive diagnosis for growth. British Journal of Mathematical and Statistical Psychology.
Lin, Girard, Sayette, & Morency (2020). Toward multimodal modeling of emotional expressiveness. Proceedings of the International Conference on Multimodal Interaction.
Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni, & Liang (2023). Lost in the middle: How language models use long contexts. arXiv preprint.
Liu, Brew, Blackmore, Gerard, Madhok, & Linn (2014). Automated scoring of constructed-response science items: Prospects and obstacles. Educational Measurement: Issues and Practice.
Bhandari, & Pardos (2025). Leveraging LLM respondents for item evaluation: A psychometric analysis. British Journal of Educational Technology.
Lu, L.-C., Chen, S.-J., Pai, T.-M., Yu, C.-H., Lee, & Sun, S.-H. (2024). LLM discussion: Enhancing the creativity of large language models via discussion framework and role-play. arXiv preprint.
Lu, & Wang (2024). Generative students: Using LLM-simulated student profiles to support question item evaluation. Proceedings of the Eleventh ACM Conference on Learning @ Scale.
Ludwig, Mayer, Hansen, Eilers, & Brandt (2021). Automated essay scoring using transformer models. Psych.
Ma, & Zhang (2015). Using Word2Vec to process big text data. IEEE International Conference on Big Data.

(2019). Cognitive diagnosis models for multiple strategies. British Journal of Mathematical and Statistical Psychology.
Ma, Yang, & Kästner (2024). (Why) is my prompt getting worse? Rethinking regression testing for evolving LLM APIs. Proceedings of the IEEE/ACM International Conference on AI Engineering - Software Engineering for AI.
Majumder, Poria, Gelbukh, & Cambria (2017). Deep learning-based document modeling for personality detection from text. IEEE Intelligent Systems.
Harring, & Zhan (2022). Bridging models for biometric and psychometric assessment: A three-way joint modeling approach for responses, response times, and fixation counts. Applied Psychological Measurement.
Martínez-Plumed, Prudêncio, Martínez-Usó, & Hernández-Orallo (2016). Making sense of item response theory in machine learning. Proceedings of the Twenty-Second European Conference on Artificial Intelligence (ECAI).
Matarazzo (1992). Psychological testing and assessment in the 21st century. American Psychologist.
Mavridis, & Tsiatsos (2017). Game-based assessment: Investigating the impact on test anxiety and exam performance. Journal of Computer Assisted Learning.
McKenna (2019). Multiple-choice questions: Answering correctly and knowing the answer. Interactive Technology and Smart Education.
Mehrotra, Parab, & Gulwani (2024). Enhancing creativity in large language models through associative thinking strategies. arXiv preprint.
Memon, Sami, Khan, & Uddin (2020). Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR). IEEE Access.
Meyer, Jansen, Schiller, Liebenow, Steinbach, Horbach, & Fleckenstein (2024). Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students' text revision, motivation, and positive emotions. Computers and Education: Artificial Intelligence.
Obrenovic, & Starcevic (2004). Modeling multimodal human-computer interaction. Computer.
OECD (2013). PISA 2012 assessment and analytical framework: Mathematics, reading, science, problem solving and financial literacy. Organisation for Economic Co-operation and Development.
OECD (2017). PISA 2015 assessment and analytical framework: Science, reading, mathematics, financial literacy and collaborative problem solving. Organisation for Economic Co-operation and Development.
Ouédraogo, Kaboré, Tian, Song, Koyuncu, Klein, & Bissyandé (2024). Large-scale, independent and comprehensive study of the power of LLMs for test case generation. arXiv preprint.
Palanivinayagam, & El-Bayeh (2023). Twenty years of machine-learning-based text classification: A systematic review. Algorithms.
Palumbo, Perkins, Yancey, Brislin, Patrick, & Latzman (2020). Toward a multimodal measurement model for the neurobehavioral trait of affiliative capacity. Personality Neuroscience.
Pellert, Lechner, Wagner, Rammstedt, & Strohmaier (2024). AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories. Perspectives on Psychological Science.
Pennington, Socher, & Manning (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Peters, Neumann, Iyyer, Gardner, Clark, & Zettlemoyer (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
Pliakos, Joo, S.-H., Park, Cornillie, Vens, & Van den Noortgate (2019). Integrating machine learning into item response theory for addressing the cold start problem in adaptive learning systems. Computers & Education.
Press, Smith, & Lewis (2022). Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint.

Radford, Narasimhan, Salimans, & Sutskever (2018). Improving language understanding by generative pre-training.
Rahman, Faisal, Khanam, Amjad, & Siddik (2019). Personality detection from text using convolutional neural network. International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT).
Ramnarain-Seetohul, Bassoo, & Rosunally (2022). Similarity measures in automated essay scoring systems: A ten-year review. Education and Information Technologies.
Rao, Yerukola, Shah, Reinecke, & Sap (2025). NormAd: A framework for measuring the cultural adaptability of large language models. arXiv preprint.
Rathje, Mirea, D.-M., Sucholutsky, Marjieh, Robertson, & Van Bavel (2024). GPT is an effective tool for multilingual psychological text analysis. Proceedings of the National Academy of Sciences, 121(34), e2308950121.
Rawte, Priya, Tonmoy, Zaman, & Sheth (2023). Exploring the relationship between LLM hallucinations and prompt linguistic nuances: Readability, formality, and concreteness. arXiv preprint.
Ren, Shen, & Diao (2021). A sentiment-aware deep learning approach for personality detection from text. Information Processing & Management.
Roy, Nakshatri, & Goldwasser (2022). Towards few-shot identification of morality frames using in-context learning. In Bamman, Hovy, Jurgens, Keith, O'Connor, & Volkova (Eds.), Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS). Association for Computational Linguistics.
Sakai, Kang, & Kwak (2025). Somatic east, psychological west? Investigating clinically-grounded cross-cultural depression symptom expression. arXiv preprint.
Schulte-Mecklenbeck, Kühberger, & Ranyard (2011). The role of process data in the development and testing of process models of judgment and decision making. Judgment and Decision Making.
Shaik, Dann, McDonald, Redmond, & Galligan (2022). A review of the trends and challenges in adopting natural language processing methods for education feedback analysis. IEEE Access.
Sharma, Miner, Atkins, & Althoff (2023). Human-AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nature Machine Intelligence.
Sharma, & Giannakos (2020). Multimodal data capabilities for learning: What can multimodal data tell us about learning? British Journal of Educational Technology.
Shin (2024). Exploring automatic scoring of mathematical descriptive assessment using prompt engineering with the GPT-4 model: Focused on permutations and combinations. The Mathematical Education.
Silva, Zhang, Kulvicius, Gail, Barreiros, Lindstaedt, Kraft, Poustka, Nielsen-Saines, Wörgötter, Einspieler, & Marschik (2021). The future of general movement assessment: The role of computer vision and machine learning, a scoping review. Research in Developmental Disabilities.
Stade, Stirman, Ungar, Boland, Schwartz, Yaden, Sedoc, DeRubeis, Willer, & Eichstaedt (2024). Large language models could change the future of behavioral healthcare: A proposal for responsible development and evaluation. npj Mental Health Research.
Stadler, Herborn, Mustafić, & Greiff (2020). The assessment of collaborative problem solving in PISA 2015: An investigation of the validity of the tasks. Computers & Education.
Stamper, & Xiao (2024). Enhancing LLM-based feedback: Insights from intelligent tutoring systems and the learning sciences. In Olney, Chounta, I.-A., Liu, Santos, & Bittencourt (Eds.), Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium (Vol. 2150). Springer Nature Switzerland.

Takano, & Ichikawa (2022). Automatic scoring of short answers using justification cues estimated by BERT. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022).
Taubenfeld, Dover, Reichart, & Goldstein (2024). Systematic biases in LLM simulations of debates. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
Tong, Huang, Zhao, & Peng (2024). Automating psychological hypothesis generation with AI: When large language models meet causal graph. Humanities and Social Sciences Communications.
Uymaz, & Metin (2022). Vector-based sentiment and emotion analysis from text: A survey. Engineering Applications of Artificial Intelligence.
van Velthoven, Wang, Scherpbier, Chen, Zhang, & Rudan (2018). Comparison of text messaging data collection with face-to-face interviews for public health surveys: A cluster randomized crossover study of care-seeking for childhood pneumonia and diarrhoea in rural China. Journal of Global Health.
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, & Polosukhin (2017). Attention is all you need. Proceedings of the International Conference on Neural Information Processing Systems.
Vosoughi, Roy, & Aral (2018). The spread of true and false news online. Science, 359(6380).
Pellegrini, Chan, Brown, Rosenquist, Vuijk, Doyle, & Perlis (2020). Integrating questionnaire measures for transdiagnostic psychiatric phenotyping using word2vec. PLOS ONE, e0230663.
Wang, Dong, & Xuan (2025). MLLM-Tool: A multimodal large language model for tool agent learning. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
Wang, Chen, Huang, & Wang (2023). NeuralCD: A general framework for cognitive diagnosis. IEEE Transactions on Knowledge and Data Engineering.
Wang, & Chen (2023). Large language models are not fair evaluators. arXiv preprint.
Wang (2022). Using semantic similarity tools for automated content scoring of fact-based essays written by EFL learners. Education and Information Technologies.
Wang, Chen, Zhao, Zhai, Yuan, & Yang (2024). Exploring the reasoning abilities of multimodal large language models (MLLMs): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805.
Wei, Yao, Ton, J.-F., & Estornell (2024). Measuring and reducing LLM hallucination without gold-standard answers. arXiv preprint.
Gong, Donbekci, & Hirschberg (2024). Beyond silent letters: Amplifying LLMs in emotion recognition with vocal nuances. arXiv preprint arXiv:2407.21315.
Xiao, Kuang, Yang, Peng, & Huang (2024). HealMe: Harnessing cognitive reframing in large language models for psychotherapy. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Xu, Dong, Gabriel, Hendler, Ghassemi, & Wang (2024). Mental-LLM: Leveraging large language models for mental health prediction via online text data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 31:1-31:32.
Chen, Huang, & Zhang (2023). SECap: Speech emotion captioning with large language model. arXiv preprint arXiv:2312.10381.
Yancey, Laflair, Verardi, & Burstein (2023). Rating short L2 essays on the CEFR scale with GPT-4. Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023).
Yang, Wang, Chen, Wang, Huang, Song, & Huang (2024). LLM agents for psychology: A study on gamified assessments. arXiv preprint.
Yang, Quan, & Wang (2023). PsyCoT: Psychological questionnaire as powerful chain-of-thought for personality detection. In Bouamor, Pino, & Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics.
Zhang, Koutsoumpis, Oostrom, Holtrop, Ghassemi, & de Vries (2024). Can large language models assess personality from asynchronous video interviews? A comprehensive evaluation of validity, reliability, fairness, and rating patterns. IEEE Transactions on Affective Computing.
Zhang, Jin, & Zhou, Z.-H. (2010). Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics.
Zhang, Chen, Xiao, Feng, & Zhang (2025). Automated generation of personality assessment: Development and validation of large-language-model-derived HEXACO situational judgment tests (SSRN Scholarly Paper 5378520). Social Science Research Network.
Zhao, Xing, Wang, Meng, & Cheng (2024). Improving the robustness of large language models via consistency alignment. arXiv preprint.
Zhao, Zhang, Huang, Peng, & Chen (2024). Assessing and understanding creativity in large language models. arXiv preprint.
Zheng, Chiang, W.-L., Sheng, Zhuang, Zhuang, Xing, Zhang, Gonzalez, & Stoica (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint.
Zhu, Wu, & Zhang (2022). Automatic short-answer grading via BERT-based deep neural networks. IEEE Transactions on Learning Technologies, 15(3), 364-375.
