Abstract
In cognitive diagnostic assessment, differential item functioning (DIF) detection is an essential technique for evaluating test fairness and measurement validity. However, existing cognitive diagnostic DIF detection methods are limited to detecting main-effect DIF from the perspective of a single covariate and lack effective tools for detecting interactive DIF caused by the interaction of multiple covariates. To address this limitation, this study draws on the core ideas of recursive partitioning and proposes a new method (denoted ISRPM) capable of simultaneously detecting main-effect DIF and interactive DIF in cognitive diagnostic assessments. Simulation results indicate that the overall performance of ISRPM in main-effect DIF detection is generally comparable to that of traditional methods; more importantly, it outperforms traditional methods in interactive DIF detection. An empirical study further supports the usability of the method, showing that ISRPM's results are highly consistent with those of traditional DIF detection methods while demonstrating potential advantages in identifying interactive DIF. Overall, ISRPM is expected to further improve the accuracy of cognitive diagnostic DIF detection and promote the broader application of cognitive diagnostic assessment in psychological and educational measurement practice.
Full Text
Detection of Main and Interaction Effects in Differential Item Functioning for Cognitive Diagnostic Assessment: A Recursive Partitioning Perspective
Introduction
Cognitive Diagnostic Assessment (CDA) aims to provide fine-grained feedback on examinees' cognitive structures by identifying their mastery of specific attributes. As CDA is increasingly applied in high-stakes educational and psychological testing, ensuring the fairness and validity of these assessments has become paramount. Differential Item Functioning (DIF) occurs when examinees from different groups (e.g., gender, ethnicity, or socioeconomic status) who possess the same underlying ability or attribute profile have different probabilities of correctly answering an item.
Traditional DIF detection methods in CDA have primarily focused on "main effects"—that is, the influence of a single observed covariate on item performance. However, in complex real-world assessment environments, DIF may not be driven by a single factor alone. Instead, it may arise from the interaction of multiple covariates (e.g., the combined effect of gender and regional educational resources). Existing methodologies often lack the flexibility to detect these high-order interactions without pre-specifying the interaction terms, which can lead to undetected bias and compromised test validity.
To overcome these challenges, this study proposes the Interaction-based Structural Recursive Partitioning Method (ISRPM). By leveraging recursive partitioning—a machine learning-based approach—this method can automatically identify both main and interaction effects of covariates that lead to DIF.
Methodology
The ISRPM framework integrates cognitive diagnostic modeling with recursive partitioning. Unlike traditional logistic regression or Mantel-Haenszel approaches, which require researchers to define group memberships a priori, ISRPM treats covariates as potential splitting variables.
The Recursive Partitioning Approach
The core of ISRPM involves an iterative process:
1. Model Fitting: A base cognitive diagnostic model (such as the DINA or G-DINA model) is fitted to the entire sample.
2. Parameter Instability Testing: The method tests whether item parameters remain stable across the range of available covariates.
3. Optimal Splitting: If significant instability is detected, the sample is partitioned into subgroups (nodes) based on the covariate that maximizes the difference in item parameters.
4. Iteration: This process repeats recursively within each subgroup until no further significant differences are found or a stopping criterion is met.
This approach allows for the discovery of complex interaction patterns (e.g., an item may only exhibit DIF for female students in rural areas) that would be difficult to specify manually in traditional models.
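The four steps above can be sketched with a toy recursive partitioning routine. This is a minimal illustration under simplifying assumptions: the item parameter is proxied by the proportion correct and instability is tested with a two-proportion z statistic, whereas a full implementation would refit a cognitive diagnostic model such as G-DINA in each node; all function names here are illustrative, not the paper's.

```python
import numpy as np
from scipy import stats

def find_best_split(responses, covariates, min_node=30):
    """Search every covariate and cut point for the split that maximizes the
    evidence of a difference in item performance between the two resulting
    subgroups.  Item performance is proxied by the proportion correct."""
    best = None
    for name, values in covariates.items():
        for cut in np.unique(values)[:-1]:  # each level except the last is a candidate cut
            left = responses[values <= cut]
            right = responses[values > cut]
            if len(left) < min_node or len(right) < min_node:
                continue
            # two-sample z statistic for a difference in proportions
            p = responses.mean()
            se = np.sqrt(p * (1 - p) * (1 / len(left) + 1 / len(right)))
            z = abs(left.mean() - right.mean()) / se
            pval = 2 * stats.norm.sf(z)
            if best is None or z > best["stat"]:
                best = {"covariate": name, "cut": cut, "stat": z, "p": pval}
    return best

def grow_dif_tree(responses, covariates, alpha=0.05, depth=0, max_depth=3):
    """Steps 1-4: fit, test instability, split on the strongest covariate,
    and recurse until no significant split remains or a stop rule fires."""
    split = find_best_split(responses, covariates)
    if split is None or split["p"] >= alpha or depth >= max_depth:
        return {"leaf": True, "n": len(responses), "p_correct": responses.mean()}
    mask = covariates[split["covariate"]] <= split["cut"]
    subsets = [(responses[m], {k: v[m] for k, v in covariates.items()})
               for m in (mask, ~mask)]
    return {"leaf": False, "split": split,
            "children": [grow_dif_tree(r, c, alpha, depth + 1, max_depth)
                         for r, c in subsets]}
```

Run on data simulated with a gender main effect, the routine recovers a single split on gender and leaves the resulting nodes unsplit.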
Simulation Study
A series of simulation studies were conducted to evaluate the performance of ISRPM compared to traditional DIF detection methods (e.g., the Wald test and Likelihood Ratio Test).
Design
The simulation manipulated several factors:
- Sample size
- Number of attributes
- Type of DIF (Main effect vs. Interaction effect)
- Magnitude of DIF
Results
The results demonstrated that:
- Type I Error: ISRPM effectively controlled the Type I error rate under various conditions, maintaining it near the nominal level.
- Power for Main Effects: In scenarios involving only main effects, ISRPM showed detection power comparable to traditional CDA-DIF detection methods.
- Power for Interaction Effects: ISRPM significantly outperformed traditional methods when DIF was caused by the interaction of multiple covariates. While traditional methods often failed to identify these complex biases, ISRPM successfully partitioned the groups to reveal the underlying DIF structure.
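As a small illustration of how these two rates are computed from simulation output, the sketch below aggregates DIF decisions over replications; the function name and array layout are illustrative, not taken from the original study.

```python
import numpy as np

def error_and_power(flagged, true_dif):
    """flagged: (replications, items) boolean array of DIF decisions.
    true_dif: (items,) boolean array marking items simulated with DIF.
    Returns (type_i_error, power): the flagging rate over DIF-free items
    estimates the Type I error; over true-DIF items it estimates power."""
    flagged = np.asarray(flagged, dtype=bool)
    true_dif = np.asarray(true_dif, dtype=bool)
    type_i = flagged[:, ~true_dif].mean() if (~true_dif).any() else np.nan
    power = flagged[:, true_dif].mean() if true_dif.any() else np.nan
    return type_i, power
```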
Empirical Application
To demonstrate the practical utility of ISRPM, the method was applied to a real-world educational dataset. The analysis focused on identifying whether specific items in a mathematics assessment exhibited DIF across gender and socioeconomic status.
The empirical results showed that ISRPM maintained high consistency with traditional DIF detection methods while demonstrating potential advantages in identifying interactive DIF.
Keywords
Cognitive Diagnostic Assessment; Differential Item Functioning; Recursive Partitioning
Introduction
Cognitive Diagnostic Assessment (CDA) is a psychometric framework designed to measure students' specific knowledge structures and processing skills. Unlike traditional testing, which provides a single aggregate score, CDA offers a fine-grained profile of an examinee's strengths and weaknesses across multiple attributes. A critical challenge in ensuring the validity and fairness of these assessments is the detection of Differential Item Functioning (DIF). DIF occurs when examinees from different groups (e.g., based on gender, ethnicity, or socioeconomic status) who possess the same underlying ability or attribute profile have different probabilities of correctly answering an item.
Differential Item Functioning in CDA
In the context of Cognitive Diagnostic Models (CDMs), DIF can undermine the diagnostic accuracy of the assessment. If an item functions differently across groups, the resulting attribute profiles may be biased, leading to incorrect instructional interventions. Traditional methods for detecting DIF often require pre-defined focal and reference groups. However, in many practical scenarios, the variables that define these groups are complex, interacting, or unknown beforehand. This necessitates more flexible and data-driven approaches to identify potential sources of bias.
Recursive Partitioning Techniques
Recursive partitioning techniques, such as decision trees and random forests, have emerged as powerful tools for identifying DIF in complex datasets. These methods do not require the researcher to specify group boundaries a priori. Instead, they use a top-down, "divide and conquer" approach to split the sample into more homogeneous subgroups based on covariates.
In the framework of CDA, recursive partitioning can be integrated with diagnostic models to detect "DIF trees." This process involves:
- Model Fitting: Estimating the parameters of a CDM (such as the DINA or G-DINA model) for the entire population.
- Parameter Instability Testing: Using statistical tests to determine if item parameters remain constant across the range of available covariates.
- Recursive Splitting: If significant instability is detected, the algorithm selects the covariate and the specific split point that maximizes the difference in item parameters between the resulting nodes.
- Pruning: Applying stopping criteria to prevent overfitting and ensure that the identified subgroups are statistically meaningful.
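The parameter instability step can be illustrated with a likelihood-ratio comparison of a pooled fit against separate fits in the two candidate nodes. The sketch below uses a single Bernoulli success probability as a stand-in for a CDM item parameter; the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy import stats

def bernoulli_loglik(x):
    """Maximized Bernoulli log-likelihood for a vector of 0/1 responses."""
    p = np.clip(np.mean(x), 1e-10, 1 - 1e-10)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def instability_lrt(responses, group):
    """Likelihood-ratio test of whether the item parameter (here a single
    success probability) is constant across the two nodes defined by
    `group`.  A small p-value signals parameter instability, i.e. a
    worthwhile split."""
    ll_pooled = bernoulli_loglik(responses)
    ll_split = (bernoulli_loglik(responses[group == 0])
                + bernoulli_loglik(responses[group == 1]))
    lr = 2 * (ll_split - ll_pooled)  # asymptotically chi-square, df = 1
    return lr, stats.chi2.sf(lr, df=1)
```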
Advantages and Applications
The application of recursive partitioning to DIF detection in CDA offers several advantages:
- Interaction Detection: It can automatically identify complex interactions between multiple covariates (e.g., the combined effect of gender and regional educational resources) without pre-specifying interaction terms.
1 Introduction
In today's rapidly evolving information society, the rise of Cognitive Diagnosis has introduced a transformative evaluation system for test scores in the fields of psychological and educational measurement \cite{DiBello2006, LeightonGierl2007, Nichols1995}. Cognitive diagnosis focuses not only on the overall level of an individual's ability but also emphasizes revealing internal mental processing and cognitive structures. Consequently, it better achieves the core objective of testing: promoting individual development. (Supported by National Natural Science Foundation of China projects 62467002, 32300942, and 62167004).
In the field of psychological assessment, cognitive diagnosis can be used not only to evaluate an individual's cognitive functional state but also to accurately identify symptom characteristics. This provides critical data support for clinicians to implement precision treatment and early intervention \cite{Torre2018, Torre2023, TemplinHenson2006}. While traditional academic achievement tests typically assess student ability based on total scores or grades, cognitive diagnosis focuses more on the learning process itself. It precisely identifies an individual's strengths and weaknesses across different cognitive components, providing an effective reference for educators to formulate targeted teaching strategies and knowledge remediation plans \cite{Rupp2010}. Against the backdrop of rapid technological development, cognitive diagnosis has not only broadened the research perspective of psychological and educational measurement but also provided effective technical support for personalized instruction and precision psychotherapy. With its unique advantage in providing fine-grained diagnostic information, cognitive diagnosis has become a research frontier in psychological and educational measurement both domestically and internationally, and is widely applied in test development.
A primary concern for test developers in cognitive diagnostic testing is whether measurement results exhibit systematic bias toward specific groups, which could lead to an unfair advantage or disadvantage for those groups. This issue essentially involves the evaluation of test fairness. Within the psychometric framework, concepts closely related to test fairness primarily include measurement invariance \cite{Meredith1993} and Differential Item Functioning (DIF) \cite{HollandWainer1993}. Measurement invariance refers to a test maintaining consistent measurement properties across different groups of examinees, such as those defined by gender or cultural background. When measurement properties differ systematically between groups, it indicates measurement noninvariance. When this noninvariance is manifested at the item level, it signifies the presence of DIF. Under the cognitive diagnosis framework, DIF is typically defined as a systematic difference in the probability of a correct response on the same item for examinees from different groups who share the same attribute mastery pattern \cite{Hou2014, Zhang2008}. Existing research indicates that the presence of DIF not only undermines the fairness of cognitive diagnostic tests but may also reduce measurement validity \cite{Hou2014}. Furthermore, it can lead to biased estimates of item parameters, resulting in the misclassification of examinees' attribute mastery patterns and leading to misleading assessment results \cite{Paulsen2020}. Therefore, the detection of DIF is a critical step in the development and validation stages of cognitive diagnostic tests.
Widely recognized by psychometric researchers \cite{2014; Wang, 2015; 2020}, DIF detection is not only a vital component of test quality control but also a necessary condition for ensuring test fairness and measurement validity.
Researchers at home and abroad have proposed both parametric and non-parametric DIF detection methods suitable for cognitive diagnosis. Non-parametric methods are easy to understand and applicable to a wide range of samples, but their detection accuracy is typically lower than that of parametric methods; typical representatives include the Mantel-Haenszel and SIBTEST approaches \cite{Zhang, 2006}. Parametric methods, in contrast, require the estimation of model-specific parameters.
Although parametric methods tied to specific cognitive diagnostic models are relatively complex to apply and computationally expensive, they yield more accurate detection results. Common parametric methods include those based on the Wald test \cite{2014; 2008; 2021}, logistic regression \cite{2015}, and the likelihood ratio test (LRT) \cite{2021}. The detection performance of these parametric methods has been thoroughly validated in simulation studies of cognitive diagnostic tests.
This body of research provides reliable theoretical and technical support for DIF detection. Given their precision advantages, parametric methods have received more attention in recent years than non-parametric methods, and this paper therefore focuses on developing parametric cognitive diagnostic DIF detection methods. Although existing cognitive diagnostic DIF detection methods generally perform well, they share a notable limitation: they typically evaluate only whether a single covariate triggers DIF in isolation, without fully considering that interactions between multiple covariates may also produce DIF. For example, the interaction between gender and household registration may influence examinees' response patterns on specific items. To distinguish clearly between the two cases, this paper defines DIF caused by the interaction of multiple covariates as interactive DIF (DIF-I) and DIF caused by a single covariate as main-effect DIF (DIF-M). Existing research indicates that DIF-I may be prevalent in psychological and educational testing, further complicating the sources of measurement bias. For instance, Bauer (2017) found that measurement bias caused by interactions between covariates was substantial in psychological tests assessing adolescent delinquent behavior, and Berger (2016) found in intelligence-structure tests that the DIF of certain items might stem from the interaction between gender and age. We therefore have reason to suspect that DIF-I exists in cognitive diagnostic tests and could adversely affect test fairness and measurement validity. Numerous researchers have emphasized the theoretical and practical value of interactive DIF detection in test quality analysis \cite{Belzak2023, Strobl2015, Berger2016}.
Detecting DIF-I helps reveal the complex sources of differential functioning more comprehensively, thereby giving test developers a more valuable basis for item revision. Collins (1990) pointed out that individual identity is the result of the intersection of multiple demographic characteristics, which suggests that explorations of measurement bias should consider possible interactions between variables in addition to the main effects of single demographic variables. Compared with DIF detection from a single-covariate perspective, identifying DIF-I therefore uncovers more potential sources of measurement bias and provides more accurate, targeted evidence for item revision. The hidden nature of DIF-I also makes its detection particularly necessary, because it is difficult to identify with traditional methods: these methods usually assume that demographic variables are independent of one another, so they can identify only DIF caused by a single covariate and struggle to reveal bias caused by covariate interactions. During test development, researchers often presuppose possible sources of measurement bias based on single demographic variables such as gender or race while ignoring the potential impact of variable interactions on test fairness, and as a result some measurement bias may be overlooked.
The existence of DIF-I undermines test fairness, so accurately identifying interactive DIF helps further enhance the fairness of cognitive diagnostic assessment results. Identifying DIF-I is also valuable for improving measurement validity: interactive DIF may not only affect examinees' performance on specific items but also hinder the accurate estimation of their attribute mastery patterns, and if DIF triggered by covariate interactions goes undetected, the quality of cognitive diagnostic assessment results is weakened. DIF-I detection is therefore of great value for comprehensively revealing sources of measurement bias, ensuring test fairness, and improving measurement validity. However, existing cognitive diagnostic DIF detection methods remain clearly deficient in identifying interactive DIF, which challenges the effective guarantee of test fairness and measurement validity. Developing a method within the cognitive diagnostic framework that can simultaneously identify main-effect and interactive DIF would thus both refine the theory and technology of cognitive diagnostic DIF detection and promote the sound application of cognitive diagnostic assessment in practice. Based on this need, this study proposes a method capable of simultaneously identifying main-effect and interactive DIF, thereby providing more comprehensive technical support for the fairness evaluation of cognitive diagnostic tests.
With the continuous advancement of modern data processing technology, data mining techniques have been widely applied in the fields of psychological and educational measurement. Belzak and Bauer (2020) pointed out that the process of DIF detection is highly similar to variable selection in regression modeling, which provides a theoretical basis for applying variable selection methods to DIF detection. This has offered new ideas for the improvement and innovation of DIF detection methods. Compared to traditional DIF detection methods, data mining technology offers advantages such as high efficiency, strong flexibility, and the ability to process multiple covariates simultaneously, demonstrating great potential in identifying complex forms of DIF \cite{Belzak2023}. Based on the potential of variable selection methods in DIF detection, researchers have begun to combine Item Response Theory (IRT) with variable selection techniques to develop a series of new methods capable of identifying complex DIF forms \cite{Bollmann2018, Strobl2015, Berger2016}, providing important methodological references for conducting interactive DIF detection in cognitive diagnostic contexts. Recursive Partitioning is the most representative variable selection technique among these methods.
The basic principle of this technology is to recursively partition the feature space covered by the predictor variables into several sub-regions and fit a relatively simple model within each region \cite{Hothorn2006}. By continuously executing data segmentation and modeling, the method can capture complex non-linear relationships and interactions between variables.
This technology can intuitively reveal how the main effects and interactions of covariates relate to item parameters, thereby providing effective technical support for exploring complex sources of measurement bias. According to Strobl (2015), Berger (2016), and Bollmann (2018), its advantages are reflected in three aspects. First, it removes the limitation of traditional methods that require examinees to be manually divided into focal and reference groups prior to analysis: it automatically identifies the optimal covariate grouping criteria in a data-driven manner, reducing the risk that measurement bias is overlooked due to improper grouping.
Second, whereas traditional methods often struggle to examine interactions between covariates and are therefore severely limited in interaction detection, this technology can reveal complex interactions among multiple covariates and evaluate their impact on measurement bias, improving the detection accuracy of interaction effects.
Third, recursive partitioning (RP) can flexibly handle covariates of various types, including continuous, ordinal, and binary variables, broadening the method's scope of application. Taken together, its flexibility, interpretability, and robustness make it an effective framework for DIF detection, with unique advantages in complex scenarios involving covariate interactions.
To date, recursive-partitioning-based DIF detection methods have been developed primarily within the framework of Item Response Theory (IRT). These methods fall roughly into two categories: global-level methods and item-level methods. Global-level methods test for parameter instability across the feature space covered by the covariates to determine whether the main effect of a single covariate or the interaction of multiple covariates leads to measurement bias.
Although global-level methods can identify the presence of Differential Item Functioning (DIF), they often fail to localize it further to specific items. Typical representatives include Rasch trees (Strobl et al., 2015) and Rasch trees for polytomous items (Komboz et al., 2018), both of which focus on global detection.
Compared to global-level approaches, item-level methods not only identify the covariates that induce DIF but also pinpoint the specific items where DIF occurs. Consequently, these methods exhibit higher flexibility and practical utility in DIF detection. Typical representatives of this category include item-focused trees (IFT; Berger & Tutz, 2016) and item-focused trees based on the partial credit model (PCM-IFT; Bollmann et al., 2018). These methods have garnered significant attention due to their dual advantages in identifying covariates and localizing problematic items. Furthermore, such methods can simultaneously process multiple covariates in a single analysis (Finch et al., 2015) and further explore whether interactions between these covariates induce functional differences at the item level.
Although existing research has demonstrated that tree-based techniques possess significant research potential and application value for detecting main effects and interactions, current studies have only validated their effectiveness for DIF detection within the framework of Item Response Theory (IRT). More importantly, there are currently no published studies systematically exploring interaction-based DIF detection within cognitive diagnostic tests. It is worth noting that Cognitive Diagnostic Theory (CDT) differs significantly from IRT in terms of model assumptions, measurement objectives, and data analysis methods. These differences present new challenges when directly applying tree-based techniques to cognitive diagnostic contexts. Therefore, extending tree-based techniques to DIF detection in cognitive diagnostic assessments and verifying whether they can maintain their established detection accuracy and applicability remains an important issue requiring in-depth exploration. Drawing on the core ideas of tree-based techniques, this study proposes and validates a new DIF detection method suitable for cognitive diagnostic tests. This method aims to provide effective technical support for identifying main effects and interactions in cognitive diagnostic assessments, thereby further improving test fairness and ultimately promoting the deep application of cognitive diagnostic technology in the fields of psychological and educational measurement.
2 The Item-level Sequential Recursive Partitioning Method
The Item-level Sequential Recursive Partitioning Method (ISRPM) integrates recursive partitioning techniques with cognitive diagnosis models (CDMs). The method uses test statistics constructed from model parameter estimates as the criteria for covariate splitting; by comparing the effectiveness of the candidate partitioning schemes, ISRPM identifies the optimal split and generates, for each item, a recursive partitioning tree that reflects the item's performance patterns.
At each level of the tree, ISRPM evaluates potential splitting schemes for candidate covariates by calculating statistics that quantify differences in item parameters. It selects the specific covariate and split point that maximizes the disparity in item parameters between the resulting subgroups. The participant population is then recursively partitioned based on these criteria, expanding the tree structure layer by layer. This process clarifies which covariates induce differential item functioning and reveals the specific mechanisms through which they operate. The following sections provide a brief introduction to the cognitive diagnosis model employed in this study, as well as the operational steps and theoretical foundations of the ISRPM.
2.1 Cognitive Diagnosis Models and G-DINA
Cognitive Diagnosis Models (CDMs) serve as psychometric models that fully integrate cognitive variables within the framework of cognitive diagnosis. As the core technical component of cognitive diagnostic assessment, the quality of these models directly determines the validity of diagnostic results \cite{2019}. Researchers have developed a variety of CDMs with excellent diagnostic performance that are applicable to different testing contexts and theoretical assumptions. This study adopts the Generalized Deterministic Input, Noisy "And" gate (G-DINA) model. Proposed by de la Torre \cite{2011}, the G-DINA model is a generalized psychometric model extended from the Deterministic Input, Noisy "And" gate (DINA) model. Under the G-DINA framework, examinees are subdivided into $2^{K_j}$ attribute mastery patterns based on the number of attributes ($K_j$) measured by a specific item $j$. To introduce its mathematical expression, assume the first $K_j$ attributes are the ones required to answer item $j$ correctly. The conditional probability of a correct response on item $j$ can be expressed as a function of the attribute mastery pattern. This model utilizes a link function to relate the latent attributes to the response probability; depending on the link function used, the G-DINA model can take different forms. The three most commonly used link functions are the identity, log, and logit links. In this context, the intercept represents the probability of a correct response when the examinee has mastered none of the required attributes, which is generally non-negative. Larger parameter values indicate that mastering a specific attribute contributes more significantly to the probability of a correct response. These parameters represent the mastery status of examinees across different patterns, including the main effects of individual attributes and their interaction effects.
The model also accounts for the interaction effects among all measured attributes. It should be emphasized that, to maintain consistency with previous similar studies \cite{2013, 2020, 2022}, an identity link function was employed in this research.
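The identity-link decomposition just described, an intercept plus attribute main effects and interaction terms that contribute only when all attributes in a term are mastered, can be sketched as follows. The `delta` dictionary layout is an illustrative choice of ours, not the paper's notation.

```python
from itertools import combinations

def gdina_identity_prob(alpha, delta):
    """Correct-response probability under the G-DINA identity link.
    alpha: tuple of 0/1 mastery indicators for the K_j attributes the item
    measures.  delta: dict mapping attribute subsets (tuples of indices) to
    effect parameters; () is the intercept, (k,) a main effect, (k, k') a
    two-way interaction, and so on up to the K_j-way interaction."""
    k_j = len(alpha)
    prob = delta.get((), 0.0)  # intercept: P(correct | no required attribute mastered)
    for order in range(1, k_j + 1):
        for subset in combinations(range(k_j), order):
            # a term contributes only if every attribute in it is mastered
            if all(alpha[k] == 1 for k in subset):
                prob += delta.get(subset, 0.0)
    return prob
```

For a two-attribute item with intercept .1, main effects .3 and .2, and interaction .3, full mastery yields a correct-response probability of .9.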
2.2 Defining DIF under the G-DINA Model
What needs to be redefined is the examinee's mastery of discrete attributes, rather than positioning the examinee along a continuum of latent trait levels. Under the G-DINA model framework (2014), this can be expressed as:
$$P_j(\alpha_l^{(G)}) = P(X_{ij} = 1 | \alpha_i = \alpha_l, G)$$
where $G \in \{f, r\}$ indexes group membership, so that $P_j(\alpha_l^{(f)})$ and $P_j(\alpha_l^{(r)})$ represent the probability of a correct response to item $j$ for examinees with attribute mastery pattern $\alpha_l$ in the focal group ($f$) and reference group ($r$), respectively.
2.3 Development of ISRPM: Basic Principles and Detection Procedures
This section details the operational workflow and key technical details of the Item-Level Sequential Recursive Partitioning Method (ISRPM). The ISRPM integrates recursive partitioning with the G-DINA model to detect main effects and interactions by recursively partitioning on covariates. The basic logic is as follows: for each item, given a set of candidate covariates, ISRPM first identifies all possible split points for each covariate and divides the examinee response data into sub-samples based on these split points. The G-DINA model is then fitted to each sub-sample to estimate model parameters, which are substituted into a preset splitting criterion (a test statistic). The covariate and split point corresponding to the maximum value of the statistic are identified as the first optimal split. After completing the initial split, ISRPM repeats this process within the newly generated sub-samples, sequentially determining the optimal splitting scheme at each level until the preset termination rules are met. Finally, ISRPM generates a recursive partitioning tree that visually shows the conditions under which the item exhibits differential item functioning (DIF). The following sections describe the main operational steps and technical details of ISRPM.
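The step-by-step logic just described (scan the candidate covariates, evaluate each potential split, split on the largest statistic, recurse with the used covariate removed, and stop when nodes get too small) can be sketched in Python. This is an illustrative sketch, not the authors' implementation: `toy_wald` replaces the G-DINA-based Wald statistic with a simple two-sample statistic so the example runs on its own, only binary covariates with a single split point are handled, and the critical value `crit` stands in for the adjusted chi-square cutoff.

```python
MIN_NODE = 100  # minimum child-node sample size (termination rule)

def toy_wald(y_left, y_right):
    """Stand-in for the G-DINA-based Wald statistic: a two-sample z^2
    on correct-response rates. In ISRPM proper, the G-DINA model is
    fitted in each child node and the multi-parameter Wald statistic
    is computed from the estimated item parameters."""
    n1, n2 = len(y_left), len(y_right)
    p1, p2 = sum(y_left) / n1, sum(y_right) / n2
    p = (sum(y_left) + sum(y_right)) / (n1 + n2)
    var = p * (1 - p) * (1 / n1 + 1 / n2) or 1e-12  # guard degenerate nodes
    return (p1 - p2) ** 2 / var

def isrpm_search(rows, covariates, crit, path=()):
    """Greedy recursive search for optimal splitting covariates on one
    item. rows: dicts with binary covariates and a 0/1 response 'y'.
    Returns the paths at which a significant split (DIF) was found."""
    best = None
    for cov in covariates:                            # scan all candidates
        left = [r["y"] for r in rows if r[cov] == 0]
        right = [r["y"] for r in rows if r[cov] == 1]
        if min(len(left), len(right)) < MIN_NODE:     # node too small
            continue
        w = toy_wald(left, right)
        if best is None or w > best[1]:
            best = (cov, w)
    if best is None:                                  # termination rules met
        return []
    cov, w = best
    splits = [path + (cov,)] if w > crit else []      # solid vs dashed arrow
    remaining = [c for c in covariates if c != cov]   # drop used covariate
    for level in (0, 1):                              # recurse into children
        child = [r for r in rows if r[cov] == level]
        splits += isrpm_search(child, remaining, crit, path + ((cov, level),))
    return splits

# Toy data: DIF driven purely by covariate x1 (hypothetical)
rows = [{"x1": i % 2, "x2": (i // 2) % 2, "y": i % 2} for i in range(400)]
res = isrpm_search(rows, ["x1", "x2"], crit=3.84)
```

With this toy data, only the first-level split on `x1` is significant, mirroring a "main effect only" tree.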
Step 1: Determining Covariates of Interest and Potential Split Points
In cognitive diagnostic DIF detection, the root node of the recursive partitioning tree includes all covariates of interest and their levels. These levels collectively constitute the feature space covered by the covariates, while the child nodes represent subsets within this feature space. From the perspective of examinee grouping logic, this structure of stepwise refinement based on the feature space is consistent with the traditional idea of grouping examinees according to covariate levels in DIF detection.
In traditional DIF detection, examinees are divided into a focal group ($G_F$) and a reference group ($G_R$) based on the levels of each covariate of interest, the two groups representing different levels of the same demographic characteristic. In this study every covariate is binary, so the corresponding groups are defined directly by its two levels, and each covariate has one and only one potential split point. The entire feature space covered by all covariates defines the initial focal and reference groups.
Step 2: Searching for the First Optimal Splitting Covariate Based on Selected Criteria
After determining the covariates and potential split points, ISRPM performs parameter estimation on the grouped datasets corresponding to each potential splitting scheme. To ensure reliable results, this study employs the widely validated G-DINA model for estimation. Once parameter estimates are obtained, the key task is to determine an appropriate covariate splitting criterion to identify the first optimal splitting covariate and judge whether the current item exhibits DIF on that variable. In the recursive partitioning framework proposed by Strobl (2009), splitting criteria generally fall into two categories: those based on impurity measures (e.g., Gini Index, Shannon Entropy) and those based on test statistics (e.g., Log-likelihood statistic). The former achieves partitioning by measuring the homogeneity of samples within a node, while the latter uses statistical tests of parameter differences to determine if further partitioning is necessary. In parametric cognitive diagnosis, splitting based on test statistics is more aligned with research objectives; therefore, this study adopts a test statistic as the ISRPM covariate splitting criterion.
When searching for the first optimal splitting covariate, ISRPM generates candidate child nodes for all potential split points of each candidate covariate. These nodes constitute the potential left and right child nodes of the first level of the recursive partitioning tree. Based on all feature levels of the covariate, focal and reference group datasets defined by its potential split points are obtained, and the G-DINA model is fitted to each group to estimate the corresponding item parameters.
In the recursive partitioning tree, the G-DINA item parameters estimated within different child nodes are distinguished by the subscripts "L" and "R" (for the left and right child nodes, respectively). Let $G_L$ and $G_R$ denote the focal and reference groups defined by covariate $X$. The intercept parameter of the left child node established on $X$ can then be read as the performance of the focal group defined by $X$ on the current item, and analogously for the other parameters.
After obtaining the estimated item parameters for the current item under each candidate covariate $X$, it is necessary to calculate the corresponding test statistic for each potential split point. Any test statistic validated in cognitive diagnostic DIF detection research can be used. Previous studies have found that under the cognitive diagnosis framework, detection methods based on the Wald statistic perform better than SIBTEST or likelihood ratio tests in terms of Type I error control and statistical power \cite{hou2022}. Therefore, this study selects the Wald statistic as the covariate splitting criterion. The Wald statistic follows a chi-square distribution, and its mathematical expression is:
$$W = (\hat{\beta}_R - \hat{\beta}_F)^{\top} (\Sigma_R + \Sigma_F)^{-1} (\hat{\beta}_R - \hat{\beta}_F)$$
where $\hat{\beta}_R$ and $\hat{\beta}_F$ are the vectors of item parameter estimates for the reference and focal groups on item $j$, respectively, and $\Sigma_R$ and $\Sigma_F$ are their corresponding sampling variance-covariance matrices. It is worth noting that DIF detection usually requires item purification to place parameters from different groups on the same scale, ensuring comparability \cite{magis2010}. This study does not perform item purification for three primary reasons. First, from the perspective of scale consistency, model parameters in the cognitive diagnosis framework are naturally on the same scale because they are not based on a continuous latent trait, so the dependency on purification is lower. Second, the G-DINA model measures the mastery of discrete attributes rather than continuous ability levels, and thus requires no additional parameter scaling \cite{yuan2021}. Third, regarding the applicability of purification procedures, although they can improve accuracy, they are cumbersome in practice, involving multiple rounds of estimation and dynamic adjustment of anchor sets \cite{meade2012}; moreover, purification does not always guarantee a pure anchor set \cite{yuan2021}, and when a test contains multiple DIF items, masking and swamping effects may occur, weakening the accuracy of the anchor set \cite{barnett1994}. Finally, in the international literature, the vast majority of studies do not use purification.
Studies using the Wald test for DIF detection in cognitive diagnosis have consistently omitted item purification procedures \cite{hou2013, de2014, wang2015, li2021, mehrazmay2022}. Following these precedents, this study also foregoes item purification.
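As a numerical illustration, the Wald statistic above can be computed from hypothetical two-parameter estimates (the numbers and the diagonal covariance matrix below are invented for the example; in practice $\hat{\beta}$ and $\Sigma$ come from fitting the G-DINA model in each group):

```python
import math

def wald_stat_2d(beta_r, beta_f, cov_sum):
    """Wald DIF statistic W = d' V^{-1} d for a two-parameter item
    (intercept and one main effect), with V = Sigma_R + Sigma_F given
    as a 2x2 nested list. The 2x2 inverse is written out explicitly
    to keep the sketch dependency-free."""
    d0 = beta_r[0] - beta_f[0]
    d1 = beta_r[1] - beta_f[1]
    (a, b), (c, e) = cov_sum
    det = a * e - b * c
    # V^{-1} = (1/det) * [[e, -b], [-c, a]]
    return (d0 * (e * d0 - b * d1) + d1 * (a * d1 - c * d0)) / det

# Hypothetical G-DINA parameter estimates (intercept, main effect):
beta_r = [0.20, 0.60]                      # reference group
beta_f = [0.10, 0.45]                      # focal group
V = [[0.002, 0.0], [0.0, 0.004]]           # assumed Sigma_R + Sigma_F
w = wald_stat_2d(beta_r, beta_f, V)        # -> 10.625
p = math.exp(-w / 2)                       # chi-square sf for df = 2
```

Here $W \approx 10.63$ with $p < 0.01$, so this hypothetical item would be flagged on the covariate under test.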
After calculating the Wald statistics for all candidate covariates and their potential split points, ISRPM ranks the covariates in descending order based on the magnitude of the statistic. The covariate corresponding to the maximum value is identified as the first optimal splitting covariate. For the convenience of subsequent explanation, this step assumes the first optimal splitting covariate is the first among all covariates ($X_1$). The examinee sample can then be divided into two groups, corresponding to the left and right child nodes of the recursive partitioning tree.
It should be noted that once the first optimal splitting covariate is determined, the DIF detection process for the item on that specific covariate is considered complete. In subsequent covariate searches and partitioning, that variable is removed from the candidate set to avoid redundant partitioning.
Step 3: Recursively Searching for Optimal Splitting Covariates and Split Points in Remaining Covariates
After identifying the first optimal splitting covariate, ISRPM does not stop; instead, it continues recursive partitioning based on the two generated child nodes to gradually expand the hierarchical structure of the tree. ISRPM continues to search for the optimal splitting covariates and split points for subsequent levels among the remaining covariates. Specifically, this recursive process involves the following: based on the optimal split point of the previously identified covariate, ISRPM divides the examinee sample into two parts corresponding to the two feature levels defined by that split. These levels are then fully crossed with the potential split points of the remaining covariates to generate new combinations of feature levels.
When $X_1$ is determined as the first optimal splitting covariate, its two levels are crossed with the two feature levels of a remaining covariate $X_2$, forming four new level combinations.
ISRPM redefines the focal and reference groups based on these newly generated feature combinations. For example, by crossing the right child node established by $X_1$ with the two feature levels of covariate $X_2$, two new feature combinations are obtained. The former can be regarded as the new focal group and the latter as the new reference group, both of which are used for subsequent model fitting and Wald statistic calculation.
ISRPM selects the covariate and split point with the largest Wald statistic as the optimal splitting scheme for the current level. If the calculation results show that the split is significant, the process continues.
By crossing all levels of the right child node, ISRPM further generates two new child nodes. As the tree structure expands, ISRPM adjusts the formulas based on the new partitioning levels. For instance, the intercept parameter for a newly established child node is denoted accordingly, and so forth for other parameters.
Step 4: Repeating Step 3 Until Termination Rules are Met
During the repeated search for optimal splitting covariates and split points, the data undergoes multiple rounds of partitioning. As the depth of the tree increases, the sample size assigned to each subsequent child node gradually decreases. To ensure each child node has a sufficient sample size for accurate parameter estimation and reliable DIF detection, the following two termination rules are set: 1) The covariate search space is exhausted; 2) The sample size of any child node falls below a preset minimum threshold (set to 100 in this study). When either condition is met, ISRPM immediately terminates the search and partitioning process for the current item.
Step 5: DIF Detection and Output of the Recursive Partitioning Tree
When the search for optimal splitting covariates and split points meets the termination conditions, the process for the current item stops, and a decision is made whether to output a recursive partitioning tree. Specifically, there are two possibilities for a single item: if no meaningful covariate splits occur during the search (i.e., the p-values for the Wald statistics all exceed the preset significance level), the item is judged to have no DIF, and no tree is generated. If at least one covariate split is meaningful (p-value $\leq$ significance level), the item is judged to have DIF, and the corresponding recursive partitioning tree is output. This tree visually displays on which covariates, and in what manner, the item exhibits DIF. To facilitate reading the schematic trees: each child node displays the G-DINA item parameters for the corresponding group, and the arrow type indicates whether the Wald statistic reached statistical significance, i.e., whether DIF is present (solid arrows indicate the presence of DIF, dashed arrows its absence). Overall, the items may exhibit four different DIF patterns:
1. Main Effect Only: This is characterized by solid arrows appearing only at all child nodes of the first level of the tree.
2. Special Main Effect: If the difference between groups is roughly equivalent across covariates, it indicates the item exhibits only main effects on two covariates without an interaction.
3. Interaction Only: This is primarily reflected when dashed arrows appear at the first level child nodes, but solid arrows appear at all second-level child nodes. This indicates that DIF can only be identified after crossing all levels of the two covariates.
4. Mixed Effects: Solid arrows appear at all first-level child nodes and the left-side second-level child nodes, while the right-side second-level nodes have dashed arrows, indicating the item exhibits both main effects and interactions.
2.4 Type I Error Control
Performing ISRPM based on multiple covariates inevitably involves the issue of multiple testing. In the statistical literature, a critical concept in this context is the familywise error rate (FWER). As defined by Benjamini and Hochberg, the familywise error rate refers to the probability of making at least one Type I error (incorrectly rejecting a true null hypothesis) when simultaneously testing a set of null hypotheses.
When conducting simultaneous hypothesis testing across multiple covariates for a single item, a Type I error is considered to have occurred if any one of these tests results in a false rejection, leading to the item being incorrectly identified as exhibiting differential item functioning (DIF). To effectively control the Family-Wise Error Rate (FWER) during the ISRPM detection process at a given global significance level, this study follows the approach of Berger (2016) by applying a more stringent significance level to each individual test. Specifically, the local significance levels are adjusted based on the overall significance level to maintain control over the global Type I error rate.
This study employs the Bonferroni adjustment method to achieve Type I error control. This method derives the local significance level for each covariate by dividing the overall significance level (set at 0.05 in this study) by the number of simultaneous hypothesis tests performed \cite{Strobl2015, Berger2016, Bollmann2018}. When testing a single item for potential effects across multiple covariates, the overall significance level must be proportionally adjusted according to the number of covariates to counteract the increased probability of false positives inherent in multiple testing. This ensures that the probability of an item being misidentified as having DIF does not exceed $\alpha$. The local significance level for the Bonferroni-adjusted ISRPM can be calculated using the following formula:
$$\alpha_{\text{local}} = \frac{\alpha}{m}$$
where $\alpha$ refers to the global significance level (set to 0.05 in this study) and $m$ refers to the number of covariates being tested.
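A quick numerical check of this adjustment under a two-covariate setting (independence of the tests is assumed only for computing the attained FWER below; the Bonferroni bound itself requires no such assumption):

```python
m = 2                        # number of covariates tested per item
alpha = 0.05                 # global significance level
alpha_local = alpha / m      # Bonferroni-adjusted local level (0.025)
# Attained FWER for m independent true-null tests at the local level;
# it remains at or below the global alpha:
fwer = 1 - (1 - alpha_local) ** m
```

With $m = 2$, each test is run at $0.025$ and the probability of at least one false rejection stays below $0.05$.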
[FIGURE:1] provides a schematic diagram of the recursive partitioning trees for different items under the two-covariate scenario in the ISRPM.
3 Monte Carlo Simulation Study
3.1 Simulation Study Design
The primary objective of this simulation study is to evaluate the performance of the proposed method across various experimental conditions and to compare it with several widely used cognitive diagnostic DIF detection methods (2021). A total of eight manipulated variables were established: (1) sample size, with three levels; (2) DIF magnitude; (3) DIF effects, including main effects only, interaction effects only, and the simultaneous presence of both; (4) item quality, with high and medium levels, where high-quality item parameters were randomly drawn from a uniform distribution $U(0.05, 0.15)$ and medium-quality parameters from $U(0.15, 0.25)$; (5) DIF detection method, including ISRPM, Wald-FS, LR-FS, Wald-RS, LR-RS, and Wald-W; (6) distribution of attribute mastery patterns, either uniform or multivariate normal; (7) correlation between attributes, high or moderate; and (8) whether demographic covariates influence the distribution of attribute mastery patterns (influence present or absent). It should be noted that when the attribute mastery patterns follow a uniform distribution, it is not possible to directly manipulate the correlation between attributes or the influence of demographic covariates; these two factors were therefore manipulated only under the multivariate normal condition. Specifically, when demographic covariates have no influence, the focal and reference groups follow the same multivariate normal distribution; under the influence condition, the reference group's attribute mastery pattern distribution is adjusted accordingly.
By fully crossing these eight manipulated factors, this study initially generated a set of experimental conditions. However, because the uniform-distribution condition for attribute mastery patterns could not be combined with the attribute-correlation or demographic-covariate factors, the invalid combinations were removed, and the remaining valid simulation conditions were retained. Each condition was replicated multiple times. All simulation experiments were conducted in R (R Core Team, 2021). The following fixed conditions were established: (1) the number of cognitive attributes was fixed at a value common in cognitive diagnostic DIF research (2014; Wang, 2015; 2016; 2022); (2) the number of attributes measured per item was capped at a fixed maximum (2016); and (3) the proportion of DIF items relative to the total test length was set according to established literature (2021).
Following widely used designs in previous research (2014; 2021; 2016), each attribute was measured an equal number of times across the items, and the number of items measuring each attribute was kept consistent. Regarding the DIF conditions (2021), the number of covariates and the number of levels per covariate followed the settings of Strobl (2015) and Berger (2016) across all experimental conditions.
3.2 Data Simulation and Detection Procedure
The simulation and detection of response data were conducted using the GDINA package (de la Torre, 2020; Ma & de la Torre, 2021). The specific operational steps are as follows: First, the simGDINA function was used to generate the person parameters and item parameters for the reference group, based on which the response data for the reference group were simulated. Subsequently, response data for the focal group were generated. The probability of a correct response for a participant in the focal group on a specific item was determined based on the correct response probability of the reference group for that same item, adjusted by the calculation method for $\text{DIF}$ used in this study. After generating the response data for both groups, the six detection methods considered in this study (ISRPM, Wald-FS, LR-FS, Wald-RS, LR-RS, and Wald-W) were applied to output the corresponding detection results. It should be noted that no item purification procedures were implemented for any of the detection methods in this study.
Regarding the calculation method for $\text{DIF}$, the study considered scenarios involving only main effects, only interaction effects, and the simultaneous presence of both main and interaction effects. In this study, the values of $\text{DIF}$ were set to $0.1$, $0.15$, and $0.2$, representing small, medium, and large effect sizes, respectively.
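Under the identity link, the additive DIF manipulation described above can be sketched as follows. This is an illustrative assumption about how the adjustment might look, not the authors' code (the study itself generated data with the simGDINA function of the GDINA package); the direction and uniformity of the shift are chosen arbitrarily for the example.

```python
def focal_probs(ref_probs, dif):
    """Shift the reference group's correct-response probabilities on the
    studied item downward by the DIF effect size under an identity link,
    truncating to [0, 1]. ref_probs maps attribute pattern -> P(correct);
    dif is 0.1, 0.15, or 0.2 (small / medium / large). Illustrative only.
    """
    return {patt: min(max(p - dif, 0.0), 1.0) for patt, p in ref_probs.items()}

ref = {"00": 0.10, "10": 0.55, "01": 0.50, "11": 0.90}  # hypothetical item
foc = focal_probs(ref, dif=0.15)                        # medium effect size
```

The focal group's probabilities are then used to simulate its responses, while the reference group's responses come from the unshifted probabilities.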
3.3 Evaluation Metrics
Previous research in cognitive diagnosis has commonly employed the True Positive Rate (TPR) and the False Positive Rate (FPR) to evaluate the performance of detection methods. In this context, TPR is equivalent to statistical power—the ability to correctly identify items that exhibit differential item functioning (DIF)—while FPR corresponds to the Type I error rate, representing the probability of incorrectly flagging a non-DIF item as having DIF. However, these studies have typically focused on the item level and considered only a single covariate, which limits the evaluation to the detection performance relative to that specific variable.
In contrast, this study further considers scenarios where multiple covariates simultaneously contribute to the occurrence of DIF. In such cases, evaluating the performance of a detection method should focus not only on whether it can correctly identify the presence of DIF but also on whether it can accurately locate the specific covariates responsible for the DIF. Consequently, in addition to calculating metrics at the item level, we also compute these two indicators at the intersection of items and covariates to more comprehensively evaluate the overall effectiveness of the detection method. Let the detection result vector for an item across multiple covariates be denoted such that if any element in the vector is non-zero, the item is identified as a DIF item. If all elements in the vector are zero, the item is classified as a non-DIF item. Based on this framework, the evaluation metrics are calculated as follows:
Let $\hat{d}_{jx} \in \{0, 1\}$ denote whether item $j$ is flagged on covariate $x$ and $d_{jx} \in \{0, 1\}$ its true DIF status, with $\hat{d}_j = I(\sum_x \hat{d}_{jx} > 0)$ and $d_j = I(\sum_x d_{jx} > 0)$ the corresponding item-level indicators. The item-level metrics are then
$$\mathrm{TPR}_{\text{item}} = \frac{\sum_j I(\hat{d}_j = 1,\, d_j = 1)}{\sum_j I(d_j = 1)}, \qquad \mathrm{FPR}_{\text{item}} = \frac{\sum_j I(\hat{d}_j = 1,\, d_j = 0)}{\sum_j I(d_j = 0)},$$
and the True Positive Rate and False Positive Rate at the item-covariate combination level are obtained by replacing $\hat{d}_j$ and $d_j$ with $\hat{d}_{jx}$ and $d_{jx}$ and summing over all item-covariate pairs. In these formulas, $I(\cdot)$ denotes the indicator function, which takes the value of 1 if the condition is satisfied and 0 otherwise.
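The two evaluation levels can be computed directly from 0/1 detection matrices, as in this small sketch (the matrices below are made-up toy values; rows are items, columns are covariates):

```python
def rates(detected, truth):
    """Item-level and item-covariate-level (TPR, FPR) from 0/1 matrices.
    detected[j][x] = 1 if the method flags item j on covariate x;
    truth[j][x] = 1 if item j truly exhibits DIF on covariate x."""
    def tpr_fpr(det, tru):
        tp = sum(d for d, t in zip(det, tru) if t == 1)
        fp = sum(d for d, t in zip(det, tru) if t == 0)
        pos = sum(tru)
        neg = len(tru) - pos
        return (tp / pos if pos else 0.0, fp / neg if neg else 0.0)

    # Item level: an item counts as a (detected or true) DIF item if any
    # element of its detection/truth vector is non-zero.
    item = tpr_fpr([int(any(r)) for r in detected],
                   [int(any(r)) for r in truth])
    # Item-covariate level: evaluate every combination separately.
    combo = tpr_fpr([d for r in detected for d in r],
                    [t for r in truth for t in r])
    return item, combo

truth    = [[1, 1], [1, 0], [0, 0]]   # item 1: DIF on both covariates
detected = [[1, 0], [1, 0], [0, 1]]   # hypothetical detection results
item, combo = rates(detected, truth)
```

In this toy case the item-level TPR is perfect, but the combination-level TPR reveals that one true item-covariate effect was missed, illustrating why the finer-grained metric is needed.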
3.4 Results
Due to space constraints, the main text only reports the results under the condition where the attribute mastery patterns follow a uniform distribution.
The overall performance regarding the statistical power and Type I error rates of the detection methods across various experimental conditions is presented in [TABLE:N]. The corresponding complete numerical results are listed in the supplementary tables in the Appendix. Furthermore, when the attribute mastery patterns follow a multivariate normal distribution, the statistical power and Type I error rate results for each detection method under different experimental conditions are presented collectively in the online appendix. Overall, the ISRPM demonstrates acceptable detection performance under most experimental conditions, and its overall performance is superior to the existing FS-Wald and FS-LR methods.
The following sections will further analyze the effects of different manipulated variables on statistical power and Type I error rates.
3.4.1 Statistical Power
The statistical power results of the detection methods at both the item level and the component level under different experimental conditions are presented. When the Differential Item Functioning (DIF) of an item is jointly caused by the main effects and interaction effects of multiple covariates, relying solely on item-level indicators may fail to comprehensively reflect the detection capabilities of a given method. Specifically, item-level indicators only measure whether a detection method can correctly identify the presence of DIF, without further considering which specific covariates lead to that DIF. Even if a method identifies only a subset of the covariates responsible for the DIF, it is still recorded as a correct identification at the item level, which may lead to an overestimation of the method's detection performance. In contrast, component-level indicators not only examine whether the DIF item is correctly identified but also evaluate whether the detection method can accurately locate the specific covariates triggering the DIF. Therefore, the component level provides a more detailed evaluative perspective, better reflecting the detection performance of methods in scenarios involving multiple covariates.
As can be seen from the results, when interaction effects are present in the test, the statistical power of ISRPM is better than that of both FS-Wald and FS-LR under most conditions.
As the sample size per group increases, the statistical power of all methods improves significantly, indicating that sample size is a critical factor influencing detection performance. Overall, ISRPM outperforms the five traditional methods in terms of statistical power. This advantage is particularly pronounced when only interactions are present. However, when the sample size per group is limited to $N=100$, the statistical power of ISRPM is constrained. Under this condition—whether the test contains only main effects or both main effects and interactions—ISRPM performs slightly worse than the other five methods despite the low power across all methods. This result may be related to the fact that during the recursive partitioning process of ISRPM, the sample sizes assigned to child nodes become too small, leading to decreased parameter estimation precision and reduced detection accuracy. Specifically, under the condition of two binary covariates, the sample size assigned to each child node in the first split of ISRPM is $N=50$. If the process proceeds to a second split, the sample size for the four child nodes drops to $N=25$. In such cases, insufficient sample size may increase parameter estimation error, thereby affecting the accuracy of the detection results. Notably, at the same sample size, ISRPM yields relatively better detection results when DIF is caused solely by interactions between covariates. This suggests that the method's ability to account for interactions between multiple covariates during the analysis process allows for higher sensitivity in identifying interactive DIF.
The statistical power of all methods increases significantly as sample size grows. When only main effects exist and the sample size per group is $N=500$, the power of ISRPM rises to 0.99, while the ranges for FS-Wald and FS-LR are 0.67–0.75 and 0.48–0.49, respectively; at $N=1000$, the traditional methods reach approximately 0.96. ISRPM demonstrates relatively higher statistical power in interactive DIF detection: for example, when only interactive DIF exists and the sample size per group is $N=500$, the power of ISRPM is 0.22–0.81, whereas the power of FS-Wald and FS-LR ranges from 0.05–0.08 and 0.08–0.10, respectively (the remaining methods range from 0.06–0.08, 0.08–0.11, and 0.07–0.09). When only main effects are present, the performance of ISRPM under large-sample conditions is comparable to the other methods; however, when the sample size per group drops to $N=100$, the overall statistical power of ISRPM is inferior to the five traditional methods. Overall, when the sample size per group is $N=500$ or above, the general performance of ISRPM in DIF detection is equivalent to traditional methods, and it exhibits higher statistical power in scenarios involving interactions. As item quality improves, the statistical power of all methods shows an upward trend. According to the results in [TABLE:N] of the online appendix, neither the degree of correlation between attributes nor whether demographic covariates affect the distribution of attribute mastery patterns had a significant impact on the statistical power of ISRPM.
Regarding Type I error rates, according to Bradley (1978), at a nominal significance level of $\alpha = 0.05$ with 100 simulation replications, the actual observed Type I error rate may not exactly match the nominal level due to sampling error. For this study ($ \alpha = 0.05, R = 100 $), the robustness interval is $[0.007, 0.093]$. If the observed Type I error rate of a detection method falls within this range, its control is considered reasonable. As shown in [FIGURE:N], ISRPM maintains stable control over Type I error rates across the vast majority of experimental conditions, with overall results falling within the reasonable interval. Further comparison reveals that in most experimental conditions, the Type I error rate of ISRPM is lower than that of the other five traditional methods. This suggests that in scenarios involving the simultaneous detection of multiple covariates, the new method generally outperforms traditional methods in controlling Type I errors. For instance, when only main effects exist, the Type I error control of ISRPM meets expectations, while the ranges for FS-Wald and FS-LR are approximately 0.09–0.12.
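The robustness interval quoted above can be reproduced from the binomial sampling error of a rejection rate observed over $R$ replications; the $\pm 1.96$ normal-approximation bound used here is an assumption that matches the reported endpoints:

```python
import math

alpha, R = 0.05, 100
se = math.sqrt(alpha * (1 - alpha) / R)    # binomial standard error
lo, hi = alpha - 1.96 * se, alpha + 1.96 * se
interval = (round(lo, 3), round(hi, 3))    # -> (0.007, 0.093)
```

An observed Type I error rate inside this interval is thus consistent with the nominal $\alpha = 0.05$ given only 100 replications.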
In some conditions, the results for these methods exceeded the upper bound of the reasonable interval (0.093), whereas ISRPM demonstrated higher stability in Type I error control.
Sample size did not show a significant effect on the Type I error rates of any detection method. Overall, the Type I error rates for all methods were lower under high-quality item conditions compared to medium-quality conditions, which may be attributed to the higher precision of parameter estimation under high-quality items. The research results presented in [TABLE:N] of the online appendix indicate that neither the correlation between attributes nor whether demographic covariates influence the distribution of attribute mastery patterns significantly affects the Type I error rates of the methods.
Statistical power results of DIF detection methods under different experimental conditions.
Type I error rates of DIF detection methods under different experimental conditions.
4.1 Data Description
The empirical analysis of this study utilizes the Diagnostic Classification version of the Schizotypal Personality Questionnaire (DC-SPQ), developed by \cite{2019}, to demonstrate the applicability of the proposed ISRPM in authentic testing scenarios. The results obtained are also compared against five established cognitive diagnostic DIF detection methods.
Each item in the DC-SPQ is scored using a binary method. This study adopts the Q-matrix for the DC-SPQ as defined by \cite{2019} (refer to Table 1 in the original literature). The test data provided by \cite{2019} were collected from 981 students across seven universities in three Chinese cities, with a mean age of 19.81 years ($SD = 1.79$). In this research, gender and registered residence (hukou) are selected as common covariates. Regarding the gender distribution, female participants accounted for 62.3% ($N = 611$); in terms of residence, 43.1% ($N = 423$) of the participants originated from urban areas.
4.2 Results
To comprehensively compare the detection performance of different methods, this study analyzed the DC-SPQ using the proposed ISRPM alongside five existing cognitive diagnostic methods (Wald, FS-Wald, FS-LR, etc.) at overall significance levels of $0.05$ and $0.01$. It should be emphasized that the search termination rules for the optimal splitting covariates and their corresponding split points in the ISRPM remained consistent with those used in the simulation study.
The detection results of the ISRPM showed high consistency with the other five traditional methods across the different significance levels ($0.05$ and $0.01$) for the DC-SPQ. For instance, consider the results at the overall significance level of $0.05$.
The detection results for Item 18, which was identified by all methods as potentially exhibiting Differential Item Functioning (DIF), showed certain discrepancies across the different approaches. ISRPM identified this item as potentially containing interaction DIF caused by the interaction between gender and household registration. In contrast, the DC-SPQ method identified it as potentially exhibiting only a main effect triggered by gender, while both the FS-Wald and FS-LR methods failed to detect any DIF for this item. [FIGURE:1] presents the recursive partitioning tree structure output by ISRPM for Item 18. As shown in the figure, both paths for gender in the first layer are represented by dashed lines, indicating that the item may not exhibit a main effect related to gender. In the second-layer partition, the left branch remains a dashed arrow, while the right branch is a solid arrow. This suggests that there are significant differences in the response patterns of females from different household registration areas, whereas the differences between groups for males across the household registration variable are not significant. Consequently, Item 18 likely contains interaction DIF caused by the interaction between gender and household registration. Although there remains a degree of uncertainty in the post-hoc interpretation of detection results, the empirical analysis demonstrates that ISRPM exhibits a potential advantage in identifying interaction DIF.
It should be noted that the interpretation of detection results should not rely solely on statistical tests. A more reasonable strategy is to combine statistical analysis with expert judgment: first, use statistical methods to screen for items that may contain DIF, and subsequently, have subject matter experts conduct a rigorous review of the content and phrasing of these items. By combining quantitative and qualitative analysis, test developers can more comprehensively evaluate the substantive significance of the detection results, providing a more evidence-based reference for test revision and quality control. [TABLE:1] presents the detection results for each item under different detection methods, including ISRPM, FS-Wald, FS-LR, and DC-SPQ. This table only displays results for items where at least one method detected DIF. For ISRPM, the table lists the first optimal splitting variable automatically searched for the current item—that is, the first-layer splitting variable in the recursive partitioning tree structure. Different items may display different first-layer optimal splitting variables depending on the splitting effect of each variable. In the table, "1" indicates that the current method flagged the item as having DIF, while "0" indicates that the method did not flag the item. Results outside the parentheses represent detection results at an overall significance level of $\alpha = 0.05$, while results inside the parentheses represent results at a significance level of $\alpha = 0.01$.
[FIGURE:1] Recursive partitioning tree diagram output by ISRPM.
Discussion
The Standards for Educational and Psychological Testing emphasize that fairness and accuracy are paramount. Consequently, analyzing Differential Item Functioning (DIF) has become an indispensable quality assessment component in the development and validation of psychological and educational tests, particularly for evaluating test quality and validity. In the context of Cognitive Diagnostic Assessment (CDA), existing DIF detection methods are typically limited to analyzing a single covariate per analysis, thereby restricting their focus to main effects. While these methods are reliable for identifying main effects, they struggle to effectively detect interactive DIF caused by the interaction of multiple covariates, which poses a challenge to the measurement fairness and validity of cognitive diagnostic evaluations. To address this issue, this study integrates Cognitive Diagnostic Models (CDMs) with recursive partitioning techniques from data mining to propose a new DIF detection method, the Item-based Sequential Recursive Partitioning Method (ISRPM). This method aims to provide robust methodological support for detecting both main and interaction effects in cognitive diagnostic tests.
The primary contribution of this research lies in the proposed ISRPM framework, which can simultaneously process multiple covariates and effectively detect both main and interaction effects. This approach is expected to further enhance the detection precision of DIF identification methods within the field of cognitive diagnosis.
To evaluate the detection performance of the proposed method, this study conducted a series of Monte Carlo simulations. The simulations investigated two critical aspects: first, how factors such as sample size per group, item quality, the degree of correlation between attributes, and whether demographic covariates influence the distribution of attribute mastery patterns affect detection performance; and second, a comprehensive performance comparison between the new method and five established cognitive diagnostic DIF detection methods (Wald, LR, Score, FS-Wald, and FS-LR).
To demonstrate the applicability of the method in real-world settings, this study used the cognitive diagnostic version of the Schizotypal Personality Questionnaire (DC-SPQ) as an empirical example and compared the performance of the proposed method against the existing methods.
The results indicate that the sample size per group and item quality are the major factors influencing the detection performance of ISRPM. Overall, ISRPM achieved statistical power and Type I error control comparable to FS-Wald and the other traditional methods in main-effect detection, while outperforming them in interaction-effect detection.
In cases where only main effects exist, the detection performance of ISRPM is comparable to other methods when the sample size reaches 1,000 or more per group. However, when the sample size per group drops to 500, its performance becomes somewhat limited. Notably, neither the degree of correlation between attributes nor the influence of demographic covariates on the distribution of attribute mastery patterns significantly affected the detection performance of ISRPM.
The advantages of ISRPM, as shown by the simulation studies and the empirical analysis, can be summarized as follows. ISRPM is expected to further enhance the accuracy of DIF detection in cognitive diagnostic tests: simulation results indicate that, compared with existing detection methods, ISRPM more accurately identifies both the items exhibiting DIF and the covariates responsible for the measurement bias.
The new method can identify main effects and interactions simultaneously. This capacity to handle interactions is expected to help researchers analyze more comprehensively the complex demographic sources of measurement bias in tests.
ISRPM is expected to provide test developers with more detailed diagnostic information. Compared to traditional methods, ISRPM can not only accurately identify the presence of bias but also reveal how the main effects and interactions of different covariates influence item performance. This allows test developers to propose more targeted item revisions. Based on the results of the simulation and empirical studies, the following recommendations are offered to potential users of ISRPM:
Regarding the choice of cognitive diagnostic model: this study verified the applicability of ISRPM under the saturated G-DINA model through simulation and empirical analysis. The method also extends to several reduced forms of the G-DINA model, such as the DINA and DINO models, so researchers can flexibly select a model based on their research objectives and data characteristics.

Regarding the number and types of covariates: simulation results show that ISRPM supports the simultaneous analysis of two or more covariates and demonstrates good applicability when a study involves multiple covariates. However, as the number of covariates increases, users should enlarge the sample size as much as possible to maintain the stability and detection accuracy of the new method; in multi-covariate scenarios, users therefore need to balance sample size against the number of variables. To further verify the performance of the new method under multicategorical and continuous covariates, two supplementary experiments are reported in the online appendix. Their results indicate that ISRPM maintains acceptable detection performance with more than two covariates and across different types of covariates.
Regarding sample size: Simulation results indicate that sample size is a critical factor affecting the detection performance of ISRPM. When the sample size per group reaches 1,000, ISRPM exhibits ideal detection accuracy under most conditions. However, when the sample size drops to 500, its statistical power shows a downward trend, and its overall performance is inferior to traditional methods. Therefore, in practical applications, this paper suggests ensuring a sample size of at least 1,000 per group whenever possible to obtain more robust detection results.
Regarding the choice of detection methods: Selecting the appropriate method in cognitive diagnostic testing should involve a comprehensive consideration of key factors such as research objectives, sample size, and the number of covariates of interest. While ISRPM shows better results in certain aspects, its performance is not superior to traditional methods across all conditions.
For studies focusing only on main effects with small sample sizes (e.g., fewer than 500 per group), traditional methods are recommended. Conversely, when the sample size per group is large (e.g., 1,000 or more), or when the study concerns both main effects and interactions, ISRPM is the preferred choice. Note that when sample sizes are small, the accuracy of results may be limited regardless of the method used, so conclusions should not be drawn from statistical test results alone.

Regarding item processing: when conducting research in cognitive diagnostic testing, researchers should not rely exclusively on statistical tests. A more robust approach combines statistical analysis with expert evaluation: first use ISRPM or other methods to screen for items that may exhibit bias, and then invite domain experts to review the content and phrasing of these items. By combining quantitative analysis with qualitative judgment, researchers can more accurately determine whether a detected bias has practical significance, thereby providing more evidence-based decision support for test revision and quality control.
5.2 Cognitive Diagnosis
Differences and Connections in DIF Detection Methods: Research on test fairness primarily relies on three major measurement frameworks: Classical Test Theory (CTT), Item Response Theory (IRT), and Cognitive Diagnosis Models (CDM). While DIF detection methods under these three frameworks share common core objectives, detection logic, and fundamental principles, they exhibit distinct differences in their theoretical foundations, equating procedures, and ability-matching criteria.
In terms of core objectives, all methods remain consistent in their aim to identify items that may cause measurement bias, thereby maintaining the fairness of test results. Regardless of the measurement framework employed, they all revolve around a common logic: under the premise of controlling for ability levels, they examine whether there are systematic differences in the performance of different examinee groups on the same item. Finally, these methods all follow the principle of comparing groups after controlling for ability levels, rather than making judgments based directly on raw scores. Specifically, CTT methods use observed total scores for matching, IRT methods match via latent traits, and CDM methods match ability based on the examinees' attribute mastery patterns.
While consistent in core objectives, basic logic, and fundamental principles, these methods differ significantly in their theoretical foundations, equating procedures, and ability-matching standards. First, regarding theoretical foundations, CTT methods are based on Classical Test Theory, IRT methods rely on Item Response Theory, and CDM methods are built on cognitive diagnosis theory. Second, regarding equating procedures, IRT-based DIF detection typically requires equating to place the ability scores of different groups on the same scale for fair comparison. In contrast, the cognitive diagnosis models used in CDM methods possess an inherent parameter-equating property: they primarily measure examinees' mastery of discrete cognitive attributes rather than positioning ability on a continuous latent-trait scale, so CDM-based DIF detection usually requires no additional equating procedure. Finally, regarding ability-matching standards, CTT methods match on observed total scores, IRT methods match on latent trait scores, and CDM methods match on examinees' attribute mastery patterns.
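The contrast in matching criteria can be made concrete. The sketch below is a hypothetical illustration (the function name and toy data are invented for exposition, not taken from any cited package): it compares item pass rates between a reference group (0) and a focal group (1) within each attribute mastery pattern, which is the CDM-style matching described above.

```python
from collections import defaultdict

def pass_rate_gaps_by_profile(responses, groups, profiles):
    """For a single item: focal-minus-reference pass-rate gap computed
    within each attribute mastery pattern (the CDM matching criterion)."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for y, g, prof in zip(responses, groups, profiles):
        strata[tuple(prof)][g].append(y)
    gaps = {}
    for prof, cells in strata.items():
        if cells[0] and cells[1]:  # both groups observed for this pattern
            gaps[prof] = (sum(cells[1]) / len(cells[1])
                          - sum(cells[0]) / len(cells[0]))
    return gaps

# Toy data for one item with two attributes; the focal group (1)
# underperforms only among examinees who have mastered both attributes.
responses = [1, 1, 1, 1, 0, 0, 1, 0]
groups    = [0, 1, 0, 1, 0, 1, 0, 1]
profiles  = [(1, 1), (1, 1), (0, 1), (0, 1), (0, 0), (0, 0), (1, 1), (1, 1)]

gaps = pass_rate_gaps_by_profile(responses, groups, profiles)
print(gaps)
```

A CTT analogue would instead stratify on the observed total score, and an IRT analogue on an estimated continuous trait; the stratification logic is the same, only the matching variable changes.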
5.3 Research Limitations and Future Outlook
Although this study has made progress in detecting main-effect and interactive DIF within cognitive diagnostic assessments, several limitations and directions for improvement remain, as follows:
First, compared with the FS-Wald and FS-LR methods, ISRPM shows room for improvement under small-sample conditions and requires further refinement. Second, this study focused on dichotomously scored cognitive diagnosis models (e.g., G-DINA); the application of recursive partitioning techniques in polytomously scored and multi-strategy CDMs remains unexplored, and future research could extend ISRPM to these frameworks. Third, like other recursive partitioning techniques \cite{Strobl2016}, the current ISRPM supports only binary splits, which may limit its ability to identify complex DIF patterns manifested across multiple intervals of a single covariate; future work could extend ISRPM to multiway splits to enhance its application in more complex scenarios. Finally, although this study systematically evaluated ISRPM through simulation experiments, the differences between simulation and empirical research must be acknowledged. Simulation conditions are meticulously designed to control variables strictly, allowing a method's validity and effectiveness to be examined under varied test conditions, whereas empirical data structures are often more complex and susceptible to measurement error, heterogeneity in examinee characteristics, and potential confounding factors. Simulation results should therefore be viewed as a preliminary validation of the new method's detection performance rather than direct evidence of its effectiveness in operational testing. Future research should further examine the applicability and robustness of ISRPM using more diverse types and larger samples of empirical data.
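As an aside on the binary-split limitation noted above: with a categorical covariate of $K$ levels, a binary split must merge the levels into two blocks, so a tree method searches $2^{K-1}-1$ candidate groupings at each node, whereas a multiway split would give each level its own branch. A small hypothetical helper (not part of ISRPM) enumerates that search space:

```python
from itertools import combinations

def binary_partitions(levels):
    """All ways to divide a set of covariate levels into two nonempty
    blocks; each corresponds to one candidate binary split at a node."""
    levels = sorted(levels)
    parts = []
    for k in range(1, len(levels) // 2 + 1):
        for left in combinations(levels, k):
            right = tuple(l for l in levels if l not in left)
            if len(left) == len(right) and right < left:
                continue  # skip mirror duplicates when block sizes match
            parts.append((left, right))
    return parts

print(len(binary_partitions(["A", "B", "C"])))        # 3 candidate splits
print(len(binary_partitions(["A", "B", "C", "D"])))   # 7 candidate splits
```

The count grows exponentially in $K$, which is one reason multiway splitting is an attractive, if more complex, extension.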
5.4 Conclusions
The primary conclusions of this study are as follows. This research proposes the Item-based Sequential Recursive Partitioning Method (ISRPM) within a cognitive diagnosis framework to simultaneously detect main and interaction effects. The method achieves a full integration of recursive partitioning techniques with cognitive diagnostic models.
By integrating these approaches, the method is not only capable of processing multiple covariates simultaneously in a single detection procedure but can also identify specific items exhibiting differential item functioning (DIF) along with the specific covariates that lead to such manifestations.
Overall, ISRPM demonstrates acceptable performance. Its general effectiveness is largely consistent with the FS-Wald and FS-LR methods; however, it exhibits relatively superior identification capabilities when dealing with interaction effects.
Comprehensive simulation results indicate that ISRPM maintains ideal detection performance when the sample size per group is large, the effect size is substantial, and the item quality is high.
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: AERA.
Barnett, V., & Lewis, T. (1994). Outliers in statistical data. Hoboken, NJ: Wiley.
Bauer, D. J. (2017). A more general model for testing measurement invariance and differential item functioning. Psychological Methods.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological).
Belzak, W. C. M. (2023). The multidimensionality of measurement bias in high-stakes testing: Using machine learning to evaluate complex sources of differential item functioning. Educational Measurement: Issues and Practice.
Belzak, W. C. M., & Bauer, D. J. (2020). Improving the assessment of measurement invariance: Using regularization to select anchor items and identify differential item functioning. Psychological Methods.
Bollmann, S., Berger, M., & Tutz, G. (2018). Item-focused trees for the detection of differential item functioning in partial credit models. Educational and Psychological Measurement.
Collins, P. H. (1990). Black feminist thought: Knowledge, consciousness, and the politics of empowerment. Boston, MA: Unwin Hyman.
de la Torre, J. (2011). The generalized DINA model framework. Psychometrika.
de la Torre, J., & Rossi, G. (2018). Analysis of clinical data from a cognitive diagnosis modeling framework. Measurement and Evaluation in Counseling and Development.
DiBello, L. V., Roussos, L. A., & Stout, W. (2006). A review of cognitively diagnostic assessment and a summary of psychometric models. Handbook of Statistics.
Finch, Finch, & French. (2015). Recursive partitioning to identify potential causes of differential item functioning in cross-national data. International Journal of Testing.
Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics.
Hou, L. (2013). Differential item functioning assessment in cognitive diagnostic modeling: Applying the Wald test to investigate DIF in the generalized DINA model framework (Unpublished doctoral dissertation). University of Delaware.
Hou, L., de la Torre, J., & Nandakumar, R. (2014). Differential item functioning assessment in cognitive diagnostic modeling: Application of the Wald test to investigate DIF in the DINA model. Journal of Educational Measurement.
Komboz, B., Strobl, C., & Zeileis, A. (2016). Tree-based global model tests for polytomous Rasch models. Educational and Psychological Measurement.
Leighton, J. P., & Gierl, M. J. (2007). Cognitive diagnostic assessment for education: Theory and applications. Cambridge, UK: Cambridge University Press.
(2008). A modified higher-order model for detecting differential item functioning and differential attribute functioning (Unpublished doctoral dissertation). University of Georgia.
Zhou, Huang, & Yang. (2020). Assessing kindergarteners' mathematics problem solving: The development of a cognitive diagnostic test. Studies in Educational Evaluation.
Li, X., & Wang, W.-C. (2015). Assessment of differential item functioning under cognitive diagnosis models: The DINA model example. Journal of Educational Measurement.
Tian. (2016). An improved method for differential item functioning detection in cognitive diagnosis models: An application of the Wald statistic based on the observed information matrix. Acta Psychologica Sinica, 48(5), 588–598.
Ma, W., & de la Torre, J. (2020). GDINA: An R package for cognitive diagnosis modeling. Journal of Statistical Software, 93(14).
Ma, W., Terzi, R., & de la Torre, J. (2021). Detecting differential item functioning using multiple-group cognitive diagnosis models. Applied Psychological Measurement.
Magis, D., Béland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods.
Meade, A. W., & Wright, N. A. (2012). Solving the measurement invariance anchor item problem in item response theory. Journal of Applied Psychology.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika.
Nichols, P. D., Chipman, S. F., & Brennan, R. L. (1995). Cognitively diagnostic assessment. Routledge.
Paulsen, J., Svetina, D., Feng, Y., & Valdivia, M. (2020). Examining the impact of differential item functioning on classification accuracy in cognitive diagnostic models. Applied Psychological Measurement.
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York, NY: Guilford Press.
Strobl, C., Kopf, J., & Zeileis, A. (2015). Rasch trees: A new method for detecting differential item functioning in the Rasch model. Psychometrika.
Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods.
Wang, Song, & Zhou. (2022). Multi-group differential item functioning in cognitive diagnosis models based on the information matrix. Journal of Psychological Science, (3).
de la Torre, & Larimer. Characterizing mental health symptom profiles. Prevention Science: Official Journal of the Society for Prevention Research.
Tay, L., Huang, Q., & Vermunt, J. K. (2015). Item response theory with covariates (IRT-C): Assessing item recovery and differential item functioning for the three-parameter logistic model. Educational and Psychological Measurement.
Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods.
Wang. (2019). Advanced cognitive diagnosis. Beijing: Beijing Normal University Publishing Group.
Tutz, G., & Berger, M. (2016). Item-focused trees for the identification of items in differential item functioning. Psychometrika.
Wang. (2019). Development of an instrument for depression based on cognitive diagnosis models. Frontiers in Psychology.
Wang. (2019). Development and verification of a cognitive diagnostic test for cross-grade pupils' mathematics learning ability. China Examinations, 8, 71–78.
Wang, & Bian. (2014). Comparison of differential item functioning detection methods in cognitive diagnostic tests. Acta Psychologica Sinica, 46(12), 1923–1932.
Introduction
Cognitive Diagnostic Models (CDMs), also known as Diagnostic Classification Models (DCMs), are a class of restricted latent class models that have gained significant attention in the fields of psychology and education. Unlike traditional Item Response Theory (IRT), which provides a single continuous score representing a latent trait, CDMs are designed to assess the specific strengths and weaknesses of individuals across a set of fine-grained attributes. By providing a multi-dimensional profile of mastery or non-mastery, CDMs offer more detailed feedback for formative assessment and personalized intervention.
A critical requirement for the validity and fairness of any psychometric assessment is that the test items function consistently across different groups of examinees (e.g., gender, ethnicity, or socioeconomic status). When examinees from different groups who possess the same underlying ability or attribute profile have different probabilities of correctly answering an item, the item is said to exhibit Differential Item Functioning (DIF). In the context of CDMs, detecting DIF is essential to ensure that the diagnostic profiles generated are accurate and not biased by group membership.
Differential Item Functioning in CDMs
In cognitive diagnosis, DIF occurs when the item response function differs across groups for examinees with the same attribute mastery pattern. Specifically, let $\boldsymbol{\alpha}$ represent the attribute mastery profile. DIF exists for item $j$ if:
$$P(X_j = 1 | \boldsymbol{\alpha}, G = R) \neq P(X_j = 1 | \boldsymbol{\alpha}, G = F)$$
where $G$ denotes group membership (Reference group $R$ or Focal group $F$). If an item exhibits DIF, the parameters governing the item—such as the slipping parameter ($s_j$) and the guessing parameter ($g_j$) in the DINA model—will differ between groups.
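The definition above can be made concrete in the DINA case: DIF is present when the same ideal response state yields different success probabilities across groups because $s_j$ and/or $g_j$ differ. The following sketch uses hypothetical parameter values in which only the slipping parameter differs between groups.

```python
# Minimal DIF illustration for a DINA item (hypothetical parameter values).
def dina_prob(eta, s, g):
    # eta is the ideal response state (1 = masters all required attributes)
    return (1 - s) ** eta * g ** (1 - eta)

# Reference (R) and focal (F) groups share eta but not the slipping parameter.
params = {"R": {"s": 0.10, "g": 0.20}, "F": {"s": 0.25, "g": 0.20}}

p_master_R = dina_prob(1, **params["R"])
p_master_F = dina_prob(1, **params["F"])
p_nonmaster_R = dina_prob(0, **params["R"])
p_nonmaster_F = dina_prob(0, **params["F"])

# DIF: conditional on the same mastery state, the two groups have
# different success probabilities (here only among masters).
has_dif = p_master_R != p_master_F
```

Because only $s_j$ differs here, the group difference appears for masters (0.90 vs. 0.75) but not for non-masters, a pattern analogous to nonuniform DIF.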
[TABLE:1]
Methods for Detecting DIF
Several methods have been proposed and adapted for detecting DIF within the CDM framework. These methods can generally be categorized into those based on the Wald test, Likelihood Ratio Tests (LRT), and various Mantel-Haenszel (MH) adaptations.
1. The Wald Test
The Wald test is a flexible approach that compares item parameters estimated separately for the reference and focal groups. For a given item $j$, the null hypothesis $H_0: \boldsymbol{\beta}_{j,R} = \boldsymbol{\beta}_{j,F}$ states that the item parameter vectors are equal across groups; rejecting it flags the item as exhibiting DIF.
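The test statistic can be sketched as follows. This is a simplified illustration assuming a two-parameter item (e.g., DINA with $\boldsymbol{\beta}_j = (s_j, g_j)$); the estimates and covariance matrices are hypothetical stand-ins for the output of a CDM estimation routine.

```python
import numpy as np

def wald_dif(beta_R, beta_F, cov_R, cov_F):
    """W = d' (Cov_R + Cov_F)^{-1} d, with d = beta_R - beta_F.

    Under H0 (no DIF), W is asymptotically chi-square distributed
    with df = len(d) degrees of freedom.
    """
    d = beta_R - beta_F
    return float(d @ np.linalg.inv(cov_R + cov_F) @ d)

beta_R = np.array([0.10, 0.20])   # (s, g) estimated in the reference group
beta_F = np.array([0.22, 0.20])   # slipping differs in the focal group
cov = np.diag([0.001, 0.001])     # assumed estimation covariance (per group)

W = wald_dif(beta_R, beta_F, cov, cov)
# Compare W with the chi-square(2) critical value 5.991 at alpha = .05.
flag_dif = W > 5.991
```

In practice the covariance matrices come from the information matrix of the fitted model, and the degrees of freedom equal the number of constrained item parameters.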
Development of a Main Effect and Interactive DIF Detection Method for Cognitive Diagnosis Assessments: A Recursive Partitioning-Based Perspective
Zhichen Daxun Dongbo
(1 School of Psychology, Jiangxi Normal University, Nanchang 330022, China)
(2 College of Psychology, Liaoning Normal University, Dalian 116029, China)
Abstract
With the growing recognition of the advantages of cognitive diagnosis in psychological and educational measurement, applying this framework to test development has become an important research direction in the field of psychology. In the development of cognitive diagnostic assessments, detecting differential item functioning (DIF) remains a crucial quality control procedure to ensure fairness and validity. However, existing CD-based DIF detection methods typically focus on a single covariate at a time. While these approaches are effective in identifying main effect DIF induced by a single covariate, they are limited in detecting interactive DIF caused by the interaction among multiple covariates. These limitations may compromise the fairness and interpretability of assessment outcomes. To address this issue, the present study integrates cognitive diagnostic modeling with recursive partitioning techniques, proposing a novel DIF detection method, namely the Item-based Sequential Recursive Partitioning Method (ISRPM). Building on the principles of recursive partitioning, ISRPM allows the simultaneous consideration of multiple covariates within a single detection procedure and facilitates the identification of both main effect and interactive DIF in cognitive diagnostic assessments. To evaluate the performance of the proposed method, a series of Monte Carlo simulation studies was conducted with two objectives: (a) examining how factors such as sample size per group, DIF magnitude, DIF type, item quality, correlations among attributes, and the influence of demographic covariates on the attribute mastery distribution affect the performance of ISRPM; and (b) comparing ISRPM with several existing DIF detection methods across varied experimental conditions. In addition, to illustrate its practical utility, ISRPM was applied to the cognitive diagnostic version of the Schizotypal Personality Questionnaire (DC-SPQ) and compared with established DIF detection methods.
The results showed that sample size, DIF magnitude, and item quality substantially influenced the performance of all methods. For items exhibiting interactive DIF, ISRPM achieved higher detection accuracy than the Wald, FS-Wald, FS-LR, and Mantel-Haenszel approaches; when only a main effect was present, the overall performance of ISRPM was comparable to that of existing methods.
These findings suggest that ISRPM provides a flexible and effective framework for identifying both main effect and interactive DIF in cognitive diagnostic assessments, thereby contributing to methodological advancements in fairness evaluation and to the broader application of CD-based measurement in psychological and educational practice.
Keywords
cognitive diagnosis assessments; differential item functioning; main effect DIF; interactive DIF; recursive partitioning
Statistical power of the cognitive diagnostic DIF detection methods (ISRPM, FS-Wald, FS-LR) under various experimental conditions, assuming a uniform distribution of attribute mastery patterns; results are reported by sample size per group for three DIF conditions: main effect only, simultaneous main and interaction effects, and interaction effect only.
Type I error rates of the same methods under the same experimental conditions.