Impact of differential item functioning on group score reporting in the context of large-scale assessments

Joo, Sean; Ali, Usama; Robin, Frederic; Shin, Hyo Jeong

doi:10.1186/s40536-022-00135-7

Research
Open access
Published: 15 November 2022

Sean Joo¹ (ORCID: orcid.org/0000-0003-4861-4362), Usama Ali²,⁴, Frederic Robin² & Hyo Jeong Shin³

Large-scale Assessments in Education volume 10, Article number: 18 (2022)

A Correction to this article was published on 22 December 2022

This article has been updated

Abstract

We investigated the potential impact of differential item functioning (DIF) on group-level mean and standard deviation estimates using empirical and simulated data in the context of large-scale assessment. For the empirical investigation, PISA 2018 cognitive domains (Reading, Mathematics, and Science) data were analyzed using Jackknife sampling to explore the impact of DIF on the country scores and their standard errors. We found that the countries that have a large number of DIF items tend to increase the difference of the country scores computed with and without the DIF adjustment. In addition, standard errors of the country score differences also increased with the number of DIF items. For the simulation study, we evaluated bias and root mean squared error (RMSE) of the group mean and standard deviation estimates using the multigroup item response theory (IRT) model to explore the extent to which DIF items create a bias of the group mean scores and how effectively the DIF adjustment corrects the bias under various conditions. We found that the DIF adjustment reduced the bias by 50% on average. The implications and limitations of the study are further discussed.

Introduction

The core purpose of national and international large-scale assessments (LSAs), such as the National Assessment of Educational Progress (NAEP), the Programme for International Student Assessment (PISA), and the Programme for the International Assessment of Adult Competencies (PIAAC), is to compare educational quality among regions, states, and nations. Such comparisons provide important insights for educational researchers and policymakers to evaluate the current educational system and students’ academic progress over time (e.g., Cosgrove & Cartwright 2014; Neumann et al., 2010). To achieve this, establishing a high level of scale comparability is a critical requirement. Scale comparability refers to the condition in which assessments are comparable across all country- or state-level groups and across assessment cycles, such that the group-level scores are on the same metric (e.g., Mazzeo & von Davier 2014; Oliveri & von Davier, 2014).

In LSAs, item response theory (IRT) methodology has been implemented in the scaling procedure to establish a common metric allowing for comparability across participating groups and assessment cycles. IRT analysis also allows researchers to investigate psychometric properties of items (i.e., slopes and difficulties), reliability, and validity of the assessments in general. For example, both PISA and PIAAC incorporated a two-parameter logistic model (2PLM; Birnbaum 1968) and generalized partial credit model (GPCM; Muraki 1992) as measurement models for dichotomous and polytomous response items, respectively. Moreover, multigroup IRT scaling (a.k.a. concurrent calibration; Bock & Zimowski 1997; Kolen & Brennan, 2014) has been implemented since the PISA 2015 main survey to put the multiple country-by-language-by-cycle scores on the same metric (OECD, 2016). Specifically, items are calibrated simultaneously with the equality constraint on their parameters across the participating countries and economies and assessment cycles in the multigroup IRT model (von Davier et al., 2019). These estimated item parameters are also referred to as international or common item parameters that contribute to the scale comparability.

Recently, educational researchers and practitioners have raised a practical question regarding common item parameters in LSAs: Does item calibration with an equality constraint present a scaling approach that is too restrictive? (e.g., Rutkowski et al., 2010; Rutkowski & Svetina, 2014; Svetina & Rutkowski, 2014; Switzer et al., 2017). Wu (2010) also noted that LSAs generally include various sources of error induced by measurement, sampling, and equating procedures, and that the scaling procedure should identify the source and magnitude of error to increase the validity of the results. It has been explicitly argued that these sources of error tend to create item misfit relative to the set of common item parameters for a particular group or assessment cycle (Oliveri & von Davier, 2011; Oliveri & von Davier, 2014). For example, in PISA, the target population is considerably diverse in the sense that each student’s country, language, culture, ethnicity, socioeconomic status, and background are different within the assessment sample (de Jong et al., 2007; Kreiner & Christensen, 2014; Sachse et al., 2016). Several cross-country studies have also reported that scale comparability is not easily retained, and that country-, language-, or culture-specific item parameter calibration should be carefully investigated (e.g., Ercikan 2002; Ercikan & Koh, 2005; Gierl & Khaliq, 2001).

For these reasons, the heterogeneity of item parameters in LSAs should be considered in the scaling process, and several studies have suggested a group-specific or assessment cycle-specific item parameter approach to increase the validity of the scores as well as to improve the measurement precision (Oliveri & von Davier, 2011; Oliveri & von Davier, 2014). In addition, for trend items, which refer to items that were previously administered in past assessment cycles, fixed item parameter calibration (FIPC) has been suggested and implemented in the operational scaling procedure. FIPC links the previously administered scales to the current assessment scales by “fixing” the estimated item parameters from the previous assessment cycles, including both international and country-specific item parameters. Using this approach, the linking error from cycle to cycle can be substantially reduced without losing the validity of the scales. A previous study demonstrated the benefits of the FIPC approach in the context of PISA (König et al., 2021).

Regardless of group- or assessment cycle-specific item parameters, item misfit should be precisely addressed in the IRT scaling procedure to obtain accurate group proficiency estimates that are comparable across groups and cycles. Item misfit is also referred to as the lack of measurement invariance (Meredith, 1993), item bias (Lord, 1980), differential item functioning (DIF; Holland & Thayer 1988), or item-by-country interaction (OECD, 2016) in the educational measurement literature. Because the main scope of the study is closely related to this type of item misfit in the context of LSAs, we henceforth refer to the misfit of items as DIF throughout the paper. In addition, it is worth noting that in this study we treated DIF as a fixed effect, which is consistent with the PISA operational approach. However, previous LSA studies have also treated DIF as a random effect and used hierarchical random effect models to examine the effect of DIF (de Jong et al., 2007; Fox & Verhagen, 2018).

DIF Adjustment and Group Score Estimation

To address DIF items in IRT scaling, the unique item parameter calibration approach has been proposed and implemented in operational settings (OECD, 2016, 2019). Specifically, in subsequent IRT scaling procedures, unique item parameters (group-specific or cycle-specific) are separately estimated for the items detected as DIF. This adjustment for DIF items has several advantages in terms of psychometric properties and scale comparability in the context of LSAs. First, estimating unique item parameters significantly improved the overall model fit (Joo et al., 2021; Oliveri & von Davier, 2011; Oliveri & von Davier, 2014; Rutkowski & Svetina, 2014, 2017). For example, Oliveri & von Davier (2011) applied various IRT models to the PISA 2006 cognitive domain data and compared the fitted models. They concluded that the multigroup 2PLM with partially unique item parameters was the best fitting model based on several model fit indices, including the Akaike Information Criterion (AIC; Akaike 1974) and the Bayesian Information Criterion (BIC; Schwarz 1978).

Second, the group-specific unique item parameter approach for addressing DIF items can reduce the bias in the group score estimates and increase the stability of group rankings (Rutkowski & Rutkowski, 2018). Note that the bias in group means depends on the interplay of the distribution of DIF effects and the chosen linking method (Robitzsch, 2021). In this study, we mainly focus on the bias of the group mean caused by the DIF distribution. Although the true group mean parameters and group rankings are unknown in real data applications, simulation studies have shown that group-specific unique item parameter estimation can produce accurate group mean parameter estimates. More specifically, group-specific unique item parameters reduce the bias, which is defined as the difference between the estimated and generating group mean parameters. For example, Rutkowski et al. (2016) conducted a simulation study that mimicked the PISA 2009 main survey design and investigated the country achievement estimates. They compared several approaches for computing country achievement estimates by varying the samples for item parameter calibration. Their results showed that the most restrictive sample, with common item parameters, produced bias in the country achievement estimates of up to 12.49 points on the PISA scale, and that the less restrictive sample reduced the gap between the true and estimated country achievement.

Purpose

Although several previous studies have shown the psychometric benefits of allowing unique item parameters for DIF items, it is not yet known what proportion of unique item parameters is acceptable without harming the comparability of test results across participating groups and cycles. For example, in operational LSAs, groups that require a large proportion of unique item parameters are investigated further during IRT scaling, because a large proportion may indicate data quality or integrity issues that affect group score comparability. In this study, we defined group score comparability as the proportion of international item parameters (i.e., invariant item parameters) across country and language groups (OECD, 2019). Although Rutkowski et al. (2016) reported that a less restrictive calibration sample yielded less biased country performance scores, it is unknown whether allowing country-, language-, or cycle-specific unique item parameters can still produce comparable group score results. More importantly, a simulation study that manipulates the level of DIF is needed to systematically examine the extent to which the group scores are biased by the DIF items and how effectively the DIF adjustment corrects the bias.

Therefore, the purposes of the study are (a) to examine the issue with DIF items in the context of LSAs, (b) to quantify their impact to help researchers and practitioners interpret group-level score results more carefully, and (c) to provide an empirically-based recommendation for addressing the issues with DIF items. To achieve such purposes, we conducted two studies: In the first study, we compared the precision and reliability of country-level score estimates computed with and without the DIF adjustment using the PISA 2018 main survey data. We incorporated the Jackknife sampling method to obtain the country score difference and standard error estimates. In the second study, we conducted a simulation study to investigate the impact of DIF items and the adjustment on the group mean and standard deviation estimates using the multigroup IRT scaling approach.

Study 1

PISA 2018 main survey data

To empirically investigate the impact of the unique item parameter approach on the group scores in the context of LSAs, we analyzed the PISA 2018 cognitive domains (Reading, Mathematics, and Science) main survey data. In the PISA 2018 main survey, Reading was the major domain in that Reading items were administered to all students, and Mathematics and Science were the minor domains in that only one of them, either Mathematics or Science, was administered to each student. Depending on the participating country, PISA has been administered as either a computer-based assessment (CBA) or a paper-based assessment (PBA) since the 2015 cycle. In 2018, 70 countries participated in CBA and nine countries participated in PBA. A total of 244 Reading items were administered for CBA countries, consisting of 172 new items and 72 trend items. For PBA countries, 72 Reading trend items were administered. For Mathematics, 83 items were administered to both CBA and PBA countries, and for Science, 115 items were administered to CBA countries and 85 to PBA countries, all of which had been administered in the previous cycle. In this analysis, we considered all cognitive major and minor domain data and included both CBA and PBA countries. Moreover, we considered country-by-language groups, where countries that have multiple languages were divided into multiple language groups. Thus, the total number of country-by-language groups was 85 for CBA countries and 12 for PBA countries. Note that the country-by-language group approach is consistent with the PISA 2018 operational procedure (OECD, 2019).

As is typical for LSAs, PISA incorporates the balanced incomplete block (BIB) design for test administration. In the BIB design, students are required to respond to only a subset of the total item pool. The BIB design is a commonly used test administration design, especially for LSAs, because large-scale surveys generally cover a broad range of content and information. Using the BIB design, unbiased group-level score estimates can be obtained without overwhelming participating students with a large number of items. However, because only a subset of items is administered to each student, a large proportion of data is missing by design. In this analysis, these missing responses were excluded from the IRT scaling and did not contribute to the item parameter estimation. Finally, we used senate weights so that each country's weighted sample size was equal to 5,000 (OECD, 2019).

IRT scaling and group score estimation

To analyze PISA cognitive domain data, we conducted multigroup IRT calibration (Bock & Zimowski, 1997). We initially applied the equality constraint across all country-by-language groups. More specifically, for new items, item parameters of all groups were estimated to be the same across groups. For trend items, item parameters were fixed at the estimates from previous assessment cycles. Note that the fixed item parameters were concurrently calibrated from data collected from PISA cycles 2006 to 2015. This fixed item parameter calibration approach is commonly used in operational settings to put the current assessment cycle scales on the same metric. Data from each PISA cognitive domain (Reading, Mathematics, and Science) was separately calibrated with the IRT models, such as 2PLM for dichotomous responses and GPCM for polytomous responses. The multigroup 2PLM (Eq. 1) and GPCM (Eq. 2) probability functions are described as:

$$P\left({X}_{ijg}=1|{\theta }_{ig}\right)=\frac{\text{exp}\left[D{a}_{j}\left({\theta }_{ig}-{b}_{j}\right)\right]}{1+\text{exp}\left[D{a}_{j}\left({\theta }_{ig}-{b}_{j}\right)\right]}$$

(1)

$$P\left({X}_{ijg}=k|{\theta }_{ig}\right)=\frac{\text{exp}\left[\sum _{r=0}^{k}D{a}_{j} \left({\theta }_{ig}-{b}_{j}+{t}_{jr}\right)\right]}{\sum _{u=0}^{{m}_{j}}\text{exp}\left[\sum _{r=0}^{u}D{a}_{j} \left({\theta }_{ig}-{b}_{j}+{t}_{jr}\right)\right]}$$

(2)

where ${\theta }_{ig}$ is the latent trait parameter for the $i$th student in the $g$th group, ${a}_{j}$ is the discrimination parameter, ${b}_{j}$ is the difficulty parameter, and ${t}_{jr}$ is the category threshold parameter of the $j$th item. D is the scaling constant for the logit link function, set to 1.7. For the GPCM, ${m}_{j}$ is the number of response categories of the $j$th item minus 1 (i.e., ${X}_{ijg}$ = 0, 1, …, ${m}_{j}$), and the category threshold parameters satisfy the additional constraints ${t}_{j0}=0$ and $\sum _{r=1}^{{m}_{j}}{t}_{jr}=0$. In the multigroup structure, ${\theta }_{ig}$ is assumed to be distributed as N(${\mu }_{g}$, ${\sigma }_{g}^{2}$), where ${\mu }_{g}$ is the mean and ${\sigma }_{g}^{2}$ is the variance of the $g$th group. The parameters of the multigroup 2PLM and GPCM were estimated using marginal maximum likelihood (MML) estimation with the expectation-maximization (EM) algorithm (Bock & Aitkin, 1981).
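As a minimal sketch of Eqs. 1 and 2 (not the operational scaling software), the two response functions can be written in plain Python. The function names `p_2pl` and `gpcm_probs` are hypothetical, and the threshold list is assumed to start with ${t}_{j0}=0$.

```python
import math

D = 1.7  # scaling constant stated in the text

def p_2pl(theta, a, b):
    """Eq. 1: probability of a correct response under the 2PLM."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def gpcm_probs(theta, a, b, t):
    """Eq. 2: probabilities of categories k = 0, ..., m_j under the GPCM.

    t is the list [t_j0, t_j1, ..., t_jm], assumed to start with t_j0 = 0.
    """
    # Running sums over r = 0..k of D * a * (theta - b + t_r)
    cumulative, total = [], 0.0
    for t_r in t:
        total += D * a * (theta - b + t_r)
        cumulative.append(total)
    # Subtract the maximum before exponentiating for numerical stability
    top = max(cumulative)
    expd = [math.exp(c - top) for c in cumulative]
    denom = sum(expd)
    return [e / denom for e in expd]

# With two categories and zero thresholds, the GPCM reduces to the 2PLM
print(gpcm_probs(0.7, 1.1, 0.2, [0.0, 0.0])[1], p_2pl(0.7, 1.1, 0.2))
```

The final line illustrates that the dichotomous case of Eq. 2 coincides with Eq. 1, which is why a single calibration can mix both item types.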

To evaluate the group-level score accuracy and precision, we estimated the group-specific posterior mean and standard deviation estimates from the multigroup IRT model. The mean and standard deviation estimates were estimated for each country-by-language group, and the estimates were rescaled to the PISA scale using the transformation coefficients. The transformation coefficients consist of scaling (A) and centering (B) factors and can be used to make a linear transformation from the logit scale to the PISA scale for group scores.

$$PISA_{g}= A{\mu }_{g}+B$$

(3)

Each of the PISA cognitive domains has different transformation coefficients. In this study, we used the transformation coefficients provided in the PISA 2015 Technical Report (OECD, 2016). For example, the scaling factor A was 131.58, and the centering factor B was 437.95 for the Reading domain. Similarly, for Mathematics and Science, respectively, the scaling factors A were 135.90 and 168.32, and the centering factors B were 514.18 and 494.54. A detailed description of how the transformation coefficients are computed is given in the PISA 2015 Technical Report (OECD, 2016). It is important to note that the rescaled group score estimates considered in this study differ from those of the typical LSA operational procedure. In operational settings, a latent regression model is generally used to address the heterogeneity of the group population distribution (Mislevy, 1984; Mislevy et al., 1992). Moreover, several plausible values (PVs) are randomly drawn from the posterior distributions for individuals to compute the proficiency estimates for groups (von Davier et al., 2009). However, our preliminary studies found high Pearson correlations between the rescaled PISA country mean scores and the PV-based PISA country mean scores (above 0.95 across all domains). In addition, because the main purpose of the study is to investigate the impact of DIF on group score estimates, we used the direct estimates of the group scores from the multigroup IRT model and transformed the estimates to the PISA scale. This approach can also reduce possible confounding effects from the latent regression model and the PV procedure.
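The linear transformation in Eq. 3 can be illustrated with the coefficients quoted above. This is only a sketch: the names `COEFFS`, `to_pisa_scale`, and `pisa_difference` are hypothetical, and the input means are made-up logit values.

```python
# (A, B) pairs as quoted in the text from the PISA 2015 Technical Report
COEFFS = {
    "Reading": (131.58, 437.95),
    "Mathematics": (135.90, 514.18),
    "Science": (168.32, 494.54),
}

def to_pisa_scale(mu_g, domain):
    """Eq. 3: rescale a group mean from the logit scale to the PISA scale."""
    A, B = COEFFS[domain]
    return A * mu_g + B

def pisa_difference(mu_final, mu_initial, domain):
    """Difference of two rescaled means; the centering factor B cancels."""
    A, _ = COEFFS[domain]
    return A * (mu_final - mu_initial)

print(to_pisa_scale(0.5, "Mathematics"))
print(pisa_difference(0.10, 0.05, "Reading"))
```

Because B cancels in any difference of two rescaled means, a shift in a group mean on the logit scale maps to the PISA scale through the scaling factor A alone.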

DIF detection and adjustment

After the initial multigroup IRT scaling, we evaluated item fit for each item-by-group combination using two quantities: the mean deviation (MD) and the root mean squared deviation (RMSD). The MD and RMSD for item j in group g were computed as:

$$M{D}_{jg}=\int \left[{P}_{jg}^{obs}\left(\theta \right)-{P}_{jg}^{exp}\left(\theta \right)\right]{f}_{g}\left(\theta \right)d\theta$$

(4)

$$RMS{D}_{jg}=\sqrt{\int {{\left[{P}_{jg}^{obs}\left(\theta \right)-{P}_{jg}^{exp}\left(\theta \right)\right]}^{2}f}_{g}\left(\theta \right)d\theta }$$

(5)

where ${P}_{jg}^{obs}\left(\theta \right)$ indicates the group-specific observed item characteristic curve (ICC) of item j, and ${P}_{jg}^{exp}\left(\theta \right)$ indicates the group-specific expected ICC of item j. ${f}_{g}\left(\theta \right)$ represents the estimated density function for group g. The integrals in Eqs. 4 and 5 are approximated with Gauss-Hermite quadrature points ranging from −5 to 5 (von Davier, 2005).

To compute the observed ICC probability, we used the following definition:

$${P}_{jg}^{obs}\left(\theta \right)=\sum _{i=1}^{N}\frac{{x}_{ijg}{L}_{ig}\left(\theta |\varvec{X}\right){A}_{g}\left(\theta \right)}{{\sum }_{q=1}^{Q}{L}_{ig}\left({\theta }_{q}|\varvec{X}\right){A}_{g}\left({\theta }_{q}\right)}$$

(6)

where ${x}_{ijg}$ is the observed response from examinee i of group g for item j, ${A}_{g}\left(\theta \right)$ is the normalized group weight for group g, and ${L}_{ig}\left(\theta |\varvec{X}\right)$ is the likelihood function for examinee i of group g. The likelihood function is defined as:

$${L}_{ig}\left(\theta |\varvec{X}\right)=\prod _{j}^{ }P\left({X}_{ijg}={x}_{ijg}|\theta \right)$$

(7)

where $P\left({X}_{ijg}={x}_{ijg}|\theta \right)$ is the category response probability for ${x}_{ijg}$. To compute the expected ICC probability in Eqs. 4 and 5, we used the item response probability functions defined in Eqs. 1 and 2. Note that in this study, we computed the RMSD quantities based on sample statistics following the PISA operational scaling procedure (OECD, 2019). Readers are referred to Köhler et al. (2020) for a detailed description of the population and sample RMSD statistics and the differences between them.

The DIF item for each group was determined by using an RMSD cutoff of 0.12. That is, if the RMSD value for an item-by-group is greater than or equal to 0.12, then the item is flagged as DIF. Although various RMSD statistics and their cutoffs have been suggested (Robitzsch & Lüdtke, 2020, 2022), and a fixed RMSD cutoff could be unreasonable (Köhler et al., 2020; Robitzsch, 2022), in the current study, we used the conventional RMSD cutoff of 0.12 to detect DIF because the RMSD of 0.12 is currently used in the PISA and PIAAC operational scaling procedure for cognitive domains (OECD, 2016, 2019; Yamamoto et al., 2013). To be consistent with the operational procedure in LSAs and to increase the generalizability of the study, it is important to use the same RMSD cutoff value to detect DIF. In addition, the validity of the RMSD cutoff of 0.12 has been empirically evaluated in terms of scale comparability, overall model-data fit, and group score reliability (Joo et al., 2021).
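The following sketch shows how Eqs. 4 and 5 and the 0.12 cutoff might be evaluated once the observed and expected ICCs are available on a grid. It assumes equally spaced nodes from −5 to 5 with normal-density weights standing in for the operational quadrature, and the helper names `md_rmsd` and `is_dif` are hypothetical.

```python
import math

def md_rmsd(p_obs, p_exp, weights):
    """Approximate Eqs. 4 and 5 on a discrete grid of theta nodes.

    weights holds the group density at each node; it is normalized here.
    """
    w_sum = sum(weights)
    md = sum(w * (o - e) for o, e, w in zip(p_obs, p_exp, weights)) / w_sum
    msd = sum(w * (o - e) ** 2 for o, e, w in zip(p_obs, p_exp, weights)) / w_sum
    return md, math.sqrt(msd)

def is_dif(rmsd, cutoff=0.12):
    """Flag an item-by-group combination when RMSD >= cutoff."""
    return rmsd >= cutoff

def icc(theta, a, b, D=1.7):
    """2PLM curve from Eq. 1, used here to build illustrative inputs."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Illustrative inputs: 41 equally spaced nodes and N(0, 1) density weights
nodes = [-5 + 0.25 * q for q in range(41)]
weights = [math.exp(-0.5 * t * t) for t in nodes]
p_exp = [icc(t, 1.0, 0.0) for t in nodes]  # curve implied by common parameters
p_obs = [icc(t, 1.0, 0.5) for t in nodes]  # hypothetical group: item is harder
md, rmsd = md_rmsd(p_obs, p_exp, weights)
print(round(md, 3), round(rmsd, 3), is_dif(rmsd))
```

In this toy case the MD is negative because the hypothetical group answers correctly less often than the common parameters predict, which is the sign information later used to form partially unique parameters.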

To adjust the DIF in the multigroup IRT model, we re-estimated the unique item parameters in the subsequent scaling procedure. Specifically, item parameters detected as DIF were re-estimated for the DIF detected groups (i.e., DIF groups). In addition, we considered partially unique item parameters for the DIF groups. That is, if DIF items have the same direction of MD (positive or negative) for the DIF groups, the same unique item parameters were estimated across the DIF groups for the DIF item. Note that the partially unique item parameters approach has several advantages in the context of LSAs in that it could increase the scale comparability across the DIF groups and still hold partial invariance (Byrne et al., 1989). The DIF adjustment in the multigroup IRT model was iteratively conducted and continued until no DIF item-by-groups were detected.
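The grouping rule for partially unique parameters can be sketched for a single item as below. This is an illustration under stated assumptions: the function name and group labels are hypothetical, an MD of exactly zero is arbitrarily assigned to the negative set, and the re-estimation step itself (which requires an IRT calibration) is not shown.

```python
def partial_unique_sets(md_by_group, rmsd_by_group, cutoff=0.12):
    """Split the groups for one item into parameter-sharing sets.

    Groups below the cutoff keep the common (international) parameters.
    Flagged groups share one set of unique parameters per MD direction.
    """
    sets = {"common": [], "unique_pos": [], "unique_neg": []}
    for group, rmsd in rmsd_by_group.items():
        if rmsd < cutoff:
            sets["common"].append(group)
        elif md_by_group[group] > 0:
            sets["unique_pos"].append(group)
        else:
            # Assumption: MD == 0 is placed with the negative direction
            sets["unique_neg"].append(group)
    return sets

# Hypothetical fit statistics for one item in four groups
md = {"A": 0.05, "B": -0.08, "C": 0.01, "D": 0.09}
rmsd = {"A": 0.15, "B": 0.20, "C": 0.03, "D": 0.12}
print(partial_unique_sets(md, rmsd))
```

In an iterative adjustment, a routine like this would be re-applied after each recalibration until no item-by-group combination reaches the cutoff.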

Group score difference and Jackknife sampling

To investigate the impact of DIF on group score estimates, we compared the group scores with and without DIF adjustment. Specifically, we separately computed the rescaled PISA group (i.e., country-by-language group) scores from the initial IRT scaling, where no adjustment for misfit was made, denoted as ${\widehat{\mu }}_{g0}$ for g = 1, …, G, and the rescaled PISA group scores from the final IRT scaling, where adjustments for misfit took place, denoted as ${\widehat{\mu }}_{gF}$. The rescaled PISA group score difference $d\left({\widehat{\mu }}_{g}\right)$ was then computed as:

$$d\left({\widehat{\mu }}_{g}\right)={\widehat{\mu }}_{gF}-{\widehat{\mu }}_{g0}$$

(8)

We directly computed the group mean differences from the full invariance and partial invariance models. However, Robitzsch & Lüdtke (2022) recently proposed the adjusted and weighted group mean estimates and their statistical inference in the partial invariance approach. To estimate the standard error of the group mean difference estimate, we incorporated the Jackknife sampling method (Efron & Tibshirani, 1994). Using the Jackknife sampling approach, the sampling distribution of the group mean difference estimate can be formed and used to estimate the standard error of the group mean difference estimate. We first stratified the PISA item response data by respondents. More specifically, for each country-by-language group and senate weight, we stratified the samples, computed the statistic, and created the sampling distribution.

To describe the Jackknife sampling procedure more formally, suppose a stratum is denoted as ${\varvec{X}}_{s}$, for s = 1, …, S. We first removed the stratum from the total PISA data and denoted the remaining data as ${\varvec{X}}_{-s}$. Note that the stratum was created and removed for each country-by-language group and senate weight separately, and the results were then combined to obtain ${\varvec{X}}_{-s}$. Once ${\varvec{X}}_{-s}$ was constructed, we applied the multigroup IRT scaling described in Eqs. 1 and 2 and computed the group score difference statistic using Eq. 8, denoted as ${d}_{-s}\left({\widehat{\mu }}_{g}\right)$. This procedure was repeated for each of the S strata. To summarize and report the result, we computed the following statistics:

$$\widehat{d}\left({\widehat{\mu }}_{g}\right)=\frac{1}{S}\sum _{s=1}^{S}{d}_{-s}\left({\widehat{\mu }}_{g}\right)$$

(9)

$$SE\left[\widehat{d}\left({\widehat{\mu }}_{g}\right)\right]=\sqrt{\frac{S}{S-1}\sum _{s=1}^{S}{\left[{d}_{-s}\left({\widehat{\mu }}_{g}\right)-\widehat{d}\left({\widehat{\mu }}_{g}\right)\right]}^{2}}$$

(10)
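A small sketch of Eqs. 9 and 10 follows, taking the S leave-one-stratum-out differences as given. The function name and the `factor` argument are hypothetical. Eq. 10 as printed uses the factor S/(S − 1); the textbook delete-one jackknife variance uses (S − 1)/S, so the sketch exposes both options rather than asserting which one was applied in the analysis.

```python
import math

def jackknife_summary(d_minus_s, factor="as_printed"):
    """Mean (Eq. 9) and standard error (Eq. 10) of jackknife replicates.

    d_minus_s: list of the S statistics computed with stratum s removed.
    factor: "as_printed" uses S / (S - 1) as in Eq. 10;
            "textbook" uses (S - 1) / S (delete-one jackknife).
    """
    S = len(d_minus_s)
    d_hat = sum(d_minus_s) / S
    sum_sq = sum((d - d_hat) ** 2 for d in d_minus_s)
    scale = S / (S - 1) if factor == "as_printed" else (S - 1) / S
    return d_hat, math.sqrt(scale * sum_sq)

# Made-up replicate differences on the PISA scale
replicates = [1.0, 2.0, 3.0]
print(jackknife_summary(replicates))
print(jackknife_summary(replicates, factor="textbook"))
```

The two factors differ by a ratio of (S/(S − 1))², so the choice has little effect when S is large but matters for a small number of strata.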

Note that the group score difference statistic was separately computed for each cognitive domain and each assessment type (CBA and PBA). To explore the results consistent with the PISA score reporting, we aggregated the results to the country-level groups, denoted as $\widehat{d}\left({\widehat{\mu }}_{c}\right)$, and $SE\left[\widehat{d}\left({\widehat{\mu }}_{c}\right)\right]$, c = 1, …, C by using senate weights:

$$\widehat{d}\left({\widehat{\mu }}_{c}\right)=\sum _{l=1}^{{L}_{c}}\frac{{w}_{l\left(c\right)}}{\sum _{m=1}^{{L}_{c}}{w}_{m}}\widehat{d}\left({\widehat{\mu }}_{l\left(c\right)}\right)$$

(11)

$$SE\left[\widehat{d}\left({\widehat{\mu }}_{c}\right)\right]=\sum _{l=1}^{{L}_{c}}\frac{{w}_{l\left(c\right)}}{\sum _{m=1}^{{L}_{c}}{w}_{m}}SE\left[\widehat{d}\left({\widehat{\mu }}_{l\left(c\right)}\right)\right]$$

(12)

where ${\widehat{\mu }}_{l\left(c\right)}$ is the group score and ${w}_{l\left(c\right)}$ is the senate weight of the $l$th language group in the $c$th country, and ${L}_{c}$ indicates the total number of language groups in the $c$th country.
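Eqs. 11 and 12 apply the same weighted average to the differences and to their standard errors, which can be sketched with one hypothetical helper. The numbers below are made up for a country with two language groups.

```python
def aggregate_to_country(values, senate_weights):
    """Weighted average over the language groups of one country.

    Used for Eq. 11 (score differences) and Eq. 12 (standard errors),
    with weights normalized to sum to one within the country.
    """
    total = sum(senate_weights)
    return sum(w / total * v for v, w in zip(values, senate_weights))

# Hypothetical country with two language groups
diffs = [2.0, 4.0]        # d-hat for each language group
ses = [1.0, 2.0]          # SE for each language group
weights = [1250.0, 3750.0]
print(aggregate_to_country(diffs, weights), aggregate_to_country(ses, weights))
```

For a country with a single language group, the weight normalizes to one and the country-level values equal the group-level values.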

Study 1 Results

Proportion of DIF items

Figure 1 shows the proportion of DIF items for each country across PISA scores. The results were summarized by cognitive domain and assessment type. As shown in Fig. 1, the proportion of DIF items was generally higher for high- or low-performing countries. In addition, the Reading domain had the highest proportion of DIF items, followed by Science and Mathematics. For the Reading domain, the proportion of DIF items ranged from 7 to 33% with a mean of 14% for CBA countries. The corresponding values ranged from 7 to 39% with a mean of 23% for PBA countries. For Mathematics, the proportion of DIF items ranged from 1 to 36% with a mean of 8% for CBA countries and from 1 to 42% with a mean of 14% for PBA countries. For Science, the proportion of DIF items ranged from 4 to 26% with a mean of 13% for CBA countries, and from 9 to 39% with a mean of 20% for PBA countries. Note that the proportions of DIF items are similar to the PISA 2018 scaling results provided in the technical report (OECD, 2019).

Country score difference

Figure 2 shows the country score difference from IRT scaling with and without DIF adjustment. The country score difference and their standard error estimates were obtained from the Jackknife sampling method. The left column of Fig. 2 shows the country score differences across the proportions of DIF items and the right column shows the standard error of the country score estimates.

As shown in Fig. 2 panels a, c, and e, the country score differences tended to increase as the proportion of DIF items increased. As expected, the country score differences were substantial for the Reading domain, given that the proportion of DIF items was relatively high. For Reading, the minimum and maximum differences were −10.88 points and 12.89 points, respectively, and the average difference was 1.36 points for CBA countries. For PBA countries, the minimum and maximum differences were −3.43 points and 3.90 points, and the average difference was 1.52 points. Similarly, for Mathematics, the minimum and maximum differences were −16.85 points and 5.48 points, and the average difference was −0.11 points for CBA countries. For PBA countries, the minimum and maximum differences were −16.25 points and 1.92 points, and the average difference was −3.20 points. Lastly, for Science, the minimum and maximum differences were −10.13 points and 5.33 points, and the average difference was 0.46 points for CBA countries. For PBA countries, the minimum and maximum differences were −7.67 points and 4.86 points, and the average difference was −0.41 points.

Because the positive and negative group score differences can cancel each other, we additionally explored the descriptive statistics for the absolute score differences. For CBA countries, the absolute score differences ranged from 0.01 to 12.89 with the average of 2.50 points for Reading, 0.04 to 16.85 with the average of 1.93 points for Mathematics, and 0.02 to 10.13 with the average of 1.84 points for Science. Similarly, for PBA countries, the absolute score differences ranged from 1.51 to 3.90 with the average of 2.62 points for Reading, 0.00 to 16.25 with the average of 4.13 points for Mathematics, and 0.67 to 7.67 with the average of 2.81 points for Science.

Finally, it was clearly shown that standard error estimates increased substantially as the proportion of DIF items increased. As shown in Fig. 2 (panels b, d, and f), the standard error estimates for all cognitive domains consistently increased. For CBA countries, the standard errors ranged from 0.44 to 4.47 with the average of 1.83 for Reading, 0.01 to 11.57 with the average of 1.91 for Mathematics, and 0.71 to 5.49 with the average of 1.79 for Science. Similarly, for PBA countries, the standard errors ranged from 0.69 to 8.35 with the average of 2.71 for Reading, 0.00 to 6.88 with the average of 2.51 for Mathematics, and 0.89 to 7.84 with the average of 3.58 for Science.

Study 2

Although we showed the impact of DIF items on group score estimates and their standard errors from the empirical PISA data, the extent to which DIF items cause the bias in the group score estimates and how the DIF adjustment addresses this bias is still unknown. Therefore, we conducted a simulation study in which data were generated with various levels of DIF in the context of LSAs.

Simulation design

For the simulation study design, we set the total number of items administered to each student to 40 and the number of responses to each item to 500; we fixed the total number of groups to 10. In addition, we considered dichotomous response data only, given that the majority of items in PISA are dichotomous. We varied four simulation conditions associated with how DIF items were generated:

1. Proportion of DIF items: 10%, 20%, and 40% of total items.
2. Type of DIF items: a parameter shift (nonuniform DIF) and b parameter shift (uniform DIF).
3. Size of DIF items: 0.3 (small) or 0.6 (large) for a parameter shift, and 0.5 (small) or 1.00 (large) for b parameter shift.
4. Direction of DIF items: positive and negative item parameter shift.
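The four factors above cross into 3 × 2 × 2 × 2 = 24 conditions. The following is a minimal sketch of how such a design grid could be enumerated; the variable and function names are illustrative and not taken from the study, and the large a parameter shift is set to 0.6 as in the list (the surrounding prose also mentions 0.5), which is an assumption of this sketch.

```python
from itertools import product

N_ITEMS = 40  # items administered to each student

# Factor levels as given in the numbered list of simulation conditions.
PROPORTIONS = (0.10, 0.20, 0.40)
DIF_TYPES = ("nonuniform", "uniform")  # a parameter shift vs. b parameter shift
SIZES = ("small", "large")
DIRECTIONS = (+1, -1)                  # positive or negative parameter shift

# Shift magnitudes per (type, size). Assumption: 0.6 for the large
# a parameter shift, following the list rather than the later prose.
SHIFT = {
    ("nonuniform", "small"): 0.3,
    ("nonuniform", "large"): 0.6,
    ("uniform", "small"): 0.5,
    ("uniform", "large"): 1.0,
}


def build_conditions():
    """Return one dict per fully crossed simulation condition."""
    conditions = []
    for prop, dif_type, size, sign in product(
        PROPORTIONS, DIF_TYPES, SIZES, DIRECTIONS
    ):
        conditions.append(
            {
                "proportion": prop,
                "n_dif_items": round(prop * N_ITEMS),
                "type": dif_type,
                "size": size,
                "shift": sign * SHIFT[(dif_type, size)],
            }
        )
    return conditions


conditions = build_conditions()
print(len(conditions))  # 24 crossed conditions
```

Each dictionary carries the signed shift, so a downstream data generator only needs the item count, the DIF type, and the shift value.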

The proportions of DIF items we considered in the simulation conditions were consistent with the PISA 2018 cognitive domain data. As shown in Figs. 1 and 2, the proportion of DIF items ranged from approximately 5% to 40% across countries. In addition, we considered two types of DIF items in the study: uniform and nonuniform DIF. Previous studies have shown that different types of DIF items affect Type I error and DIF detection rates (e.g., Buchholz & Hartig, 2019; Stark et al., 2006), and we expected that uniform DIF would have a larger impact on group score estimates than nonuniform DIF in the multigroup IRT model. We also considered an a parameter shift of 0.3 and a b parameter shift of 0.5 as small DIF size and an a parameter shift of 0.5 and a b parameter shift of 1.0 as large. To understand the range of DIF size more explicitly, we investigated the distribution of item parameter differences using the PISA 2018 data. Figure 3 shows the distributions of discrimination (a) and difficulty (b) parameter differences for DIF and nonDIF items across countries. As expected, the Reading domain has the highest difference in item parameters, followed by Science and Mathematics. The discrimination parameter difference ranged approximately from −1 to 1 and the difficulty parameter difference ranged from −2 to 2. The distribution for the discrimination parameter difference also showed a unimodal shape, whereas the corresponding distribution for the difficulty parameter showed a bimodal distribution. Based on the item parameter difference distribution, we chose the DIF size values for the simulation study.

Data generation

The true item parameters were randomly drawn from uniform distributions with the ranges commonly observed in general LSAs. Discrimination parameters were drawn from U(0.75, 2.25), and difficulty parameters were drawn from U(–2.00, 2.00). Note that the item parameters were redrawn for each replication so that the results would not depend on one particular set of item parameters. To generate simulees for each group, we randomly generated latent trait parameters from N(${\mu }_{g}$, 1), where ${\mu }_{g}$ is the true group mean parameter for group g. The true group mean parameters ${\mu }_{g}$ were also randomly generated from U(–2, 2), which is in the range of commonly observed group scores in LSAs.
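The generation steps described above can be sketched as follows for a single replication without DIF. This is an illustrative sketch only: it assumes a 2PL response function on the logistic metric (no 1.7 scaling constant), reads "500 responses per item" as 500 simulees per group who each answer all 40 items, and uses hypothetical function names and an arbitrary seed.

```python
import math
import random

N_ITEMS, N_GROUPS = 40, 10
N_PER_GROUP = 500  # assumption: 500 simulees per group, all items answered


def draw_item_parameters(n_items, rng):
    """Draw true 2PL parameters: a ~ U(0.75, 2.25), b ~ U(-2, 2)."""
    a = [rng.uniform(0.75, 2.25) for _ in range(n_items)]
    b = [rng.uniform(-2.0, 2.0) for _ in range(n_items)]
    return a, b


def draw_group(n_simulees, rng):
    """Draw a true group mean mu_g ~ U(-2, 2) and abilities ~ N(mu_g, 1)."""
    mu = rng.uniform(-2.0, 2.0)
    thetas = [rng.gauss(mu, 1.0) for _ in range(n_simulees)]
    return mu, thetas


def p_correct(theta, a, b):
    """2PL probability of a correct response (logistic metric assumed)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def simulate_responses(thetas, a, b, rng):
    """Return a simulee-by-item matrix of 0/1 responses."""
    return [
        [int(rng.random() < p_correct(t, a_j, b_j)) for a_j, b_j in zip(a, b)]
        for t in thetas
    ]


rng = random.Random(2018)  # arbitrary seed for reproducibility
a, b = draw_item_parameters(N_ITEMS, rng)
groups = [draw_group(N_PER_GROUP, rng) for _ in range(N_GROUPS)]
data = [simulate_responses(thetas, a, b, rng) for _, thetas in groups]
```

In a full simulation this block would be repeated for every replication, with fresh item and group mean parameters each time.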

To generate DIF, we chose two groups (Group 2 and Group 3) as DIF groups. For the DIF groups, depending on the simulation conditions (i.e., proportion, type, size, and direction of DIF items), we created item parameters that differ from the true item parameters (i.e., DIF item parameters). For example, for the condition with 40% DIF items and uniform, large, positive DIF, we randomly selected 16 out of the 40 items and added 1 to their true b parameters. Similarly, for the condition with 20% DIF items and nonuniform, small, negative DIF, we randomly selected 8 out of the 40 items and subtracted 0.3 from their true a parameters. We then generated the item response data using the DIF item parameters along with the DIF group simulees. For the DIF-free (nonDIF) groups, we used the nonDIF group-specific item parameters along with the group simulees to generate the item response data. Finally, the item response data for both DIF and nonDIF groups were combined to create the total dataset.
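The parameter shifts in the two examples above can be sketched as a small helper. The function name and its signature are hypothetical, and the sketch assumes that one random set of DIF items is drawn per call; whether the two DIF groups shared the same DIF items is not stated in the text.

```python
import random


def make_dif_parameters(a, b, n_dif_items, dif_type, shift, rng):
    """Return copies of (a, b) with DIF applied to randomly chosen items.

    dif_type "uniform" shifts b; "nonuniform" shifts a. `shift` is signed,
    so a negative value produces negative-direction DIF.
    """
    dif_items = sorted(rng.sample(range(len(a)), n_dif_items))
    a_dif, b_dif = list(a), list(b)
    for j in dif_items:
        if dif_type == "uniform":
            b_dif[j] += shift
        elif dif_type == "nonuniform":
            a_dif[j] += shift
        else:
            raise ValueError(f"unknown DIF type: {dif_type!r}")
    return a_dif, b_dif, dif_items


rng = random.Random(7)  # arbitrary seed
a = [rng.uniform(0.75, 2.25) for _ in range(40)]
b = [rng.uniform(-2.0, 2.0) for _ in range(40)]

# 40% of items, uniform, large, positive: 16 items with b + 1.0
a_u, b_u, items_u = make_dif_parameters(a, b, 16, "uniform", 1.0, rng)

# 20% of items, nonuniform, small, negative: 8 items with a - 0.3
a_n, b_n, items_n = make_dif_parameters(a, b, 8, "nonuniform", -0.3, rng)
```

Because the smallest true discrimination is 0.75, even the largest negative a shift considered in the design leaves every discrimination parameter positive.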

Analysis

The generated item responses were analyzed with two multigroup IRT models. Specifically, we first fitted a multigroup model with item parameter equality constraints across groups (denoted as the DIF unadjusted model). We then identified the DIF items using RMSD and re-estimated the multigroup model with unique item parameters for those items (denoted as the DIF adjusted model). The group mean and standard deviation estimates were obtained separately from each model. The group mean estimates were then rescaled by multiplying by 100 and adding 500 to resemble the PISA scale scores. Similarly, the group standard deviation estimates were rescaled by multiplying by 100. To evaluate the accuracy of the mean and standard deviation estimates, we computed bias and root mean squared error (RMSE) as follows:

$${Bias}_{g}=\frac{\sum _{r=1}^{R}\left({\widehat{\delta }}_{g\left(r\right)}-{\delta }_{g\left(r\right)}\right)}{R}$$

(13)

$${RMSE}_{g}=\sqrt{\frac{\sum _{r=1}^{R}{\left({\widehat{\delta }}_{g\left(r\right)}-{\delta }_{g\left(r\right)}\right)}^{2}}{R}}$$

(14)

where ${\widehat{\delta }}_{g\left(r\right)}$ is the estimated group mean or group standard deviation of the $g$th group at the $r$th replication, ${\delta }_{g\left(r\right)}$ is the true group mean or group standard deviation parameter for the $g$th group at the $r$th replication, and R is the total number of replications. For the current study, we set the total number of replications to 100. The bias and RMSE were computed for each group and separately averaged for the DIF groups and the nonDIF groups.
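As a concrete illustration of Eqs. 13 and 14 together with the rescaling to the PISA metric, consider the following sketch. The function names and the three toy replications are hypothetical and serve only to show the arithmetic; the study itself used 100 replications.

```python
import math


def to_pisa_mean(mu):
    """Rescale a group mean from the theta metric to the PISA-like metric."""
    return 100.0 * mu + 500.0


def bias(estimates, truths):
    """Eq. 13: mean signed error over replications for one group."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(estimates)


def rmse(estimates, truths):
    """Eq. 14: root mean squared error over replications for one group."""
    squared = sum((e - t) ** 2 for e, t in zip(estimates, truths))
    return math.sqrt(squared / len(estimates))


# Toy values for one group over three replications (theta metric).
true_means = [0.0, 0.5, -0.5]
estimated_means = [-0.10, 0.45, -0.55]

truths = [to_pisa_mean(m) for m in true_means]
estimates = [to_pisa_mean(m) for m in estimated_means]

print(round(bias(estimates, truths), 2))  # about -6.67 scale points
print(round(rmse(estimates, truths), 2))  # about 7.07 scale points
```

The errors on the rescaled metric are −10, −5, and −5 points, so the bias is −20/3 and the RMSE is the square root of 50. For standard deviations the same functions apply after multiplying by 100 without adding 500.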

Study 2 Results

Table 1 shows bias and RMSE results for the group mean estimates from the DIF unadjusted and adjusted models across simulation conditions. Overall, the group mean bias was more substantial for the DIF groups than for the nonDIF groups. More importantly, the DIF adjustment considerably reduced the bias and RMSE for the DIF groups. For the nonDIF groups, the bias ranged from −0.82 to 1.31 across the simulation conditions, and the average biases from the DIF unadjusted and adjusted models were −0.16 and 0.13, respectively, for positive DIF and 0.14 and 0.01 for negative DIF. The bias for the nonDIF groups can be considered minimal based on the criteria provided by Hoogland & Boomsma (1998). For the DIF groups, the bias from the DIF unadjusted model ranged from −38.13 to 39.34, and the average bias was −8.35 for positive DIF and 8.43 for negative DIF. However, using the DIF adjusted model reduced the bias by 50% on average. The bias from the DIF adjusted model ranged from −17.64 to 17.40, and the average bias was −3.61 for positive DIF and 3.52 for negative DIF. The RMSE results showed a similar pattern. The RMSE of the nonDIF groups was consistent across the simulation conditions, whereas the corresponding value of the DIF groups ranged from 7.19 to 41.14 for the DIF unadjusted model. Similarly, using the DIF adjusted model reduced the RMSE by 50%, with values ranging from 6.79 to 24.54 across the simulation conditions.

Table 1 Bias and RMSE of Group Mean Estimates for DIF and nonDIF groups across the multigroup IRT models

As the size and proportion of DIF increased, the bias and RMSE increased substantially, as expected. It is worthwhile to note that the bias of the group mean estimates was more evident when DIF was created by shifting the b parameter (uniform DIF) than the a parameter (nonuniform DIF). When nonuniform DIF was considered, the highest bias was 1.93 across simulation conditions and fitted models. The bias and RMSE from nonuniform DIF were comparable to the results from the nonDIF groups. In addition, the direction of DIF also affected the direction of bias for the group mean estimates. For the positive DIF conditions, the direction of bias was negative, indicating that the group mean estimates were underestimated. For the negative DIF conditions, the direction of bias was positive, indicating that the group mean estimates were overestimated.

Table 2 shows the bias and RMSE of the group standard deviation estimates for the nonDIF and DIF groups. The overall pattern of the results was similar to the group mean estimate results. For the nonDIF groups, the bias ranged from −1.48 to 0.20 and the average bias was −0.85 for positive DIF and −0.43 for negative DIF. In contrast, for the DIF groups, the bias was nonignorable, ranging from −10.68 to 2.20 using the DIF unadjusted model and −8.19 to 2.80 using the DIF adjusted model. The average biases of the DIF unadjusted and adjusted models were −1.26 and −0.94 for positive DIF and −3.92 and −2.99 for negative DIF. Overall, the bias showed negative values across the simulation conditions, indicating that the standard deviation estimates were underestimated regardless of the direction of DIF items. In addition, the bias was more substantial for the negative DIF than the positive DIF.

Table 2 Bias and RMSE of Group Standard Deviation Estimates for DIF and nonDIF groups across the multigroup IRT models

Discussion and Conclusion

In this study, we examined the impact of DIF items on group scores in the context of LSAs. Although much literature has previously discussed the benefits of the IRT calibration method for addressing DIF items in the multigroup structure (e.g., Oliveri & von Davier, 2011, 2014; Rutkowski & Svetina, 2014; Rutkowski et al., 2016; von Davier et al., 2019), the degree to which the DIF adjustment affects the accuracy and precision of group performance estimates had not yet been empirically shown. To fill this gap, we conducted two studies. In the first study, we empirically showed the impact of the DIF adjustment on country score estimates using PISA 2018 main survey data. To precisely examine the effects of DIF adjustment, we used Jackknife sampling to estimate the country score differences and their standard errors. In the Jackknife sampling approach, we incorporated the DIF adjustment within the multigroup IRT scaling process for group comparisons. The multigroup IRT model with the DIF adjustment takes the uncertainty of items across countries and assessment cycles into account by simultaneously estimating international item parameters and fixing trend item parameters from previous assessment cycles (OECD, 2016, 2019; von Davier et al., 2019). This approach is comparable to the linking method using trend items in the presence of DIF (Robitzsch, 2021; Robitzsch & Lüdtke, 2019). In the second study, we conducted a simulation study to explore the consequence of DIF items and their adjustment on the group mean estimates directly obtained from the multigroup IRT models.

Based on the first study results, we found that the DIF items have a nonnegligible impact on the country scores and their standard error estimates for PISA 2018 cognitive domains. As the proportion of DIF items increased, the difference between the country score estimates obtained with and without the DIF adjustment increased considerably. The highest country score difference was −16.85 points on the PISA scale, observed in the Mathematics domain when the proportion of DIF items for the country was nearly 40%. Across countries, the Reading domain showed the largest score differences followed by Science and Mathematics, given that the proportion of DIF items was largest for Reading. In addition, we found that the standard error of country score differences increased as the proportion of DIF items increased, implying that the country score reliabilities can also be affected by DIF items. Consistent with the country score difference results, the standard error was highest for Reading followed by Mathematics and Science. Given that the proportion of DIF items per country in PISA 2018 data was as high as 40%, the results from the PISA data analysis provide empirical evidence that the DIF adjustment affects the country scores.

In the second study, we computed bias and RMSE of the group mean estimates from the two multigroup IRT models: the model with the constrained item parameters (DIF unadjusted model) and the model with the unique item parameters for the detected DIF items (DIF adjusted model). The data were generated by varying the proportion, size, type, and direction of DIF items, and we obtained the group mean estimates directly from the DIF adjusted and unadjusted models. We first found that the group mean estimates were underestimated when uniform DIF items were generated with a positive direction and overestimated when uniform DIF items were generated with a negative direction. The group mean bias increased as the proportion and size of DIF items increased, and the bias was as high as 39.34 points on the PISA scale when 40% of the items contained large DIF. More importantly, we also found that the DIF adjusted model reduced the bias of group mean estimates across the simulation conditions, by 50% on average. However, the DIF adjusted model still produced bias: −3.61 points on average for positive DIF and 3.52 points for negative DIF on the PISA scale. These results indicate that DIF items can yield biased group mean estimates, and that the DIF adjustment can be implemented in the IRT scaling procedure to obtain more valid group mean estimates in the context of LSAs. In addition, the direction of DIF items should be carefully monitored to avoid possible under- or overestimation of the group mean estimates from the multigroup IRT model.

Interestingly, we also found that the group mean bias was mainly evident with the uniform DIF items, and the nonuniform DIF items had a minimal impact. This finding was consistent across the simulation conditions. This result implies that the group mean estimates are mainly affected by the uniform DIF items, and in operational settings, uniform DIF should be more explicitly investigated than nonuniform DIF. This result is also in line with previous DIF studies in the context of LSAs, where uniform DIF is generally detected with higher power than nonuniform DIF (e.g., Buchholz & Hartig, 2019). Based on this finding, we recommend that researchers and practitioners investigate DIF items more precisely by plotting ICCs using common and group-specific item parameters.
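To illustrate what such an ICC comparison can reveal, the sketch below evaluates 2PL curves under common and group-specific parameters on an ability grid. The parameter values, the grid, and the function names are hypothetical, and the unweighted mean gap is only an illustrative summary, not the RMSD statistic used in the study.

```python
import math


def icc(theta, a, b):
    """2PL item characteristic curve (logistic metric assumed)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def icc_gap(a_common, b_common, a_group, b_group, grid):
    """Signed gap (common minus group-specific ICC) at each grid point."""
    return [
        icc(t, a_common, b_common) - icc(t, a_group, b_group) for t in grid
    ]


grid = [x / 10.0 for x in range(-40, 41)]  # theta from -4 to 4

# Uniform DIF: group-specific difficulty shifted by +1.0.
uniform_gap = icc_gap(1.2, 0.0, 1.2, 1.0, grid)

# Nonuniform DIF: group-specific discrimination shifted by +0.6.
nonuniform_gap = icc_gap(1.2, 0.0, 1.8, 0.0, grid)

for label, gap in (("uniform", uniform_gap), ("nonuniform", nonuniform_gap)):
    mean_gap = sum(gap) / len(gap)
    max_abs = max(abs(g) for g in gap)
    print(f"{label:>10}: mean signed gap = {mean_gap:+.3f}, "
          f"max |gap| = {max_abs:.3f}")
```

With a difficulty shift the two curves never cross, so the gap has the same sign at every ability level. With a discrimination shift the curves cross at the item difficulty and the signed gaps cancel over this symmetric grid, which is one intuitive reading of why nonuniform DIF moved the group means so little in the simulation.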

In addition, we found that the group standard deviation estimates were also biased by the DIF items from the simulation study. Although the DIF adjustment somewhat addressed the bias, the group standard deviation estimates were underestimated overall, mainly with the negative direction DIF items, and the bias was as high as −8.19 points on the PISA scale. The corresponding RMSE value was 4.73. This finding has an important implication in the context of educational research. For example, if educational researchers and practitioners are interested in meta-analyzing student performance, it is common to obtain standardized effect sizes by using standard deviation estimates to make a valid cross-country performance comparison. Moreover, statistical inferences, such as interval estimates and hypothesis testing for country scores, also heavily rely on valid standard deviation estimates. To obtain accurate standardized effect sizes and make a valid statistical inference, DIF items should be properly revised.
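A small numerical sketch shows how an underestimated standard deviation inflates a standardized effect size. It assumes a true standard deviation of 100 on the rescaled metric and a hypothetical 20-point gap between two country means; only the −8.19 bias value comes from the simulation results.

```python
def standardized_difference(mean_a, mean_b, sd):
    """Standardized mean difference using a single standard deviation."""
    return (mean_a - mean_b) / sd


TRUE_SD = 100.0   # assumption: theta SD of 1 rescaled by 100
SD_BIAS = -8.19   # largest SD bias reported for the DIF adjusted model

# Hypothetical country means 20 scale points apart.
d_true = standardized_difference(520.0, 500.0, TRUE_SD)
d_biased = standardized_difference(520.0, 500.0, TRUE_SD + SD_BIAS)

print(round(d_true, 4))             # 0.2
print(round(d_biased, 4))           # 0.2178
print(round(d_biased / d_true, 3))  # 1.089
```

Under these assumptions the effect size is inflated by roughly 9%, even though the mean difference itself is unchanged.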

However, the simulation study results should be interpreted with caution. In operational assessments such as LSAs, group score comparability is the main interest, and estimating unique item parameters for DIF items fundamentally decreases the comparability of the scale because the number of international item parameters decreases (note that comparability is defined as the proportion of international item parameters in this study). From the psychometric perspective, it is critical to maintain high comparability of the scales across groups and obtain comparable group scores. Although we showed in the simulation study that DIF items can cause nonignorable bias in the group mean and standard deviation estimates, increasing the number of unique item parameters to address DIF items reduces the comparability of the scales and increases the model complexity. To obtain comparable group scores in LSAs, it is important to primarily consider a high level of scale comparability and measurement invariance across groups (Rutkowski & Svetina, 2014). We also emphasize that the statistical decision on the DIF adjustment does not always relate to construct-irrelevancy (Robitzsch & Lüdtke, 2020). As previous DIF studies discussed, DIF detection and adjustment should depend on both statistical decisions and reviews from item experts and developers (Penfield & Camilli, 2007).

Finally, we acknowledge the limitations of the studies. Specifically, the simulation study investigated only a limited set of data generation conditions. For example, the number of groups in the simulation was fixed at ten, and the number of items administered to students was fixed at 40. Although these numbers of groups and items are commonly observed in typical LSAs, more data generation conditions should be explored to increase the generalizability of the results. Increasing the number of groups and items could affect the group mean and standard deviation estimates from the multigroup IRT model, and a future study is needed to examine the impact. In addition, in the simulation study, we only considered a dichotomous item response model (e.g., 2PLM) to generate the data. Given that LSAs often include mixed-format tests, it is important to consider both dichotomous and polytomous item response data and investigate the impact of the DIF items on the group score estimates. Furthermore, the DIF detection method using RMSD assumes that the functional form of the fitted model adequately describes the data. In our empirical investigation, we used 2PL and GPCM for the dichotomous and polytomous responses in accordance with the PISA and PIAAC operational procedures. Future research should investigate the impact of DIF in LSAs using nonparametric DIF detection methods such as logistic regression. Finally, we did not include the BIB design in the data generation procedure. The BIB design is commonly used in LSAs to cover a wide range of content and obtain reliable group score estimates. A future study should include the BIB design in the data generation conditions and investigate the impact of the BIB design along with the DIF items on group score estimates.

The results from the two studies provide important evidence that the DIF adjustment in IRT scaling is important and effective in addressing possible bias in group score reporting. We believe that the study contributes to the measurement literature in general and specifically to large-scale group-score assessments, providing information about DIF items and their consequences. Additionally, the study provides guidance for researchers and practitioners on how to properly address DIF item issues in the context of LSAs. The study results could also help lead to the development of new methods or modeling frameworks that consider the magnitude of misfit and consequently improve the current operational work in national and international LSAs.

Data Availability

The datasets analyzed during the current study are available in the OECD PISA-data repository https://www.oecd.org/pisa/data/2018database.

Change history

22 December 2022
A Correction to this paper has been published: https://doi.org/10.1186/s40536-022-00149-1

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723
Birnbaum, A. (1968). On the estimation of mental ability (Series Report No. 15). USAF School of Aviation Medicine
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459
Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In W. J. van der Linden, & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 433–448). New York, NY: Springer
Buchholz, J., & Hartig, J. (2019). Comparing attitudes across groups: An IRT-based item-fit statistic for the analysis of measurement invariance. Applied Psychological Measurement, 43, 241–250
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–466
Cosgrove, J., & Cartwright, F. (2014). Changes in achievement on PISA: the case of Ireland and implications for international assessment practice. Large-scale Assessments in Education, 2, 1–17
De Jong, M. G., Steenkamp, J. B. E., & Fox, J. P. (2007). Relaxing measurement invariance in cross-national consumer research using a hierarchical IRT model. Journal of Consumer Research, 34, 260–278
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. Boca Raton, FL: Chapman & Hall
Ercikan, K. (2002). Disentangling sources of differential item functioning in multilanguage assessments. International Journal of Testing, 2, 199–215
Ercikan, K., & Koh, K. (2005). Examining the construct comparability of the English and French versions of TIMSS. International Journal of Testing, 5, 23–35
Fox, J. P., & Verhagen, J. (2018). Random item effects modeling for cross-national survey data. In E. Davidov, P. Schmidt, & J. Billiet (Eds.), Cross-cultural analysis: Methods and applications (pp. 529–550). London: Routledge
Gierl, M. J., & Khaliq, S. N. (2001). Identifying sources of differential item and bundle functioning on translated achievement tests: A confirmatory analysis. Journal of Educational Measurement, 38, 164–187
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer, & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum Associates
Hoogland, J. J., & Boomsma, A. (1998). Robustness studies in covariance structure modeling: An overview and a meta-analysis. Sociological Methods & Research, 26, 329–367
Joo, S., Khorramdel, L., Yamamoto, K., Shin, H. J., & Robin, F. (2021). Evaluating item fit statistic thresholds in PISA: Analysis of cross-country comparability of cognitive items. Educational Measurement: Issues and Practice, 40, 37–48
Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking. New York, NY: Springer
Köhler, C., Robitzsch, A., & Hartig, J. (2020). A bias-corrected RMSD item fit statistic: An evaluation and comparison to alternatives. Journal of Educational and Behavioral Statistics, 45, 251–273
König, C., Khorramdel, L., Yamamoto, K., & Frey, A. (2021). The benefits of fixed item parameter calibration for parameter accuracy in small sample situations in large-scale assessments. Educational Measurement: Issues and Practice, 40, 17–27
Kreiner, S., & Christensen, K. B. (2014). Analyses of model fit and robustness: A new look at the PISA scaling model underlying ranking of countries according to reading literacy. Psychometrika, 79, 210–231
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum
Mazzeo, J., & von Davier, M. (2014). Linking scales in international large-scale assessments. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large scale assessment (pp. 229–257). Boca Raton, FL: CRC Press
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543
Mislevy, R. J. (1984). Estimating latent distributions. Psychometrika, 49, 359–381
Mislevy, R. J., Johnson, E. G., & Muraki, E. (1992). Chapter 3: Scaling procedures in NAEP. Journal of Educational Statistics, 17, 131–154
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176
Neumann, K., Fischer, H. E., & Kauertz, A. (2010). From PISA to educational standards: The impact of large-scale assessments on science education in Germany. International Journal of Science and Mathematics Education, 8, 545–563
Organization for Economic Co-Operation and Development (2016). PISA 2015 Technical Report. http://www.oecd.org/pisa/data/2015-technical-report
Organization for Economic Co-Operation and Development (2019). PISA 2018 Technical Report. http://www.oecd.org/pisa/data/2018-technical-report
Oliveri, M. E., & von Davier, M. (2011). Investigation of model fit and score scale comparability in international assessments. Psychological Test and Assessment Modeling, 53, 315–333
Oliveri, M. E., & von Davier, M. (2014). Toward increasing fairness in score scale calibrations employed in international large-scale assessments. International Journal of Testing, 14, 1–21
Robitzsch, A. (2020). Lp loss functions in invariance alignment and Haberman linking with few or many groups. Stats, 3, 246–283
Robitzsch, A. (2021). Robust and nonrobust linking of two groups for the Rasch model with balanced and unbalanced random DIF: A comparative simulation study and the simultaneous assessment of standard errors and linking errors with resampling techniques. Symmetry, 13, 2198
Robitzsch, A. (2022). Statistical properties of estimators of the RMSD item fit statistic. Foundations, 2, 488–503
Robitzsch, A., & Lüdtke, O. (2019). Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assessment in Education: Principles Policy & Practice, 26, 444–465
Robitzsch, A., & Lüdtke, O. (2020). A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psychological Test and Assessment Modeling, 62, 233–279
Robitzsch, A., & Lüdtke, O. (2022). Mean comparisons of many groups in the presence of DIF: An evaluation of linking and concurrent scaling approaches. Journal of Educational and Behavioral Statistics, 47, 36–68
Rutkowski, L., Gonzalez, E., Joncas, M., & von Davier, M. (2010). International large-scale assessment data: Issues in secondary analysis and reporting. Educational Researcher, 39, 142–151
Rutkowski, L., & Rutkowski, D. (2018). Improving the comparability and local usefulness of international assessments: A look back and a way forward. Scandinavian Journal of Educational Research, 62, 354–367
Rutkowski, D., Rutkowski, L., & Liaw, Y. L. (2018). Measuring widening proficiency differences in international assessments: Are current approaches enough? Educational Measurement: Issues and Practice, 37, 40–48
Rutkowski, L., Rutkowski, D., & Zhou, Y. (2016). Item calibration samples and the stability of achievement estimates and system rankings: Another look at the PISA model. International Journal of Testing, 16, 1–20
Rutkowski, L., & Svetina, D. (2014). Assessing the hypothesis of measurement invariance in the context of large-scale international surveys. Educational and Psychological Measurement, 74, 31–57
Rutkowski, L., & Svetina, D. (2017). Measurement invariance in international surveys: Categorical indicators and fit measure performance. Applied Measurement in Education, 30, 39–51
Sachse, K. A., Roppelt, A., & Haag, N. (2016). A comparison of linking methods for estimating national trends in international comparative large-scale assessments in the presence of cross-national DIF. Journal of Educational Measurement, 53, 152–171
Svetina, D., & Rutkowski, L. (2014). Detecting differential item functioning using generalized logistic regression in the context of large-scale assessments. Large-scale Assessments in Education, 2, 1–17
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91, 1292–1306
von Davier, M. (2005). mdltm: Software for the general diagnostic model and for estimating mixtures of multidimensional discrete latent traits models [Computer software]. Princeton, NJ: ETS
von Davier, M., Gonzalez, E., & Mislevy, R. (2009). What are plausible values and why are they useful? In M. von Davier, & D. Hastedt (Eds.), Issues and methodologies in large scale assessments (2 vol.). Hamburg, Germany: IEA-ETS Research Institute
von Davier, M., Yamamoto, K., Shin, H. J., Chen, H., Khorramdel, L., Weeks, J., Davis, S., Kong, N., & Kandathil, M. (2019). Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assessment in Education: Principles Policy & Practice, 26, 466–488
Wu, M. (2010). Measurement, sampling, and equating errors in large-scale assessments. Educational Measurement: Issues and Practice, 29, 15–27
Yamamoto, K., Khorramdel, L., & von Davier, M. (2013). Scaling PIAAC cognitive data. Technical report of the survey of adult skills (PIAAC), Paris, France: OECD
Zwitser, R. J., Glaser, S. S. F., & Maris, G. (2017). Monitoring countries in a changing world: A new look at DIF in international surveys. Psychometrika, 82, 210–232

Acknowledgements

The authors would like to thank Emily Kerzabi for her editorial help.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Author information

Authors and Affiliations

University of Kansas, Kansas, USA
Sean Joo
Educational Testing Service, New Jersey, USA
Usama Ali & Frederic Robin
Sogang University, Seoul, South Korea
Hyo Jeong Shin
South Valley University, Qena, Egypt
Usama Ali

Contributions

SJ conducted all analyses and prepared the initial draft of the manuscript. UA and FR provided the guidance concerning the analyses and read and approved the manuscript. HS commented on the manuscript.

Corresponding author

Correspondence to Sean Joo.

Ethics declarations

Ethics approval and consent to participate

This research was based on a desk-based systematic literature review and no ethics approval was required and no human subjects were involved in the research.

Consent for publication

The authors provide consent for publication of this paper in the journal.

Competing interests

The authors have no known competing interests to disclose.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Joo, S., Ali, U., Robin, F. et al. Impact of differential item functioning on group score reporting in the context of large-scale assessments. Large-scale Assess Educ 10, 18 (2022). https://doi.org/10.1186/s40536-022-00135-7

Received: 08 January 2022
Accepted: 26 September 2022
Published: 15 November 2022
DOI: https://doi.org/10.1186/s40536-022-00135-7
