Assessment of fit of item response theory models used in large-scale educational survey assessments
 Peter W. van Rijn^{1},
 Sandip Sinharay^{2},
 Shelby J. Haberman^{3} and
 Matthew S. Johnson^{4}
https://doi.org/10.1186/s40536-016-0025-3
© The Author(s) 2016
 Received: 24 September 2015
 Accepted: 7 June 2016
 Published: 8 July 2016
Abstract
Latent regression models are used for score-reporting purposes in large-scale educational survey assessments such as the National Assessment of Educational Progress (NAEP) and the Trends in International Mathematics and Science Study (TIMSS). One component of these models is based on item response theory. While there exists some research on assessment of fit of item response theory models in the context of large-scale assessments, there is scope for further research on the topic. We suggest two types of residuals to assess the fit of item response theory models in the context of large-scale assessments. The Type I error rates and power of the residuals are computed from simulated data. The residuals are computed using data from four NAEP assessments. Misfit was found for all data sets for both types of residuals, but the practical significance of the misfit was minimal.
Keywords
 Generalized residual
 Item fit
 Residual analysis
 Two-parameter logistic model
Introduction
Several large-scale educational survey assessments (LESAs) such as the United States’ National Assessment of Educational Progress (NAEP), the International Adult Literacy Study (IALS; Kirsch 2001), the Trends in International Mathematics and Science Study (TIMSS; Martin and Kelly 1996), and the Progress in International Reading Literacy Study (PIRLS; Mullis et al. 2003) involve the use of item response theory (IRT) models for score-reporting purposes (e.g., Beaton 1987; Mislevy et al. 1992; Von Davier and Sinharay 2014).
Standard 4.10 of the Standards for Educational and Psychological Testing (American Educational Research Association 2014) recommends obtaining evidence of model fit when an IRT model is used to make inferences from a data set. In addition, because of the importance of the LESAs in educational policy-making in the U.S. and abroad, it is essential to assess the fit of the IRT models used in these assessments. Although several researchers have examined the fit of the IRT models in the context of LESAs (for example, Beaton 2003; Dresher and Thind 2007; Sinharay et al. 2010), there is substantial scope for further research on the topic (e.g., Sinharay et al. 2010).
This paper suggests two types of residuals to assess the fit of IRT models used in LESAs. One of them can be used to assess item fit and the other can be used to assess other aspects of fit of these models. These residuals are computed for several simulated data sets and four operational NAEP data sets. The focus in the remainder of this paper will mostly be on NAEP.
The next section provides some background, describing the current NAEP IRT model and the existing NAEP IRT model-fit procedures. The Methods section describes our suggested residuals. The Data section describes data from four NAEP assessments. The Simulation section describes a simulation study that examined the Type I error rate and power of the suggested methods. The next section involves the application of the suggested residuals to the four NAEP data sets. The last section includes conclusions and suggestions for future research.
Background
The IRT model used in NAEP
Consider a NAEP assessment that was administered to students \(i, i=1, 2, \ldots , N\), with corresponding sampling weights \(W_i\). The sampling weight \(W_i\) represents the number of students in the population that student i represents (e.g., Allen et al. 2001, pp. 161–225). Denote the p-dimensional latent proficiency variable for student i by \(\varvec{\theta }_{i} =(\theta _{i1} ,\theta _{i2} ,\ldots, \theta _{ip} )'\). In NAEP assessments, p is between 1 and 5. Denote the vector of item scores for student i by \(\mathbf y_i =(\mathbf y'_{i1} ,\mathbf y'_{i2} ,\ldots ,\mathbf y'_{ip} )'\), where the subvector \(\mathbf y_{ik}\) contains the item scores \(y_{ijk}, j \in J_{ik}\), of student i on the items (indexed by j) corresponding to the kth dimension/subscale that were presented to student i. For example, \(\mathbf y_{ik}\) could be the scores of student i on the algebra questions presented to her on a mathematics test and \(\theta _{ik}\) could represent the student’s proficiency variable for algebra. Because of the use of matrix sampling (a design in which each student is presented only a subset of all available items) in NAEP assessments, the algebra questions administered to student i are a subset \(J_{ik}\) of the set \(J_k\) of all available algebra questions on the test. Each item score \(y_{ijk}\) is an integer between 0 and \(r_{jk}>0\), where \(r_{jk}=1\) for dichotomous items and \(r_{jk}\) is an integer larger than 1 for polytomous items with three or more score categories.
In this report, a limited version of the NAEP model is used where no background variables are employed, and \(\varvec{\theta }_i\) is assumed to have a multivariate normal distribution with means equal to 0 and constraints on the covariance matrix, so that the variances are equal to 1 if the covariances are all zero. Consideration of only this limited model allows us to focus on the fit of the IRT part (given by Eqs. 1 and 2) of the NAEP model rather than on the latent regression part (given by Eq. 3). If \(\mathbf z_i\) has a normal distribution, then this limited model is consistent with the general NAEP model. The suggested residuals can be extended to the case in which background variables are considered. These extensions are already present in the software employed in the analysis in this paper. For example, residuals involving item responses and background variables can indicate whether the covariances of item responses and background variables are consistent with the IRT model. The extensions can also examine whether problems with residual behavior found when ignoring background variables are removed by including background variables in the model. This possibility is present if background variables that are not normally distributed are related to latent variables.
Existing tools for assessment of fit of the NAEP IRT model

Because NAEP involves matrix sampling, common model-fit tools such as the item-fit statistics of Orlando and Thissen (2000), which are appropriate when all items are administered to all examinees, cannot be applied without modification.

Complex sampling involves both the sampling weights \(W_i\) and departures from customary assumptions of independent examinees due to sampling from finite populations and due to first sampling schools and then sampling students within schools.
The primary model-fit tool used in the NAEP operational analyses is graphical item-fit analysis using residual plots and a related \(\chi ^{2}\)-type item-fit statistic (Allen et al. 2001 p. 233), which provide guidelines for treating the items (such as collapsing categories of polytomous items, treating adjacent year data separately in concurrent calibration, or dropping items from the analysis). However, the null distribution^{1} of these residuals and of the \(\chi ^{2}\)-type statistic is unknown (Allen et al. 2001 p. 233).
Differential item functioning (DIF) analyses are also used in NAEP operational analysis to examine one aspect of multidimensionality (Allen et al. 2001 p. 233). In addition, the difference between the observed and model-predicted proportions of students obtaining a particular score on an item (Rogers et al. 2006) is also examined in NAEP operational analyses. However, the standard deviation of the difference is not used to evaluate when a difference can be considered large enough. It will be useful to incorporate the variability in the comparison of the observed and predicted proportions. As will be clear later, our proposed approach addresses this issue.
Sinharay et al. (2010) suggested a simulation-based model-fit technique similar to the bootstrap method (e.g., Efron and Tibshirani 1993) to assess the fit of the NAEP statistical model. However, their suggested statistics were computed at the booklet level rather than for the whole data set, and the p-values of the statistics under the null hypothesis of no misfit did not always follow a uniform distribution and were smaller than what was expected.
The above review shows that there is need for further research on the assessment of fit of the NAEP IRT model. We address that need by suggesting two new types of residuals to assess the fit of the NAEP IRT model.
Methods
Item-fit analysis using residuals
To assess item fit, Bock and Haberman (2009) and Haberman et al. (2013) employed a form of residual analysis in the context of regular IRT applications (that do not involve complex sampling or matrix sampling) that involves a comparison of two approaches to estimation of the item response function.
If the model holds and the sample size is large, then \(t_{jk}(y|\varvec{\theta })\) has an approximate standard normal distribution. Arguments used in Haberman et al. (2013) for simple random sampling without replacement (where all \(W_i=1\), \(p=1\), and all \(J_{ik}\)’s are equal) apply virtually without change to the sampling procedures under study. The asymptotic variance estimate \(s(\bar{X})\) is simply computed for \(X_i=\hat{d}_{yijk}(\varvec{\theta })\) for i in \(K_{jk}\) and \(X_i=0\) for i not in \(K_{jk}\) based on the complex sampling procedure used for the data. If the model does not fit the data and the sample is large, then the proportion of statistically significant residuals \(t_{jk} (y|\varvec{\theta })\) will be much larger than the nominal level.
Figure 1 shows examples of such plots for two dichotomous items. In each case, \(p=1\). For each item, the examinee proficiency is plotted along the x-axis, the solid line denotes the values of the estimated ICC, that is, \(\hat{f}_{j1}(1|\varvec{\theta })\) from Eq. 8 for the item and for the vector \(\varvec{\theta }\) with the single element \(\theta _1\), and the two dashed lines denote a pointwise 95 % confidence band consisting of the values of \(\bar{f}_{j1}(1|\varvec{\theta }) - 2s_{j1} (1|\varvec{\theta })\) and \(\bar{f}_{j1}(1|\varvec{\theta })+2s_{j1} (1|\varvec{\theta })\), where \(\bar{f}_{j1}(1|\varvec{\theta })\) is given by Eq. 7. If the solid line falls outside this confidence band, that would indicate a statistically significant residual. These plots are similar to the plots of item fit provided by IRT software packages such as PARSCALE (Du Toit 2003). In Fig. 1, the right panel corresponds to an item for which substantial misfit is observed and the left panel corresponds to an item for which no statistically significant misfit is observed (the solid line almost always lies within the 95 % confidence band).
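The band check described for these plots can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the arrays `f_hat`, `f_bar`, and `s` stand in for the two estimates of the item response function and the standard error at each proficiency value, and the numbers used below are hypothetical, not NAEP estimates or output of the ETS mirt software.

```python
import numpy as np

def flag_outside_band(f_hat, f_bar, s, width=2.0):
    """Return True where f_hat lies outside the pointwise band f_bar +/- width * s."""
    f_hat = np.asarray(f_hat, dtype=float)
    f_bar = np.asarray(f_bar, dtype=float)
    s = np.asarray(s, dtype=float)
    lower = f_bar - width * s
    upper = f_bar + width * s
    return (f_hat < lower) | (f_hat > upper)

# Hypothetical inputs on a coarse proficiency grid (not real NAEP values).
theta = np.linspace(-3.0, 3.0, 7)
f_hat = 1.0 / (1.0 + np.exp(-theta))   # stand-in for the model-based ICC
f_bar = f_hat.copy()
f_bar[-1] -= 0.10                      # the two estimates disagree only at theta = 3
s = np.full(theta.shape, 0.02)         # stand-in standard errors

flags = flag_outside_band(f_hat, f_bar, s)
print(theta[flags])                    # proficiency values with a flagged residual
```

With a band half-width of 2 standard errors, a flagged point corresponds to a residual larger than 2 in absolute value at that proficiency value.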
The ETS mirt software (Haberman 2013) was used to compute residuals for item fit for the NAEP data sets. The program is available on request for noncommercial use.
This item-fit analysis can be considered a more sophisticated version of the graphical item-fit analysis operationally employed in NAEP. While the asymptotic distribution of the residuals is not known in the analysis employed operationally, it is known in our proposed item-fit analysis.
Generalized residual analysis
Generalized residual analysis for assessing the fit of IRT models in regular applications (that do not involve complex sampling or matrix sampling) was suggested by Haberman (2009) and Haberman and Sinharay (2013). The methodology is very general and a variety of model-based predictions can be examined under the framework.
A statistically significant absolute value of the generalized residual t indicates that the IRT model does not adequately predict the statistic O.
It is possible to create graphical plots using these generalized residuals. For example, one can create a plot showing the values of O and a 95 % confidence interval given by \(\hat{E} \pm 1.96s\). A value of O lying outside this confidence interval would indicate a generalized residual significant at the 5 % level. The ETS mirt software (Haberman 2013) was used to perform the computations for the generalized residuals.
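As a rough illustration of the interval check just described, the sketch below assumes that the generalized residual takes the standardized form \(t = (O - \hat{E})/s\), which is consistent with the interval \(\hat{E} \pm 1.96s\). The function names and the numerical values are hypothetical and are not taken from the ETS mirt software.

```python
def generalized_residual(observed, expected, se):
    """Standardized difference between an observed statistic O and its model-based estimate."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    return (observed - expected) / se

def confidence_interval(expected, se, z=1.96):
    """Interval expected +/- z * se; O outside it means |t| > z."""
    return expected - z * se, expected + z * se

# Hypothetical values: O = observed proportion, E_hat = model prediction, s = standard error.
O, E_hat, s = 0.62, 0.60, 0.008
t = generalized_residual(O, E_hat, s)
low, high = confidence_interval(E_hat, s)
significant = not (low <= O <= high)
print(round(t, 2), significant)
```

The two checks are equivalent by construction: O falls outside the interval exactly when the absolute value of t exceeds 1.96.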
Assessment of practical significance of misfit of IRT models
George Box commented that all models are wrong (Box and Draper 1987, p. 74). Similarly, Lord and Novick (1968, p. 383) wrote that it can be taken for granted that every model is false and that we can prove it so if we collect a sufficiently large sample of data. According to them, the key question, then, is the practical utility of the model, not its ultimate truthfulness. Sinharay and Haberman (2014) therefore recommended the assessment of practical significance of misfit, which comprises the determination of the extent to which the decisions made from the test scores are robust against the misfit of the IRT models. We assess the practical significance of misfit in all of our data examples. The quantities of most practical interest among those that are operationally reported in NAEP are the subgroup means and the percent at different proficiency levels. We examine the effect of misfit on these quantities.
Data
We next describe data from four NAEP assessments that are used to demonstrate our suggested residuals. These data sets represent a variety of NAEP assessments.
NAEP 2004 and 2008 long-term trend mathematics assessment at age 9

knowledge of basic mathematical facts,

ability to carry out computations using paper and pencil,

knowledge of basic measurement formulas as they are applied in geometric settings, and

ability to apply mathematics to daily-living skills (such as those related to time and money).
NAEP 2002 and 2005 reading at grade 12
The NAEP Reading Grade 12 assessment (e.g., Perie et al. 2005) measures the reading and comprehension skills of students in grade 12 by asking them to read selected grade-appropriate passages and answer questions based on what they have read. The assessment measures three contexts for reading: reading for literary experience, reading for information, and reading to perform a task. The assessment contained a total of 145 multiple-choice and constructed-response items divided over 38 booklets. Multiple-choice items were designed to test students’ understanding of the individual texts, as well as their ability to integrate and synthesize ideas across the texts. Constructed-response items were based on consideration of the texts the students read. Each student read approximately two passages and responded to questions about what he or she read. The data set included about 26,800 examinees with 14,700 students from the 2002 sample and 12,100 students from the 2005 sample. It is assumed that there are three skills (or subscales) underlying the items, one each corresponding to the three contexts.
NAEP 2007 and 2009 mathematics at grade 8
The NAEP Mathematics Grade 8 assessment measures students’ knowledge and skills in mathematics and students’ ability to apply their knowledge in problem-solving situations. It is assumed that each item measures one of the following five skills (subscales): number properties and operations; measurement; geometry; data analysis, statistics and probability; and algebra. This Mathematics Grade 8 assessment (e.g., National Center for Education Statistics 2009) included 231 multiple-choice and constructed-response items divided over 50 booklets. The full data set included about 314,700 examinees with 153,000 students from the 2007 sample and 161,700 students from the 2009 sample.
NAEP 2009 science at grade 12
The NAEP 2009 Science Grade 12 assessment (e.g., National Center for Education Statistics 2011) included 185 multiple-choice and constructed-response items on physical science, life science, and earth and space science divided over 55 booklets. It is assumed that there is one skill underlying the items. The data set included about 11,100 examinees.
Results for simulated data
To check the Type I error rates of the item-fit residuals and of the generalized residuals for first-order marginals and second-order marginals, we simulated data that look like the above-mentioned NAEP data sets but fit the model perfectly. The simulations were performed on a subset of examinees for Mathematics Grade 8 because of the huge sample size for the test. We used the item-parameter estimates from our analyses of the NAEP data sets using the constrained 3PL/GPCM (see Table 4). Values of \(\theta\) were drawn from the normal distribution with unit variance and, for the assessments with two years, separate population means for the two years. The original booklet design was used, but sampling weights and primary sampling units were not used.
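The simulation setup above can be sketched roughly as follows. This is a minimal illustration, not the procedure used for the paper: it covers only dichotomous items under the standard 3PL response function, and the item parameters, the year means, the sample size, and the three-booklet design are all hypothetical placeholders rather than NAEP estimates or the NAEP booklet design.

```python
import numpy as np

def simulate_3pl(theta, a, b, c, rng):
    """Draw 0/1 responses with P(correct) = c + (1 - c) * logistic(a * (theta - b))."""
    z = a[None, :] * (theta[:, None] - b[None, :])
    p = c[None, :] + (1.0 - c[None, :]) / (1.0 + np.exp(-z))
    return (rng.random(p.shape) < p).astype(float)

rng = np.random.default_rng(0)

# Two assessment years with separate population means and unit variance (hypothetical means).
n_per_year = 500
year_means = [0.0, 0.2]
theta = np.concatenate([rng.normal(m, 1.0, n_per_year) for m in year_means])

# Hypothetical item parameters for 12 dichotomous items.
n_items = 12
a = rng.uniform(0.8, 2.0, n_items)   # discrimination
b = rng.normal(0.0, 1.0, n_items)    # difficulty
c = np.full(n_items, 0.2)            # guessing
y = simulate_3pl(theta, a, b, c, rng)

# Matrix sampling: each student sees one of three booklets with 8 of the 12 items.
booklets = [np.arange(0, 8), np.arange(4, 12), np.r_[0:4, 8:12]]
booklet_of = rng.integers(0, len(booklets), size=theta.size)
for k, items in enumerate(booklets):
    not_presented = np.setdiff1d(np.arange(n_items), items)
    rows = np.where(booklet_of == k)[0]
    y[np.ix_(rows, not_presented)] = np.nan   # NaN marks items missing by design
```

Because the responses are generated from the same model family that is later fitted, any residual flagged as significant in such data counts toward the Type I error rate.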
Item-fit analysis using residuals
Type I error rates
Table 1. Average Type I error (%) for generalized residuals of item response functions
Assessment | 11 points | 31 points
LTT math age 9 | 12 | 10
Reading grade 12 | 10 | 9
Math grade 8 | 9 | 8
Science grade 12 | 14 | 14
The item-fit residuals were computed at either 11 or 31 points between −3 and 3. It can be seen that more residuals are significant for larger absolute values of \(\theta\), which is in line with earlier results (see, e.g., Haberman et al. 2013; Fig. 2). In addition, there is a relationship between the Type I error rates and the test information function; Type I error rates become larger as information goes down. There is also a relationship between the Type I error rates and the sample sizes. This is best seen for the Science Grade 12 data set (last row), which is the smallest data set (for which the sample size is about 11,000, with an average of about 1900 responses per item); first, the Type I error rates for item-fit residuals computed at 11 points get closer to the nominal \(\alpha\) of .05 than those at 31 points; second, the peak of the test information function is between \(\theta =1\) and \(\theta =2\), indicating that the items are relatively difficult (note that the mean and standard deviation of \(\theta\) are fixed to zero and one for model identification purposes). Given that there are not many students with \(\theta >2\) and that even for these students the items can still be relatively difficult, the Type I error rate shows a steep incline between \(\theta =2\) and \(\theta =3\).
Thus, it can be concluded that the Type I error rates for the item-fit residuals are close to their nominal value if there are enough students and if there is substantial information in the ability range of interest.
Power
The samples of all four NAEP data sets are large and, therefore, the power to detect misfit is generally expected to be large; however, we performed additional power analysis for the item-fit residuals using the item parameters of the LTT Math Age 9 assessment. Note that this assessment consists of dichotomous items only.
 1. Nonmonotone for low \(\theta\): \(p(Y=1|\theta )=\tfrac{1}{4}\text {logit}^{-1}(-4.25(\theta +0.5))+\text {logit}^{-1}(4.25(\theta -1))\).
 2. Upper asymptote smaller than 1: \(p(Y=1|\theta )=0.7\,\text {logit}^{-1}(3.4(\theta +0.5))\).
 3. Flat for mid \(\theta\): \(p(Y=1|\theta )=0.55\,\text {logit}^{-1}(5.95(\theta +1))+0.45\,\text {logit}^{-1}(5.95(\theta -2))\).
 4. Wiggly, nonmonotone curve: \(p(Y=1|\theta )=0.65\,\text {logit}^{-1}(1.5\theta )+0.35\,\text {logit}^{-1}(\sin (3\theta ))\).
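The four misfitting response functions can be evaluated numerically with the short sketch below. It rests on two assumptions that keep every curve a valid probability: the first logistic term of curve 1 is taken to have a negative slope (which is what makes the curve nonmonotone for low \(\theta\)), and the second term of curve 3 is taken to have weight 0.45. The function names are illustrative only.

```python
import numpy as np

def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def bad_item_1(theta):
    # Nonmonotone for low theta (negative slope in the first term is an assumption).
    return 0.25 * inv_logit(-4.25 * (theta + 0.5)) + inv_logit(4.25 * (theta - 1.0))

def bad_item_2(theta):
    # Upper asymptote of 0.7 instead of 1.
    return 0.7 * inv_logit(3.4 * (theta + 0.5))

def bad_item_3(theta):
    # Flat near 0.55 for mid theta (the 0.45 weight is an assumption).
    return 0.55 * inv_logit(5.95 * (theta + 1.0)) + 0.45 * inv_logit(5.95 * (theta - 2.0))

def bad_item_4(theta):
    # Wiggly, nonmonotone curve.
    return 0.65 * inv_logit(1.5 * theta) + 0.35 * inv_logit(np.sin(3.0 * theta))

theta_grid = np.linspace(-3.0, 3.0, 31)
curves = {
    "bad item 1": bad_item_1(theta_grid),
    "bad item 2": bad_item_2(theta_grid),
    "bad item 3": bad_item_3(theta_grid),
    "bad item 4": bad_item_4(theta_grid),
}
for name, p in curves.items():
    print(name, round(float(p.min()), 3), round(float(p.max()), 3))
```

Plotting each array against `theta_grid` shows the shapes named in the list; none of them can be reproduced by a 1PL, 2PL, or 3PL response function.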
Table 2. Power (%) of the item-fit residuals for LTT math age 9 simulations
Item type | Mean (−3 to 3) | Mean (−2 to 2)
Bad item 1 | 74 | 84
Bad item 2 | 96 | 94
Bad item 3 | 86 | 90
Bad item 4 | 52 | 68
In addition, we simulated data under the 1PL, 2PL, and 3PL models and fitted the 1PL to all three data sets, the 2PL to the latter two, and the 3PL to the last one only. This setup gives us additional Type I error rates for other model types and power for the situation in which the fitted model is simpler than the data-generating model (see, e.g., Sinharay 2006; Table 1).
Table 3. Type I error (diagonals) and power (off-diagonals), in percent, of item-fit residuals for different model combinations for LTT math age 9
Fitted model | Data-generating model: 1PL | 2PL | 3PL
1PL | 9 | 65 | 66
2PL | | 11 | 27
3PL | | | 20
Generalized residual analysis
For first-order marginals, or the (weighted) proportion of students who correctly answer the dichotomous items or receive a specific score on a polytomous item, we used 25 replications for each of the four NAEP data sets. For second-order marginals,^{3} however, we used only five replications, because the computation of these residuals for a single data set is very time consuming (several hours).
For the generalized residuals for first-order marginals, the average Type I error rates at the 5 % level are 7 % for LTT Math Age 9, 1 % for Reading Grade 12, 0 % for Math Grade 8, and 6 % for Science Grade 12, respectively. Note that most of the Type I error rates for the first-order marginals are rather meaningless, because IRT models with item-specific parameters should be able to predict observed item score frequencies well. For the generalized residuals for second-order marginals, the average Type I error rates at the 5 % level are 6 % for LTT Math Age 9, 5 % for Reading Grade 12, and 6 % for Science Grade 12, respectively. Thus, the Type I error rates of the generalized residuals for the second-order marginals are close to the nominal level, and seem to be satisfactory.
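The observed first-order and second-order marginals referred to above can be sketched as weighted proportions. The sketch assumes that items not presented to a student are coded as NaN and that each proportion is normalized by the total weight of the students who were presented the item (or both items of the pair); that normalization and the function names are assumptions for illustration, not a description of the ETS mirt software.

```python
import numpy as np

def first_order_marginal(y_j, w, score=1):
    """Weighted proportion of students with the given score on item j, among those presented it."""
    seen = ~np.isnan(y_j)
    return np.sum(w[seen] * (y_j[seen] == score)) / np.sum(w[seen])

def second_order_marginal(y_j, y_k, w, score_j=1, score_k=1):
    """Weighted proportion with the given pair of scores, among students presented both items."""
    seen = ~np.isnan(y_j) & ~np.isnan(y_k)
    hit = (y_j[seen] == score_j) & (y_k[seen] == score_k)
    return np.sum(w[seen] * hit) / np.sum(w[seen])

# Hypothetical scores for five students (NaN = item not in the student's booklet) and weights.
y_j = np.array([1.0, 0.0, 1.0, np.nan, 1.0])
y_k = np.array([1.0, 1.0, 0.0, 1.0, np.nan])
w = np.array([2.0, 1.0, 1.0, 3.0, 1.0])

print(first_order_marginal(y_j, w))        # 0.8
print(second_order_marginal(y_j, y_k, w))  # 0.5
```

In the simulations described here all weights equal 1, so the same functions reduce to simple proportions.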
Results for the NAEP data
We fitted the one-parameter logistic (1PL) model, 2PL model, 3PL model with constant guessing (C3PL), and 3PL model to the dichotomous items and the partial credit model (PCM) and GPCM to the polytomous items to each of the above-mentioned data sets. For the LTT Math Age 9, Reading Grade 12, and Math Grade 8 data, which had two assessment years, a dummy predictor was used so that population means for the two years are allowed to differ. The ETS mirt software (Haberman 2013) was used to perform all the computations, including the fitting of the IRT models and the computation of the residuals.
Table 4. Relative model fit statistics (PEGH) for unidimensional (1D) and multidimensional (MD) models
Model | LTT math age 9 | Reading grade 12 | Math grade 8 | Science grade 12
1D 1PL/PCM | 0.465 | 0.634 | 0.607 | 0.643
1D 2PL/GPCM | 0.456 | 0.629 | 0.601 | 0.636
1D C3PL/GPCM | 0.455 | 0.629 | 0.600 | 0.634
1D 3PL/GPCM | 0.454 | 0.629 | 0.600 | 0.634
MD 3PL/GPCM | – | 0.628^{a} | 0.600^{b} | –
Table 5. Correlations between dimensions in five-dimensional 3PL/GPCM for math grade 8 data
Dimension | 2 | 3 | 4 | 5
1. Number properties and operations | .97 | .93 | .96 | .95
2. Measurement | – | .96 | .96 | .94
3. Geometry | | – | .93 | .92
4. Data analysis and probability | | | – | .94
5. Algebra | | | | –
We next summarize the results from the application of our suggested IRT model-fit tools to data from the four above-mentioned NAEP assessments using the unidimensional model combinations.
Item-fit analysis using residuals
For each of the four NAEP data sets, we computed the item-fit residuals at 31 equally spaced values of the proficiency scale between −3 and 3 for each score category (except for the lowest score category) of each item. Haberman et al. (2013) recommended the use of 31 values. Further, some limited analysis showed that the use of a different number of values does not change the conclusions. This resulted in, for example, 31 residuals for a binary item and 62 residuals for a polytomous item with three score categories.
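The counting in the paragraph above follows from multiplying the number of score categories, excluding the lowest, by the number of grid points. A small sketch, with illustrative names only:

```python
import numpy as np

# 31 equally spaced proficiency values between -3 and 3 (spacing 0.2).
theta_grid = np.linspace(-3.0, 3.0, 31)

def n_item_fit_residuals(n_categories, n_points=31):
    """One residual per grid point for every score category except the lowest."""
    if n_categories < 2:
        raise ValueError("an item needs at least two score categories")
    return (n_categories - 1) * n_points

print(n_item_fit_residuals(2))  # binary item: 31
print(n_item_fit_residuals(3))  # three score categories: 62
```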
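Table 6. Percent significant residuals under different unidimensional models
Residual | Model | LTT math age 9 | Reading grade 12 | Math grade 8 | Science grade 12
Item-fit residual | 1PL/PCM | 67 | 43 | 75 | 47
Item-fit residual | 2PL/GPCM | 40 | 26 | 64 | 33
Item-fit residual | C3PL/GPCM | 35 | 27 | 64 | 28
Item-fit residual | 3PL/GPCM | 28 | 29 | 64 | 33
First-order marginal | 1PL/PCM | 0 | 0 | 0 | 0
First-order marginal | 2PL/GPCM | 0 | 0 | 0 | 0
First-order marginal | C3PL/GPCM | 29 | 5 | 0 | 19
First-order marginal | 3PL/GPCM | 0 | 0 | 0 | 0
Second-order marginal | 1PL/PCM | 47 | 15 | 27 | 18
Second-order marginal | 2PL/GPCM | 31 | 13 | 19 | 15
Second-order marginal | C3PL/GPCM | 31 | 13 | 19 | 15
Second-order marginal | 3PL/GPCM | 31 | 14 | 19 | 15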
In the operational analysis, the numbers of items that were found to be misfitting and removed from the final computations were two, one, zero, and six, respectively, for the four assessments.
Generalized residual analysis
The second block of Table 6 shows the percentage of statistically significant generalized residuals for the first-order marginal without any adjustment (that is, larger than 1.96 in absolute value) for all data sets and different models. The percentages are all zero except for the C3PL/GPCM. This can be explained by the fact that all but the C3PL/GPCM have item-specific parameters that can predict the observed proportions of item scores quite well. Only the C3PL/GPCM can have issues with this prediction, for example, if there is variation in guessing behaviors. The latter seems to be the case for LTT Math Age 9 but not for Math Grade 8.
The third block of Table 6 shows the percentage of statistically significant generalized residuals for the second-order marginals. The percentages are considerably larger than the nominal level (and also larger than the Type I error rates found in the simulation study) and show that the NAEP model does not adequately account for the association among the items. The misfit is most apparent for LTT Math Age 9.
Several generalized residuals are smaller than −10, which provides strong evidence of misfit of the IRT model to second-order marginals.
Researchers such as Bradlow et al. (1999) noted that if the IRT model cannot properly account for the dependence between item pairs, then the precision of proficiency estimates will be overestimated, and showed that accounting for the dependence using, for example, the testlet model would not lead to overestimation of the precision of proficiency estimates. Their result implies that if we found too many significant generalized residuals for second-order marginals for items belonging to a common stimulus (also referred to as testlets by, for example, Bradlow et al. 1999), then application of a model like the testlet model (Bradlow et al. 1999) would lead to better fit to the NAEP data. However, we found that the proportion of significant generalized residuals for second-order marginals for item pairs belonging to testlets is roughly the same as that for item pairs not belonging to testlets. Thus, there does not seem to be an easy way to rectify the misfit of the NAEP IRT model to the second-order marginals.
Assessment of practical significance of misfit
To assess the practical significance of item misfit for the four assessments, we obtained the overall and subgroup means and the percentage of examinees at different proficiency levels (we considered the percentages at basic or above, and proficient or above) from the operational analysis. These quantities are reported as rounded integers in operational NAEP reports (e.g., Rampey et al. 2009). Note that these quantities were computed after omitting the items that were found misfitting in the operational analysis (2, 1, 0 and 6 such items for the four assessments). Then, for each assessment, we found the nine items that had the largest number of statistically significant item-fit residuals. For example, for the 2008 long-term trend Mathematics assessment at Age 9, nine items with respectively 19, 19, 19, 19, 18, 18, 17, 17 and 16 statistically significant item-fit residuals (out of a total of 31 each) were found.
For each assessment, we omitted scores on the nine misfitting items and ran the NAEP operational analysis to recompute the subgroup means (rounded and converted to the NAEP operational score scale) and the percentage of examinees at different proficiency levels. We compared these recomputed values to the corresponding original (and operationally reported) quantities.
Interestingly, in 48 such comparisons of means and percentages for each of the four data sets, there was no difference in 44, 36, 32 and 47 cases, respectively, for the long-term trend, reading, math and science data sets. For example, the overall average score is 243 (on a 0-500 scale) and the overall percent scoring 200 or above is 89 in both of these analyses for the 2008 long-term trend Mathematics assessment at age 9. In the cases when there was a difference, the difference was one in absolute value. For example, the operationally reported overall percent at 250 or above is 44 while the percent at 250 or above after removing 9 misfitting items is 45 for the 2008 long-term trend Mathematics assessment at age 9.
Thus, the practical significance of the item misfit seems to be negligible for the four data sets.
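The agreement counts reported above can be turned into the number of changed quantities and the share of unchanged ones with a few lines of arithmetic. Only the counts (48 comparisons; 44, 36, 32 and 47 without a difference) come from the text; the dictionary keys are shorthand labels.

```python
n_comparisons = 48
no_difference = {
    "long-term trend": 44,
    "reading": 36,
    "math": 32,
    "science": 47,
}

# Comparisons in which the recomputed value differed (by one in absolute value).
changed = {name: n_comparisons - same for name, same in no_difference.items()}
# Share of comparisons with identical rounded values, in percent.
percent_same = {name: 100.0 * same / n_comparisons for name, same in no_difference.items()}

for name in no_difference:
    print(name, changed[name], round(percent_same[name], 1))
```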
Conclusions
The focus of this paper was on the assessment of misfit of the IRT model used in large-scale survey assessments such as NAEP using data from four NAEP assessments. Two sets of recently suggested model-fit tools, the item-fit residuals (Bock and Haberman 2009; Haberman et al. 2013) and generalized residuals (Haberman and Sinharay 2013), were modified for application to NAEP data.
Keeping in mind the importance of NAEP in educational policy-making in the U.S., this paper promises to make a significant contribution by performing a rigorous check of the fit of the NAEP model. Replacement of the current NAEP item-fit procedure by our suggested procedure would make the NAEP statistical toolkit more rigorous. Because several other assessments such as IALS, TIMSS and PIRLS use essentially the same statistical model as in NAEP, the findings of this paper will be relevant to those assessments as well.
An important finding in this paper is that statistically significant misfit (in the form of significant residuals) was found for all the data sets. This finding concurs with the statement of George Box that all models are wrong (Box and Draper 1987, p. 74) and a similar statement of Lord and Novick (1968, p. 383). However, the observed misfit was not practically significant for any of the data sets. For example, the item-fit residuals were statistically significant for several items, but the removal of some of these items led to negligible differences in the reported outcomes such as subgroup means and percentages at different proficiency levels. Therefore, the NAEP operational model seems to be useful though it is “wrong” (in the sense that the model was found misfitting to the NAEP data using the suggested residuals) from the viewpoint of George Box. It is possible that the lack of practical significance of the misfit is due to the thorough test development and review procedures used in NAEP, which may filter out any serious IRT model-fit issues. The finding of the lack of practical significance of the misfit is similar to the finding in Sinharay and Haberman (2014) that the misfit of the operational IRT model used in several large-scale high-stakes tests is not practically significant.
Several issues can be examined in future research. First, one could apply our suggested methods to data sets from other large-scale educational survey assessments such as TIMSS, PIRLS, and IALS. Second, Haberman et al. (2013) provided detailed simulation results demonstrating that the Type I error rates of their item-fit residuals in regular IRT applications are quite close to the nominal level as the sample size increases and those results are expected to hold for our suggested item-fit residuals (that are extensions of the residuals of Haberman et al. 2013) as well, but it is possible to perform simulation studies to verify that. It is also possible to perform simulations to find out the extent of model misfit that would be practically significant. Third, we studied the practical consequences of item misfit in this paper; it is possible in future research to study the practical consequences of multidimensionality; for example, there is a close relationship between DIF and multidimensionality (e.g., Camilli 1992) and it would be of interest to study the practical consequences of multidimensionality on DIF. Fourth, it is possible to further explore the reasons for the first-order marginal not being useful in our analysis. Sinharay et al. (2011) also found the generalized residuals of Haberman and Sinharay (2013) for the first-order marginal to be not useful in assessing the fit of regular IRT models. These residuals might be more useful to detect differential item functioning (DIF). For example, the generalized residuals for the first-order marginals for males and females can be used to study gender-based DIF (although the Type I error might be low, the power would be larger). Finally, several students taking the NAEP, especially those in twelfth grade, lack motivation (e.g., Pellegrino et al. 1999). It would be interesting to examine whether that lack of motivation affects the model fit in any manner.
The software does not employ a finite population correction that is typically used when the sampling is without replacement, as in NAEP; this is a possible area of future research. It is anticipated that the finite population correction would not affect our results because of the large sample sizes in NAEP.
or the weighted proportion of students who correctly answer a pair of dichotomous items or receive a specific pair of scores on a pair of items one of which is polytomous.
Declarations
Authors’ contributions
PWVR carried out most of the computations and wrote a major part of the manuscript. SS wrote the first draft of the manuscript and performed some of the computations. SJH suggested the mathematical results. MSJ wrote some parts of the manuscript and performed some computations. All authors read and approved the final manuscript.
Acknowledgements
The authors thank the editor Matthias von Davier and the two anonymous reviewers for helpful comments. The research reported here was partially supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D120006 to Educational Testing Service as part of the Statistical and Research Methodology in Education Initiative.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 Adams, R. J., Wilson, M. R., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23.
 Allen, N. A., Donoghue, J. R., & Schoeps, T. L. (2001). The NAEP 1998 technical report (NCES 2001452). Washington, DC: United States Department of Education, Institute of Education Sciences, Department of Education, Office for Educational Research and Improvement.
 American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
 Beaton, A. E. (1987). Implementing the new design: The NAEP 1983–84 technical report (Tech. Rep. No. 15TR20). Princeton, NJ: ETS.
 Beaton, A. E. (2003). A procedure for testing the fit of IRT models for special populations: Draft. Unpublished manuscript.
 Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Reading: Addison-Wesley.
 Bock, R. D., & Haberman, S. J. (2009). Confidence bands for examining goodness-of-fit of estimated item response functions. Paper presented at the annual meeting of the Psychometric Society, Cambridge, UK.
 Box, G. E. P., & Draper, N. R. (1987). Empirical model-building and response surfaces. New York, NY: Wiley.
 Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.
 Camilli, G. (1992). A conceptual analysis of differential item functioning in terms of a multidimensional item response model. Applied Psychological Measurement, 16, 129–147.
 Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: Wiley.
 Debeer, D., & Janssen, R. (2013). Modeling item-position effects within an IRT framework. Journal of Educational Measurement, 50, 164–185.
 Dresher, A. R., & Thind, S. K. (2007). Examination of item fit for individual jurisdictions in NAEP. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.
 du Toit, M. (2003). IRT from SSI. Lincolnwood, IL: Scientific Software International.
 Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
 Gilula, Z., & Haberman, S. J. (1994). Models for analyzing categorical panel data. Journal of the American Statistical Association, 89, 645–656.
 Gilula, Z., & Haberman, S. J. (1995). Prediction functions for categorical panel data. The Annals of Statistics, 23, 1130–1142.
 Haberman, S. J. (2009). Use of generalized residuals to examine goodness of fit of item response models (ETS Research Report RR0915). Princeton: ETS.
 Haberman, S. J. (2013). A general program for item-response analysis that employs the stabilized Newton-Raphson algorithm (ETS Research Report RR1332). Princeton: ETS.
 Haberman, S. J., & Sinharay, S. (2013). Generalized residuals for general models for contingency tables with application to item response theory. Journal of the American Statistical Association, 108, 1435–1444.
 Haberman, S. J., Sinharay, S., & Chon, K. H. (2013). Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions. Psychometrika, 78, 417–440.
 Kirsch, I. S. (2001). The International Adult Literacy Survey (IALS): Understanding what was measured (ETS Research Report RR0125). Princeton: ETS.
 Li, J. (2005). The effect of accommodations for students with disabilities: An item fit analysis. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, CA.
 Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading: Addison-Wesley.
 Martin, M. O., & Kelly, D. L. (1996). Third international mathematics and science study technical report volume 1: Design and development. Chestnut Hill: Boston College.
 Mislevy, R. J., Johnson, E. G., & Muraki, E. (1992). Scaling procedures in NAEP. Journal of Educational Statistics, 17, 131–154.
 Mullis, I., Martin, M., & Gonzalez, E. (2003). PIRLS 2001 international report: IEA’s study of reading literacy achievement in primary schools. Chestnut Hill: Boston College.
 Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
 National Center for Education Statistics. (2009). The nation’s report card: Mathematics 2009 (Tech. Rep. No. NCES 2010451). Washington, DC: Institute of Education Sciences, U.S. Department of Education.
 National Center for Education Statistics. (2011). The nation’s report card: Science 2009 (Tech. Rep. No. NCES 2011451). Washington, DC: Institute of Education Sciences, U.S. Department of Education.
 Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.
 Pellegrino, J. W., Jones, L. R., & Mitchell, K. J. (1999). Grading the nation’s report card: Evaluating NAEP and transforming the assessment of educational progress. Washington, DC: National Academy Press.
 Perie, M., Grigg, W., & Donahue, P. (2005). The nation’s report card: Reading 2005 (Tech. Rep. No. NCES 2006451). Washington, DC: U.S. Government Printing Office: U.S. Department of Education, National Center for Education Statistics.
 Rampey, B. D., Dion, G. S., & Donahue, P. L. (2009). NAEP 2008 trends in academic progress (Tech. Rep. No. NCES 2009479). Washington, DC: National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education.
 Rogers, A., Gregory, K., Davis, S., & Kulick, E. (2006). Users guide to NAEP model-based p-value programs. Unpublished manuscript. Princeton: ETS.
 Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. British Journal of Mathematical and Statistical Psychology, 59, 429–449.
 Sinharay, S., Guo, Z., von Davier, M., & Veldkamp, B. P. (2010). Assessing fit of latent regression models. IERI Monograph Series, 3, 35–55.
 Sinharay, S., & Haberman, S. J. (2014). How often is the misfit of item response theory models practically significant? Educational Measurement: Issues and Practice, 33(1), 23–35.
 Sinharay, S., Haberman, S. J., & Jia, H. (2011). Fit of item response theory models: A survey of data from several operational tests (ETS Research Report No. RR1129). Princeton: ETS.
 Von Davier, M., & Sinharay, S. (2014). Analytics in international large-scale assessments: item response theory and population models. In L. Rutkowski, M. Von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: background, technical issues, and methods of data analysis (pp. 155–174). Boca Raton: CRC.