Assuming measurement invariance of background indicators in international comparative educational achievement studies: a challenge for the interpretation of achievement differences
 Heike Wendt^{1},
 Daniel Kasper^{1} and
 Matthias Trendtel^{2}
DOI: 10.1186/s40536-017-0043-9
© The Author(s) 2017
Received: 14 August 2015
Accepted: 9 February 2017
Published: 16 March 2017
Abstract
Background
Large-scale cross-national studies designed to measure student achievement use different social, cultural, economic and other background variables to explain observed differences in that achievement. Prior to their inclusion into a prediction model, these variables are commonly scaled into latent background indices. To allow cross-national comparisons of the latent indices, measurement invariance is assumed. However, it is unclear whether the assumption of measurement invariance has some influence on the results of the prediction model, thus challenging the reliability and validity of cross-national comparisons of predicted results.
Methods
To establish the effect size attributed to different degrees of measurement invariance, we rescaled the ‘home resources for learning index’ (HRL) for the 37 countries (\(n=166,709\) students) that participated in the IEA’s combined ‘Progress in International Reading Literacy Study’ (PIRLS) and ‘Trends in International Mathematics and Science Study’ (TIMSS) assessments of 2011. We used (a) two different measurement models [one-parameter model (1PL) and two-parameter model (2PL)] with (b) two different degrees of measurement invariance, resulting in four different models. We introduced the different HRL indices as predictors in a generalized linear mixed model (GLMM) with mathematics achievement as the dependent variable. We then compared three outcomes across countries and by scaling model: (1) the differing fit values of the measurement models, (2) the estimated discrimination parameters, and (3) the estimated regression coefficients.
Results
The least restrictive measurement model fitted the data best, and the degree of assumed measurement invariance of the HRL indices influenced the random effects of the GLMM in all but one country. For one-third of the countries, the fixed effects of the GLMM also related to the degree of assumed measurement invariance.
Conclusion
The results support the use of country-specific measurement models for scaling the HRL index. In general, equating procedures could be used for cross-national comparisons of the latent indices when country-specific measurement models are fitted. Cross-national comparisons of the coefficients of the GLMM should take into account the applied measurement model for scaling the HRL indices. This process could be achieved by, for example, adjusting the standard errors of the coefficients.
Keywords
PIRLS/TIMSS combined; Invariance background models; Measurement and prediction invariance; Generalized linear mixed model; Sensitivity analyses for variance components
Background
Introduction
In order to report international trends in educational achievement over time and to compare achievement results across countries, the International Association for the Evaluation of Educational Achievement (IEA) conducts, among other studies, regular iterations of the Progress in International Reading Literacy Study (PIRLS) and the Trends in International Mathematics and Science Study (TIMSS). PIRLS has assessed the reading comprehension achievement of fourth-grade students every 5 years since 2001 (Mullis et al. 2012a), while TIMSS has assessed the mathematics and science achievement of fourth- and eighth-grade students every 4 years since 1995 (Martin et al. 2012). In 2011, IEA conducted both studies jointly for the first time. Thirty-four countries and three benchmark participants collected data on Grade 4 students’ educational achievement in three competence domains: reading comprehension, mathematics, and science (Martin and Mullis 2013).
In their efforts to explain observed achievement differences in the data from large-scale assessment studies, researchers have increasingly combined different background indicators (Bos et al. 2012; Martin et al. 2008; Mullis et al. 2007, 2008; OECD 2014a) by scaling them into latent background variables. Scaling these variables usually requires application of an item response theory (IRT) model (Martin and Mullis 2012; OECD 2014b). The approach has several advantages, among which is the ability to control the measurement errors in the manifest variables. Controlling for measurement error is especially important in educational research studies because the multilevel prediction models commonly used in this area are very sensitive to these errors (Lüdtke et al. 2011).
Although using IRT models to scale latent background variables before including them in a prediction model works very well in large-scale assessment studies, the method presents several challenges (van den Heuvel-Panhuizen et al. 2009). First, researchers wanting to use latent indices instead of manifest indicators need to develop a coherent theoretical framework for the construct they intend to measure. Second, they need to define the assessment’s desired target population and the sampling procedure. Third, they need to choose not only a suitable measurement model for the construct but also a statistical model that will allow them to scale the latent indices according to this model. Finally, they must specify a useful and appropriate prediction model.
These tasks also need to be considered within the context of two central challenges that researchers face when conducting cross-national studies of educational achievement. The first centers on the need to ensure that the indices used for international comparison are comparable across the countries participating in each study (Nagengast and Marsh 2013), and the second concerns the need to ensure that the latent variables are comparable across the participating countries. Researchers conducting these large-scale assessment studies usually endeavor to meet these challenges by assuming measurement invariance across countries when they scale the latent indices. However, as work by Millsap (1995, 1997, 1998, 2007) shows, measurement invariance and predictive invariance are generally inconsistent with each other. Thus, when researchers assume that there will be measurement invariance across countries and then, during data analysis, use the scaled latent indices as predictors in the country-specific prediction models, the prediction coefficients across countries will only be the same under very restricted conditions. However, researchers are unlikely to deem these conditions reasonable in practice. What is obvious here is that the different decisions that those designing large-scale assessment studies must make before latent indices can be used will influence the results of these studies. Generalizability theory calls these sources of influence facets or dimensions, and emphasizes that researchers must take the variance in those research results that can be traced back to these dimensions into account before they attempt to generalize the results (Brennan 2001).
The aim of the study presented in this paper was to investigate the extent to which the assumption of cross-national measurement invariance of latent background variables affected the results of prediction models that use these indices as predictors in large-scale assessment studies. To achieve this aim, we reanalyzed the PIRLS/TIMSS 2011 data that Martin et al. (2013) used in their study on effective school environment. We considered this study especially useful for the desired purpose because Martin and colleagues used latent indices scaled under the assumption of cross-national measurement invariance as predictors in their country-specific hierarchical linear models and then compared the results of these models across the countries. We considered that reanalyzing these data sets by allowing different degrees of cross-national measurement invariance could help to answer the question of whether this assumption has an influence on (1) the cross-national comparisons performed by Martin et al. (2013) in particular, and (2) the results of large-scale assessment studies that use a design comparable to the one Martin and his colleagues employed in general. We begin by providing a summary of the study by Martin and his colleagues (2013). We then describe how we conducted our study, before presenting the results from that study and a discussion of those findings.
Assessment of Martin et al.’s study
Overview
Martin et al. (2013) performed a “school effectiveness” analysis of data from the 37 countries that participated in PIRLS/TIMSS 2011. According to Martin and his colleagues (2013, p. 111), “School effectiveness analyses seek to improve educational practice by studying what makes for a successful school beyond having a student body where most of the students are from advantaged socioeconomic backgrounds.” In their analysis, Martin et al. used five school effectiveness variables and two student home background variables as predictors in the country-specific hierarchical linear models. They used students’ achievement scores (reading comprehension, mathematics achievement, science achievement) as dependent variables. Because the goal of the study was to “present an analytic framework that could provide an overview of how these relationships vary across countries” (Martin et al. 2013, p. 110), the results from the hierarchical linear modeling could be assumed to be comparable across the participating countries.
One of the major findings of the study by Martin et al. (2013) was that the strength of the relationships between the school effectiveness variables and the student achievement scores decreased substantially in nearly all 37 countries when Martin et al. included the home background control variables in their models; country-specific effects were also apparent. For example, in 15 countries, only one out of the five effectiveness indicators still presented a statistically significant prediction coefficient after Martin and his colleagues had controlled for students’ home background. In four countries, three prediction coefficients remained significant. If the results of these analyses were, in fact, comparable across countries, in most countries the strength of the relationships between school effectiveness variables and student achievement should be relatively weak after controlling for student home background.
However, by scaling the school effectiveness variables and the home background variables as latent variables, Martin et al. (2013) assumed measurement invariance across countries (see the next section). Thus, it is also possible that the cross-national variation of the prediction coefficients of the school effectiveness variables and the home background variables was at least partially a methodological artifact due to the general inconsistency of measurement invariance and predictive invariance. Studying the relationship between assumed measurement invariance and the observed prediction coefficients more closely therefore seems worthwhile. We accordingly decided that reanalyzing one of the data sets that Martin et al. (2013) used would be a useful exercise. We determined we could rescale one of the home background control variables (the “home resources for learning scale”, hereafter HRL) while assuming different degrees of cross-national measurement invariance. We could then, in an effort to explain students’ mathematics achievement, introduce the rescaled variable as a predictor in a generalized linear mixed model (GLMM).
We considered that reducing our reanalysis to only one independent variable out of the eight and one dependent variable out of the three that Martin et al. (2013) used would lead to a valuable reduction of complexity, particularly given that no other study has yet analyzed the relationship between measurement invariance and predictive invariance in large-scale assessment study data. Therefore, nothing is known about possible interaction or compensatory effects in situations where the relationship between measurement invariance and predictive invariance affects more than one latent variable. We believed a reduced model would consequently increase the likelihood of finding such effects in the PIRLS/TIMSS data sets.
Also, because the selection of the HRL indices is somewhat arbitrary, we decided it would make sense to concentrate on the HRL variable. Many large-scale assessment studies have shown that the cross-national assumption of measurement invariance is unlikely to hold for social background variables (see, for example, Caro and Sandoval-Hernandez 2012; Hansson and Gustafsson 2013; Lakin 2012). Therefore, rescaling the HRL in a way that assumes measurement noninvariance would be consistent with the findings of this prior research. In addition, it is plausible to assume that indicators of the HRL indices will show country-specific characteristics. For example, the indicator “students have own room at home” could, in some countries, be a very important indicator with respect to differentiating students with many home resources from students with only a few home resources. However, in most of the countries participating in PIRLS/TIMSS 2011, this indicator was unlikely to be a strong one because nearly all of the students had their own room at home. In terms of the IRT approach, this indicator should therefore show cross-national variation in the discrimination parameter.
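The role of the discrimination parameter can be illustrated with a minimal Python sketch. It uses a two-parameter logistic response function for a dichotomous item, which is a simplification because the HRL items are polytomous. The function names and all parameter values are hypothetical and are not taken from the PIRLS/TIMSS data. The sketch compares how strongly a weakly and a strongly discriminating item separate students with low and high latent home resources.

```python
import math


def p_endorse(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic probability of endorsing an item.

    theta: latent trait level, a: discrimination, b: location (difficulty).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def contrast(a: float, b: float, low: float = -1.0, high: float = 1.0) -> float:
    """Difference in endorsement probability between a high and a low trait level."""
    return p_endorse(high, a, b) - p_endorse(low, a, b)


# Hypothetical values: same location, different discrimination.
weak_item = contrast(a=0.3, b=0.0)    # item that barely separates students
strong_item = contrast(a=2.0, b=0.0)  # item that separates students clearly

print(f"weak item contrast:   {weak_item:.3f}")
print(f"strong item contrast: {strong_item:.3f}")
```

If an indicator behaves like the weak item in one country and like the strong item in another, a single discrimination parameter shared across countries cannot describe both.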
It is useful at this point to outline the procedures on which the study of Martin et al. (2013) was based, especially those used to scale the HRL indices. This explanation may seem unnecessary given the wealth of literature on IRT models, but we consider it necessary for two reasons. First, Martin et al. did not explicitly use the term measurement invariance in their report. We are therefore left with the notion that they simply assumed there was measurement invariance. Second, a clear description of the scaling model they used is required to illustrate why we deemed it necessary to use a modified version of this model in our reanalysis. We also considered it necessary to introduce the prediction model.
Scaling procedures used to develop the HRL index
Martin et al. (2013) used, as indicators for the HRL index, three items from the PIRLS 2011 home questionnaire (the “Learning to Read Survey”) given to the parents of the students who participated in the study, and two items from the PIRLS 2011 student questionnaire. The home questionnaire items were “number of children’s books in the home,” “highest level of education of either parent,” and “highest level of occupation of either parent.” The student questionnaire items were “number of books in the home” and “number of home study supports” (see Table 1). The PIRLS and TIMSS studies use these items as indicators of the economic and cultural capital of students’ families (Mullis et al. 2012b). The positive association between these indicators and student achievement is evident in many of the reported findings from large-scale studies of educational achievement (see, for example, Martin et al. 2008, 2012; Mullis et al. 2007, 2008, 2012a; OECD 2014a). In line with Bourdieu’s (1986) work on cultural capital, the HRL index can thus be interpreted as a measure of students’ socioeconomic and cultural home learning environments (Smith et al. 2016).
The prediction model used to explain student achievement
We should mention, however, that Martin et al. (2013) included the random effects only when there was significant variation in the relationship between the WLEs and achievement across schools and only when they could estimate this relationship reliably. Furthermore, they usually used the variance components \(\sigma ^2_{\alpha }=\text{var}(\alpha )\) and not the coefficients for \(\alpha\) to estimate these effects. In addition, because Martin et al. used plausible values for y, they performed all analyses five times and averaged the results according to Rubin’s formulas (Rubin 1987).
Comments on Martin and colleagues’ procedures
In order to address the challenges identified above, the construct underlying the HRL index needed to be based on a coherent and robust theoretical framework. Such a framework can indeed be derived by drawing on various conceptualizations of capital (Bourdieu 1986; Coleman 1988). However, because the HRL index drew on only five indicators (from the many available), it was very narrowly defined. We consider that the index would have particularly benefited from inclusion of the more reliable and valid indicators of social reproduction (Caro et al. 2014). Martin et al.’s (2013) assumption of measurement invariance also merits consideration for two reasons. First, cross-national and comparative research in various disciplines challenges the validity of this assumption (Çetin 2010; Caro et al. 2014; Hansson and Gustafsson 2013; Schulte et al. 2013; Schulz 2005; Segeritz and Pant 2013). We therefore assumed that at least some of the HRL indicators would show differential item functioning across the participating countries. For example, having an internet connection and/or a room of one’s own may be more discriminating indicators of social status among students in southern or eastern European countries than among students in central European countries. Also, it seems prudent to conceptualize highest level of occupation of either parent in terms of the characteristics of each country. For example, small business ownership might represent high social status in some countries but denote a broader category representing both lower and middle social status in other countries. These considerations suggest that the apparent lack of research studies on the invariance of the HRL index across countries needs to be remedied.
The second reason for scrutinizing the assumption of measurement invariance relates to the general inconsistency of measurement invariance and predictive invariance shown in the work by Millsap (1995, 1997, 1998, 2007). If the HRL index is not measurement invariant across countries, the implication is that the variance of the coefficients of the hierarchical linear model across countries is a purely methodological artifact. In addition, where this methodological variance does exist, then, according to generalizability theory (Brennan 2001), it should be added to the actual variance of the coefficients across countries (by, for example, increasing the standard errors of the coefficients). However, enacting this proviso is difficult because the size of the effect between measurement invariance and predictive invariance is presently unclear. The same can be said of the relationships between different degrees of measurement invariance, different measurement models, and other (more general) prediction models.

1. To what extent can measurement invariance across participating countries be assumed for the HRL index of Grade 4 students assessed in the combined PIRLS and TIMSS studies of 2011?
2. If the assumption of measurement invariance does not hold, to what extent do country-specific measurement models differ?
3. Is there an effect of different degrees of measurement invariance on the parameter estimates of the prediction model?
4. If there is an effect, how large is it?
We began by addressing the first research question. Here, we fitted two different measurement models with two different degrees of measurement invariance to the combined data and then used well-established fit criteria to compare the resulting models. To answer the second research question, we compared the discrimination parameters of the measurement models across countries, a procedure that allowed us to derive the country-specific measurement validity of the indicators. In order to answer the third research question, we introduced the different HRL indices as predictors in generalized linear mixed models (GLMMs) where mathematics achievement was the dependent variable. By comparing the regression coefficients across countries and across different measurement models, we were able to observe both the overall effect of different degrees of measurement invariance on the prediction coefficients and the country-specific effect on the coefficients. We also analyzed the variance component, that is, the random part of the hierarchical linear model, in the same manner as we analyzed the regression coefficients. A fuller explanation of how we conducted our analyses follows.
Methods
Data
We used the combined international data sets for all countries participating in PIRLS/TIMSS 2011.^{1} From these data sets we then drew the country-specific data files named ASG***B1 and ASH***B1, where *** stands for a country-specific code, ASG denotes the fourth-grade student background data sets, and ASH denotes the corresponding home background data sets.^{2} Our next step was to merge the different data sets, first according to countries and then according to data resources. This process resulted in a dataset that included the student background data and the home background data for \(n=166,709\) Grade 4 students across 37 participating countries.
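The second merging step, joining student and home background records, can be sketched in plain Python as follows. This is an illustration only and not the procedure used in the study, which relied on SAS. The identifier columns `IDCNTRY` and `IDSTUD`, the function name, and the example rows are assumptions; `ASBG04` and `ASBH15` are two of the HRL indicator names that appear in the results tables.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]
KEYS: Tuple[str, str] = ("IDCNTRY", "IDSTUD")  # assumed identifier columns


def merge_student_home(
    student_rows: Iterable[Row],
    home_rows: Iterable[Row],
    keys: Tuple[str, ...] = KEYS,
) -> List[Row]:
    """Left-join home background rows onto student rows by the key columns."""
    home_index = {tuple(row[k] for k in keys): row for row in home_rows}
    merged: List[Row] = []
    for student in student_rows:
        home = home_index.get(tuple(student[k] for k in keys), {})
        # Student values take precedence if a column exists in both sources.
        merged.append({**home, **student})
    return merged


# Hypothetical example rows (values are made up).
students = [
    {"IDCNTRY": 1, "IDSTUD": 101, "ASBG04": 3},
    {"IDCNTRY": 1, "IDSTUD": 102, "ASBG04": 1},
]
homes = [{"IDCNTRY": 1, "IDSTUD": 101, "ASBH15": 4}]

merged = merge_student_home(students, homes)
print(merged)
```

A left join keeps every student even when no home questionnaire was returned, so the missing home indicators can later be handled by imputation.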
Scaling procedure
1. Model 1: In this model, all discrimination parameters \(\alpha _{gi}=c\) were held constant both between the items and across the countries, whereas the threshold parameters \(\tau _{gki}=\tau _{ki}\) were allowed to vary between items but were held constant across countries. This model was the same as the one used by Martin et al. (2013).
2. Model 2: In contrast to Model 1, the discrimination parameters \(\alpha _{gi}=c_g\) were held constant between the items but allowed to vary across the countries. However, the assumptions for the threshold structure were the same as those for Model 1.
3. Model 3: Here, discrimination parameters \(\alpha _{gi}=c_i\) were allowed to vary between the items but were held constant across the countries. Again, the threshold structure remained unchanged.
4. Model 4: All discrimination parameters \(\alpha _{gi}=c_{gi}\) were allowed to vary both between the items and across the countries. As before, the threshold structure remained unchanged.
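The four constraint patterns for \(\alpha _{gi}\) can be made concrete with a short Python sketch. It only builds symbolic labels for the discrimination parameters and counts how many distinct slopes each model leaves free. The function names and the small dimensions (3 countries, 5 items) are hypothetical, and the count ignores threshold parameters and any identification constraints that a fitted model would need.

```python
from typing import List


def discrimination_labels(model: int, n_countries: int, n_items: int) -> List[List[str]]:
    """Return a country-by-item grid of symbolic labels for alpha_gi under Models 1-4."""
    def label(g: int, i: int) -> str:
        if model == 1:
            return "c"                  # one slope for all items and countries
        if model == 2:
            return f"c[g={g}]"          # one slope per country
        if model == 3:
            return f"c[i={i}]"          # one slope per item
        if model == 4:
            return f"c[g={g},i={i}]"    # one slope per country and item
        raise ValueError("model must be 1, 2, 3, or 4")

    return [[label(g, i) for i in range(1, n_items + 1)]
            for g in range(1, n_countries + 1)]


def n_free_slopes(model: int, n_countries: int, n_items: int) -> int:
    """Count the distinct discrimination parameters implied by a model."""
    grid = discrimination_labels(model, n_countries, n_items)
    return len({cell for row in grid for cell in row})


# Illustrative dimensions only: 3 countries, 5 items.
for m in (1, 2, 3, 4):
    print(f"Model {m}: {n_free_slopes(m, 3, 5)} free slope(s)")
```

The count grows from a single slope in Model 1 to one slope per country-item combination in Model 4, which is why Model 4 is the least restrictive of the four.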
Prediction model
Outcomes
Scaling models
Prediction model
Dealing with missing values, weighting and software
We used a Markov chain Monte Carlo (MCMC) method to impute missing values in the indicators of the HRL indices. The imputation model included all indicators of the HRL indices and the plausible values of mathematics achievement, and so produced five complete data sets. Of course, a fully nested imputation strategy would have resulted in 25 imputed data sets (that is, five imputed data sets for each plausible value). However, because Martin et al. (2013) applied only a single imputation strategy (which seemed to us an inaccurate approach to an analysis that involves the analysis of variance), an increase from 1 to 25 imputations would have made it impossible to compare the results of this current paper with Martin and colleagues’ results. Every analysis in our study was performed once for every completed dataset, and then the results were averaged according to Rubin’s (1987) formula. Senwgt was used as the weighting variable for the scaling models. Senwgt summed up to a total sample size of students \(n_g=500\) for every country and so led to the equal weighting of the countries in the scaling process. The GLMM analysis, however, used houwgt, which sums up to the observed sample size of students for every country. Unless we state otherwise in this paper, all the analyses in our study were generated by way of Statistical Analysis System (SAS) software, Version 9.4 (TS1M1) of the SAS System for Windows.^{3} We used the procedure MI to carry out the multiple imputations, the procedure IRT to scale the HRL index, the procedure GLIMMIX for the GLMM analysis, and the procedure CALIS for the structural equation models. We used the IML module insight of SAS to implement the derived test statistics.
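The pooling step can be sketched in Python using the standard form of Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and the total variance is the mean within-dataset variance plus (1 + 1/m) times the between-dataset variance. The function name and the five example estimates are hypothetical and do not come from the study, which carried out this step in SAS.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float], std_errors: Sequence[float]) -> Tuple[float, float]:
    """Pool m point estimates and standard errors with Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two estimates and matching standard errors")
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # sample variance across datasets (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical regression coefficients from five completed data sets.
coefs = [0.40, 0.42, 0.38, 0.41, 0.39]
ses = [0.02, 0.02, 0.02, 0.02, 0.02]

pooled_coef, pooled_se = pool_rubin(coefs, ses)
print(f"pooled coefficient: {pooled_coef:.3f}, pooled SE: {pooled_se:.4f}")
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates differ across data sets, which is how the uncertainty due to imputation enters the results.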
Results
Descriptive statistics
Items of the home resources for learning scale (fourth grade) and percentage of yes responses over all countries (n = 138,103)

| Item | Response option | % yes | SE |
|---|---|---|---|
| Number of books in the home (students) | 0–10 | 17.1 | 0.24 |
| | 11–25 | 25.9 | 0.21 |
| | 26–100 | 31.5 | 0.20 |
| | 101–200 | 14.0 | 0.15 |
| | More than 200 | 11.5 | 0.17 |
| Number of home study supports (students) | None | 12.8 | 0.18 |
| | Internet connection or own room | 36.9 | 0.19 |
| | Both | 50.3 | 0.23 |
| Number of children’s books in the home (parents) | 0–10 | 24.2 | 0.20 |
| | 11–25 | 21.9 | 0.17 |
| | 26–50 | 24.7 | 0.16 |
| | 51–100 | 17.4 | 0.14 |
| | More than 100 | 11.9 | 0.14 |
| Highest level of education of either parent (parents) | Finished some primary or lower secondary or did not go to school | 7.9 | 0.18 |
| | Finished lower secondary | 13.6 | 0.17 |
| | Finished upper secondary | 31.4 | 0.22 |
| | Finished postsecondary education | 19.1 | 0.16 |
| | Finished university or higher | 28.0 | 0.28 |
| Highest level of occupation of either parent (parents) | Has never worked outside home for pay, general laborer, or semiprofessional (skilled agricultural or fishery worker, craft or trade worker, plant or machine operator) | 27.7 | 0.21 |
| | Clerical (clerk or service or sales worker) | 25.8 | 0.16 |
| | Small business owner | 13.1 | 0.14 |
| | Professional (corporate manager or senior official, professional, or technician or associate professional) | 33.4 | 0.23 |
Descriptive statistics for mathematics achievement, early literacy/numeracy tasks, and home resources for learning index under different scaling models
Cells for MAT, ET, and HRL show M (SE).

| Country | N (ST) | N (SL) | MAT | ET | HRL RP | HRL M1 | HRL M2 | HRL M3 | HRL M4 |
|---|---|---|---|---|---|---|---|---|---|
| Azerbaijan | 4871 | 169 | 458.1 (6.05) | 9.5 (0.08) | 9.0 (0.03) | −0.7 (0.02) | −0.7 (0.02) | −0.6 (0.02) | −0.7 (0.01) |
| Australia | 5943 | 280 | 514.2 (2.94) | 9.4 (0.03) | 11.2 (0.03) | 0.6 (0.02) | 0.6 (0.02) | 0.6 (0.02) | 0.5 (0.02) |
| Austria | 4587 | 158 | 505.2 (2.65) | 9.3 (0.03) | 10.5 (0.05) | 0.3 (0.03) | 0.2 (0.03) | 0.3 (0.03) | 0.1 (0.03) |
| Chinese Taipei | 4265 | 150 | 590.3 (1.86) | 11.2 (0.02) | 10.3 (0.05) | 0.2 (0.03) | 0.1 (0.03) | 0.2 (0.03) | −0.0 (0.03) |
| Croatia | 4545 | 152 | 486.1 (1.89) | 10.5 (0.03) | 10.0 (0.04) | −0.0 (0.02) | −0.2 (0.02) | −0.0 (0.02) | −0.2 (0.03) |
| Czech Republic | 4433 | 177 | 508.3 (2.48) | 9.9 (0.03) | 10.6 (0.04) | 0.3 (0.02) | 0.2 (0.02) | 0.3 (0.02) | 0.1 (0.03) |
| Finland | 4541 | 145 | 543.5 (2.33) | 10.4 (0.04) | 11.2 (0.03) | 0.7 (0.02) | 0.6 (0.02) | 0.6 (0.02) | 0.5 (0.02) |
| Georgia | 4774 | 173 | 444.2 (3.54) | 9.8 (0.05) | 10.1 (0.05) | −0.0 (0.03) | −0.1 (0.03) | 0.0 (0.03) | −0.2 (0.03) |
| Germany | 3928 | 197 | 524.8 (2.21) | 9.5 (0.03) | 10.5 (0.05) | 0.3 (0.03) | 0.2 (0.03) | 0.3 (0.03) | −0.0 (0.03) |
| Honduras | 3830 | 147 | 388.7 (5.71) | 10.7 (0.05) | 8.1 (0.07) | −1.1 (0.04) | −0.7 (0.02) | −1.1 (0.04) | −0.7 (0.02) |
| Hungary | 5149 | 149 | 512.4 (3.40) | 9.3 (0.03) | 10.1 (0.07) | 0.1 (0.04) | −0.0 (0.04) | 0.0 (0.04) | −0.2 (0.04) |
| Iran, Islamic Rep. of | 5734 | 244 | 423.5 (3.53) | 9.6 (0.06) | 8.7 (0.07) | −0.7 (0.04) | −0.6 (0.03) | −0.7 (0.04) | −0.6 (0.03) |
| Ireland | 4383 | 150 | 525.8 (2.82) | 9.4 (0.03) | 10.8 (0.05) | 0.4 (0.03) | 0.3 (0.03) | 0.4 (0.03) | 0.3 (0.03) |
| Italy | 4125 | 202 | 503.8 (2.71) | 9.2 (0.02) | 9.9 (0.04) | −0.1 (0.02) | −0.2 (0.02) | −0.1 (0.02) | −0.3 (0.02) |
| Lithuania | 4584 | 154 | 531.5 (2.63) | 10.1 (0.04) | 10.1 (0.04) | 0.0 (0.02) | −0.1 (0.02) | 0.0 (0.02) | −0.1 (0.02) |
| Malta | 3492 | 96 | 491.8 (1.27) | 10.2 (0.03) | 10.3 (0.02) | 0.2 (0.01) | 0.1 (0.01) | 0.1 (0.01) | −0.4 (0.01) |
| Morocco | 7614 | 284 | 321.9 (4.02) | 9.7 (0.08) | 8.2 (0.04) | −1.0 (0.02) | −0.8 (0.02) | −1.0 (0.02) | −0.8 (0.02) |
| Oman | 10,237 | 327 | 376.5 (2.94) | 10.6 (0.03) | 9.2 (0.03) | −0.5 (0.02) | −0.6 (0.01) | −0.5 (0.02) | −0.6 (0.01) |
| Poland | 4962 | 150 | 477.0 (2.26) | 9.9 (0.04) | 10.1 (0.05) | 0.1 (0.03) | −0.1 (0.03) | 0.0 (0.03) | −0.2 (0.03) |
| Qatar | 4104 | 166 | 406.4 (3.46) | 10.8 (0.03) | 10.4 (0.04) | 0.2 (0.02) | 0.1 (0.02) | 0.2 (0.02) | 0.3 (0.02) |
| Romania | 4643 | 148 | 476.4 (6.01) | 9.6 (0.10) | 9.2 (0.06) | −0.5 (0.03) | −0.5 (0.03) | −0.5 (0.03) | −0.5 (0.03) |
| Saudi Arabia | 4470 | 171 | 403.0 (5.31) | 10.6 (0.08) | 9.5 (0.06) | −0.4 (0.03) | −0.4 (0.03) | −0.3 (0.03) | −0.4 (0.03) |
| Singapore | 6208 | 176 | 606.4 (3.35) | 11.3 (0.04) | 10.8 (0.03) | 0.4 (0.02) | 0.3 (0.02) | 0.4 (0.02) | 0.3 (0.02) |
| Slovak Republic | 5561 | 197 | 503.1 (3.93) | 9.0 (0.04) | 10.1 (0.05) | 0.1 (0.03) | −0.1 (0.03) | 0.1 (0.03) | −0.1 (0.03) |
| Slovenia | 4433 | 195 | 509.8 (1.99) | 9.3 (0.03) | 10.5 (0.03) | 0.3 (0.02) | 0.2 (0.02) | 0.3 (0.02) | 0.1 (0.02) |
| Spain | 4105 | 151 | 478.8 (2.81) | 10.6 (0.04) | 10.3 (0.05) | 0.2 (0.03) | 0.1 (0.03) | 0.2 (0.03) | −0.1 (0.03) |
| Sweden | 4482 | 152 | 502.0 (2.20) | 10.3 (0.04) | 11.2 (0.04) | 0.7 (0.02) | 0.6 (0.02) | 0.7 (0.02) | 0.5 (0.02) |
| Abu Dhabi, UAE | 4100 | 164 | 409.7 (4.91) | 10.5 (0.04) | 10.1 (0.06) | −0.0 (0.03) | −0.1 (0.03) | 0.0 (0.03) | 0.1 (0.03) |
Correlations between the different HRL indices and mathematics achievement of fourth-grade students (average values across countries)

| Variable | M1 | M2 | M3 | M4 | MAT |
|---|---|---|---|---|---|
| RP | 0.99 | 0.99 | 1.00 | 0.97 | 0.41 |
| M1 | | 1.00 | 0.99 | 0.95 | 0.40 |
| M2 | | | 0.99 | 0.95 | 0.40 |
| M3 | | | | 0.97 | 0.41 |
| M4 | | | | | 0.40 |
When we compared the average values on the different HRL indices across the scaling models, we observed, on average, only small changes between the different indices per country. However, some noteworthy exceptions were apparent. These included changes of around 0.3 points for Germany, Honduras, Hungary, and Poland. Hence, for these countries, the influence of the scaling model on the average HRL indices was approximately one-third of a standard deviation of this index. For Malta, the influence of the scaling model on the average HRL indices was even more pronounced, at approximately two-thirds of a standard deviation of the HRL index.
Scaling models
Model fit statistics for the partial credit model of the HRL index

| Fit statistic | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|
| Log likelihood | −90,588.33 | −89,914.61 | −90,360.41 | −88,523.81 |
| AIC (smaller is better) | 181,212.66 | 179,919.22 | 180,764.82 | 177,361.61 |
| BIC (smaller is better) | 181,389.71 | 180,361.83 | 180,981.21 | 178,905.83 |
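The relationship between the three fit statistics can be checked with a short Python sketch based on the usual definitions AIC = −2 log L + 2k and BIC = −2 log L + k ln(n). The parameter counts k and the sample size n used below are assumptions: they are not stated in the text and were chosen because they reproduce the reported AIC and BIC values to within rounding (n equals the 138,103 cases given with the item table). The log-likelihoods are copied from the fit table.

```python
import math

# Log-likelihoods as reported in the fit table.
LOG_LIK = {1: -90588.33, 2: -89914.61, 3: -90360.41, 4: -88523.81}

# Assumptions (not stated in the text): parameter counts inferred from the
# reported AIC values, and the sample size taken from the item table.
N_PARAMS = {1: 18, 2: 45, 3: 22, 4: 157}
N_OBS = 138_103


def aic(log_lik: float, k: int) -> float:
    """Akaike information criterion."""
    return -2.0 * log_lik + 2.0 * k


def bic(log_lik: float, k: int, n: int) -> float:
    """Bayesian (Schwarz) information criterion."""
    return -2.0 * log_lik + k * math.log(n)


aic_values = {m: aic(LOG_LIK[m], N_PARAMS[m]) for m in LOG_LIK}
bic_values = {m: bic(LOG_LIK[m], N_PARAMS[m], N_OBS) for m in LOG_LIK}

best_by_aic = min(aic_values, key=aic_values.get)
best_by_bic = min(bic_values, key=bic_values.get)

for m in sorted(LOG_LIK):
    print(f"Model {m}: AIC = {aic_values[m]:.2f}, BIC = {bic_values[m]:.2f}")
print(f"Lowest AIC: Model {best_by_aic}; lowest BIC: Model {best_by_bic}")
```

Under these assumptions both criteria favor Model 4, even though BIC penalizes its much larger number of slope parameters more heavily than AIC does.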
Distribution of slope parameters \(c_g\), \(c_i\) and \(c_{gi}\) for the indicators of the HRL index
Item  Country  

1  2  3  4  5  6  7  8  9  10  
\(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  
Est.  SE  Est.  SE  Est.  SE  Est.  SE  Est.  SE  Est.  SE  Est.  SE  Est.  SE  Est.  SE  Est.  SE  
ASBG04  1.08  0.10  3.09  0.20  0.99  0.10  1.86  0.12  1.32  0.11  1.20  0.11  1.58  0.12  1.15  0.11  1.05  0.11  1.40  0.12 
ASDG05  0.92  0.14  2.23  0.14  0.52  0.11  2.68  0.16  0.91  0.15  0.50  0.12  0.47  0.11  0.21  0.10  0.42  0.12  1.27  0.13 
ASBH15  1.58  0.12  3.43  0.23  0.97  0.10  2.42  0.16  1.90  0.13  1.41  0.12  1.90  0.14  1.25  0.11  0.98  0.10  1.42  0.12 
ASDHED  1.98  0.16  6.15  0.30  2.20  0.14  0.78  0.11  1.40  0.13  1.14  0.12  1.63  0.12  1.33  0.11  1.20  0.12  0.97  0.11 
ASDHOC  1.86  0.17  2.55  0.16  1.83  0.15  2.73  0.20  1.27  0.13  1.17  0.13  1.78  0.16  1.77  0.15  1.79  0.17  2.38  0.21 
\(c_g\)  1.47  0.07  3.05  0.12  1.14  0.06  1.83  0.08  1.43  0.07  1.07  0.06  1.40  0.07  1.02  0.06  1.03  0.06  1.35  0.06 
Item  Country  

11  12  13  14  15  16  17  18  19  20  
\(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  \(c_{gi}\)  
Est.  SE  Est.  SE  Est.  SE  Est.  SE  Est.  SE  Est.  SE  Est.  SE  Est.  SE  Est.  SE  Est.  SE  
ASBG04  1.06  0.10  1.38  0.11  2.38  0.14  1.21  0.11  1.03  0.10  0.50  0.10  3.00  0.17  1.13  0.11  0.89  0.09  0.13  0.11 
ASDG05  0.40  0.11  0.66  0.11  2.16  0.13  0.56  0.11  0.82  0.12  0.12  0.10  2.89  0.15  1.94  0.14  0.78  0.11  0.21  0.11 
ASBH15  1.03  0.11  1.40  0.11  2.42  0.15  1.53  0.12  1.22  0.11  0.02  0.11  2.92  0.17  1.95  0.15  0.89  0.09  0.08  0.12 
ASDHED  2.84  0.19  4.27  0.24  4.22  0.21  2.12  0.16  1.46  0.12  3.15  0.17  5.01  0.23  3.44  0.19  3.09  0.17  3.83  0.28 
ASDHOC  1.34  0.12  2.46  0.18  2.69  0.18  2.03  0.18  2.27  0.19  1.86  0.14  2.17  0.14  1.67  0.13  2.58  0.19  2.07  0.18 
\(c_g\)  1.26  0.06  1.74  0.08  2.54  0.10  1.44  0.07  1.20  0.06  1.00  0.06  3.00  0.09  1.83  0.08  1.37  0.06  1.06  0.06 
Item  21: Est.  21: SE  22: Est.  22: SE  23: Est.  23: SE  24: Est.  24: SE  25: Est.  25: SE  26: Est.  26: SE  27: Est.  27: SE  28: Est.  28: SE  \(c_{i}\): Est.  \(c_{i}\): SE  
ASBG04  2.02  0.13  1.31  0.12  0.95  0.10  1.24  0.11  0.86  0.10  0.98  0.10  1.28  0.10  0.48  0.12  1.48  0.03  
ASDG05  1.54  0.12  1.22  0.12  0.27  0.10  0.88  0.12  0.30  0.11  0.34  0.10  1.35  0.17  0.15  0.11  1.02  0.03  
ASBH15  2.13  0.14  1.64  0.16  1.25  0.11  1.14  0.10  1.03  0.10  1.03  0.10  1.89  0.13  0.59  0.14  1.65  0.04  
ASDHED  2.82  0.15  3.44  0.21  2.29  0.18  1.76  0.12  1.38  0.12  3.32  0.20  1.87  0.15  3.25  0.24  1.89  0.04  
ASDHOC  3.15  0.23  1.38  0.12  2.01  0.18  2.04  0.17  1.81  0.16  1.95  0.16  2.10  0.19  1.75  0.16  1.79  0.04  
\(c_g\)  2.04  0.08  1.62  0.07  1.24  0.06  1.28  0.06  1.00  0.06  1.31  0.06  1.68  0.08  1.16  0.06  1.55  0.01 
With regard to the assumption that the contribution of the HRL items to the HRL index would vary while the influence of the items remained constant across countries (\(c_{i}\)), we found that the indicator “number of home study supports” was least informative with respect to the measured construct. This result supports the findings from the descriptive statistics: having a connection to the internet and/or one’s own room at home seems to have been the standard rather than the exception for the fourth-grade students both within and across the countries participating in PIRLS/TIMSS 2011. The educational status of the students’ parents best explained the differences in the HRL index. The contrast between parents’ educational status and number of home study supports increased when the country-specific measurement models (\(c_{gi}\)) were assumed (Model 4). In this case, parents’ highest educational level contributed to the HRL index in most countries approximately two to four times more than the number of home study supports did. This finding suggests that the original HRL index overestimated the influence of all indicators, with the exception of “highest level of education of either parent” (the influence of which, in turn, was underestimated).
Variance of the discrimination parameter \(c_{gi}\) across countries (given item i), \(\chi ^2\) value and asymmetric confidence interval (\(CI_l\) lower bound, \(CI_u\) upper bound; items ordered in descending order of \(s^{2}_{gii}\))
Item  \(s^{2}_{gii}\)  \(\chi ^{2}\)  \(CI_{l}\)  \(CI_{u}\)  
ASDHED  1.75  11,800.07  1.09  3.24  
ASDG05  0.63  4285.24  0.40  1.18  
ASBH15  0.58  3896.91  0.36  1.07  
ASBG04  0.44  2966.86  0.27  0.81  
ASDHOC  0.22  1498.92  0.14  0.41  \(\blacktriangle\) 
Variance of the discrimination parameter \(c_{gi}\) across items (given country g), \(\chi ^2\) value and asymmetric confidence interval (\(CI_l\) lower bound, \(CI_u\) upper bound; countries ordered in descending order of \(s^2_{gig}\))
Country  \(s^2_{gig}\)  \(\chi ^2\)  \(CI_l\)  \(CI_u\)  
Qatar  2.75  2748.28  0.99  22.69  
Australia  2.42  2424.20  0.87  20.02  
Iran, Islamic Rep. of  1.98  1975.20  0.71  16.31  
Malta  1.97  1971.18  0.71  16.28  
Abu Dhabi, UAE  1.62  1617.25  0.58  13.35  
Spain  1.34  1337.09  0.48  11.04  
Poland  1.21  1208.22  0.43  9.98  
Morocco  1.13  1132.15  0.41  9.35  
Saudi Arabia  0.88  878.82  0.32  7.26  
Hungary  0.83  826.57  0.30  6.83  
Oman  0.73  734.58  0.26  6.07  
Ireland  0.69  690.84  0.25  5.70  
Singapore  0.66  663.00  0.24  5.47  
Chinese Taipei  0.66  660.04  0.24  5.45  
Austria  0.48  476.37  0.17  3.93  
Romania  0.42  415.65  0.15  3.43  \(\blacktriangle\) 
Italy  0.41  409.38  0.15  3.38  \(\blacktriangle\) 
Finland  0.33  330.66  0.12  2.73  \(\blacktriangle \triangle\) 
Georgia  0.33  327.40  0.12  2.70  \(\blacktriangle \triangle\) 
Slovenia  0.32  319.78  0.11  2.64  \(\blacktriangle \triangle\) 
Lithuania  0.31  314.68  0.11  2.60  \(\blacktriangle \triangle\) 
Honduras  0.28  278.98  0.10  2.30  \(\blacktriangle \triangle \bullet\) 
Germany  0.24  238.93  0.09  1.98  \(\blacktriangle \triangle \bullet\) 
Slovak Republic  0.22  223.86  0.08  1.85  \(\blacktriangle \triangle \bullet\) 
Azerbaijan  0.21  218.44  0.08  1.80  \(\blacktriangle \triangle \bullet\) 
Sweden  0.13  132.81  0.05  1.10  \(\blacktriangle \triangle \bullet \circ\) 
Croatia  0.13  126.27  0.05  1.04  \(\blacktriangle \triangle \bullet \circ\) 
Czech Republic  0.12  118.46  0.05  0.98  \(\blacktriangle \triangle \bullet \circ\) 
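The confidence bounds in the two variance tables above are consistent with the standard chi-square interval for a variance, \((n-1)s^2/\chi^2_{1-\alpha/2}\) to \((n-1)s^2/\chi^2_{\alpha/2}\), with \(n=28\) countries or \(n=5\) items. The following sketch assumes that this is how the bounds were obtained; the function names are illustrative and not taken from the original analysis, and the chi-square quantile is computed with a simple series and bisection so that only the standard library is needed.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower
    incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_ppf(p, df):
    """Chi-square quantile, located by bisection on the CDF."""
    lo, hi = 0.0, max(1.0, float(df))
    while chi2_cdf(hi, df) < p:  # widen the bracket until it contains p
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def variance_ci(s2, n, level=0.95):
    """Asymmetric confidence interval for a variance estimated from n values."""
    df = n - 1
    alpha = 1.0 - level
    lower = df * s2 / chi2_ppf(1.0 - alpha / 2.0, df)
    upper = df * s2 / chi2_ppf(alpha / 2.0, df)
    return lower, upper

# ASDHED: variance of the discrimination parameter across 28 countries
print(variance_ci(1.75, 28))
```

For the ASDHED row (\(s^2=1.75\), \(n=28\)) this gives bounds of roughly 1.09 and 3.24, in line with the tabulated values; small deviations for other rows are to be expected because the tabulated variances are rounded to two decimals.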
Prediction model
When conducting a statistical comparison of the distribution, we used a global F-type statistic in the first step. However, none of the \(G\times z!/(2!(z-2)!)=168\) derived F values (one for each pair of the \(z=4\) scaling models in each of the \(G=28\) countries) were statistically significant. Thus, the overall hypotheses \(\text {H}_{\mathbf{0}}: \varvec{L}_{\varvec{g}}(\varvec{\beta }_{\varvec{gw}}-\varvec{\beta }_{\varvec{gq}})={\mathbf{0}}\) cannot be rejected in any of the cases. This finding corresponds with the invariance of the observed distribution of the fixed effects \(\hat{\beta }_0\), \(\hat{\beta }_1\) and \(\hat{\beta }_2\) across scaling models: when three out of five fixed effects are virtually unaffected by the scaling procedure, no overall effects (as measured by the F-type statistic) can be expected. When we looked more closely at the variance of the estimated fixed effects across scaling models within each country, that is \(s^2_{\hat{\beta }_{jgz.g}}\), we found virtually no variation across the models for the estimated fixed effects \(\hat{\beta }_0\), \(\hat{\beta }_1\), and \(\hat{\beta }_2\). We can therefore assume that this lack of variation explains the results of the F-type statistic.
Distribution of \(\hat{\beta }_3\) across scaling models and countries, \(\chi ^2\) value and asymmetric confidence interval (\(CI_l\) lower bound, \(CI_u\) upper bound; countries ordered in descending order of the conditional variance of \(\hat{\beta }_3\) across scaling models given country g)
Country  \(\hat{\beta }_3\) (Model 1)  \(\hat{\beta }_3\) (Model 2)  \(\hat{\beta }_3\) (Model 3)  \(\hat{\beta }_3\) (Model 4)  \(s^2_{\hat{\beta }_{3gz.g}}\)  \(\chi ^2\)  \(CI_l\)  \(CI_u\)  
Iran, Islamic Rep. of  20.77  26.94  22.35  30.60  19.96  14,972.88  6.41  277.54  
Malta  37.08  32.20  35.90  27.47  18.72  14,038.53  6.01  260.23  
Slovenia  42.95  36.89  43.05  35.31  16.29  12,218.83  5.23  226.49  
Czech Republic  38.89  33.79  38.75  31.75  12.90  9676.91  4.14  179.37  
Abu Dhabi, UAE  18.66  16.57  21.90  24.43  12.07  9050.66  3.87  167.76  
Qatar  29.36  25.81  30.85  33.90  11.29  8467.50  3.62  156.95  
Romania  35.52  41.32  36.55  42.18  11.18  8383.35  3.59  155.39  
Austria  39.15  34.52  39.11  33.51  8.89  6669.48  2.85  123.63  
Germany  32.59  30.02  31.43  25.93  8.42  6313.93  2.70  117.03  
Ireland  43.67  42.27  41.90  37.96  5.99  4492.46  1.92  83.27  
Morocco  1.60  2.08  3.83  7.00  5.98  4482.99  1.92  83.10  
Slovak Republic  40.91  37.52  41.50  36.85  5.52  4142.23  1.77  76.78  
Oman  33.44  36.40  34.36  38.34  4.78  3583.85  1.53  66.43  
Croatia  25.08  21.49  25.84  22.19  4.54  3406.78  1.46  63.15  
Italy  27.06  23.71  27.00  23.40  4.04  3029.52  1.30  56.16  
Georgia  25.31  23.78  26.93  22.26  4.03  3020.26  1.29  55.98  
Lithuania  26.17  23.48  27.10  24.04  2.95  2210.88  0.95  40.98  
Hungary  44.69  47.21  44.15  47.28  2.71  2033.31  0.87  37.69  
Honduras  −2.39  −3.81  −0.59  −0.36  2.65  1984.88  0.85  36.79  
Singapore  29.05  26.54  29.88  27.58  2.22  1664.81  0.71  30.86  
Azerbaijan  22.50  24.54  22.70  25.43  2.04  1530.08  0.65  28.36  \(\blacktriangle\) 
Poland  37.16  35.03  36.58  34.20  1.86  1397.34  0.60  25.90  \(\blacktriangle \triangle\) 
Spain  26.46  24.59  26.00  23.60  1.71  1285.60  0.55  23.83  \(\blacktriangle \triangle \bullet\) 
Australia  40.50  39.20  40.92  38.52  1.25  935.29  0.40  17.34  \(\blacktriangle \triangle \bullet \circ\) 
Saudi Arabia  11.24  11.27  12.91  13.28  1.15  865.93  0.37  16.05  \(\blacktriangle \triangle \bullet \circ \blacktriangleleft\) 
Chinese Taipei  28.56  27.29  29.20  28.07  0.65  487.64  0.21  9.04  \(\blacktriangle \triangle \bullet \circ \blacktriangleleft \triangleleft\) 
Finland  25.44  24.89  25.37  23.78  0.59  440.63  0.19  8.17  \(\blacktriangle \triangle \bullet \circ \blacktriangleleft \triangleleft\) 
Sweden  28.58  29.69  28.26  28.55  0.40  299.94  0.13  5.56  \(\blacktriangle \triangle \bullet \circ \blacktriangleleft \triangleleft\) 
\(s^2_{\hat{\beta }_{3gz.z}}\)  135.82  129.56  119.69  102.71  
\(CI_l\)  84.90  80.99  74.82  64.20  
\(CI_u\)  251.64  240.04  221.75  190.28 
Overall, the variance in the estimated fixed effect \(\hat{\beta }_3\) across countries (with the scaling model held constant) decreased from \(s^2_{\hat{\beta }_{3g1.1}}=135.82\) to \(s^2_{\hat{\beta }_{3g4.4}}=102.71\) when we used the country-specific measurement models for the HRL index instead of the measurement invariance model. The differences across the countries in the observed association between the HRL index on the individual level and mathematics achievement were thus reduced by approximately 25% when non-invariance models were used to scale the HRL index. However, for some countries (Chinese Taipei, Finland, Sweden), the influence of the scaling model on the estimated fixed effects \(\hat{\beta }_3\) was very low. This finding was not surprising because the country-specific measurement model for these countries strongly agreed with the measurement invariance model (with the exception of the indicator “number of home study supports”). As such, little variation between the fixed effects was to be expected.
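The size of this reduction can be recomputed from the bottom row of the \(\hat{\beta }_3\) table. The short sketch below does so for all four scaling models, taking Model 1 as the baseline; the helper name `relative_reduction` is illustrative only.

```python
def relative_reduction(baseline, alternative):
    """Share by which `alternative` falls below `baseline`."""
    return (baseline - alternative) / baseline

# Cross-national variance of beta_3 under scaling Models 1 to 4 (from the table)
var_beta3 = {1: 135.82, 2: 129.56, 3: 119.69, 4: 102.71}

reductions = {
    model: relative_reduction(var_beta3[1], value)
    for model, value in var_beta3.items()
}

for model, share in reductions.items():
    print(f"Model {model}: {100 * share:.1f}% below Model 1")
```

For Model 4 the computed reduction is about 24%, and the intermediate Models 2 and 3 fall between no reduction and that value.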
Distribution of \(\hat{\beta }_4\) across scaling models and countries, \(\chi ^2\) value and asymmetric confidence interval (\(CI_l\) lower bound, \(CI_u\) upper bound; countries ordered in descending order of the conditional variance of \(\hat{\beta }_4\) across scaling models given country g)
Country  \(\hat{\beta }_4\) (Model 1)  \(\hat{\beta }_4\) (Model 2)  \(\hat{\beta }_4\) (Model 3)  \(\hat{\beta }_4\) (Model 4)  \(s^2_{\hat{\beta }_{4gz.g}}\)  \(\chi ^2\)  \(CI_l\)  \(CI_u\)  
Morocco  58.63  81.35  63.61  95.99  292.81  219,608.17  93.97  4070.68  
Honduras  59.33  84.15  61.26  90.95  255.90  191,922.86  82.12  3557.50  
Iran, Islamic Rep. of  67.99  89.82  67.52  90.78  169.56  127,170.62  54.41  2357.25  
Qatar  158.42  138.73  150.80  138.54  94.71  71,031.94  30.39  1316.65  
Malta  73.36  63.28  67.27  55.85  53.95  40,465.24  17.31  750.07  
Czech Republic  79.55  69.36  78.16  64.74  50.34  37,756.23  16.16  699.85  
Romania  59.73  69.87  60.93  72.69  41.54  31,153.02  13.33  577.46  
Abu Dhabi, UAE  92.34  82.41  93.06  94.07  29.40  22,047.72  9.43  408.68  \(\blacktriangle\) 
Croatia  55.14  47.33  52.95  44.27  25.01  18,753.96  8.02  347.62  \(\blacktriangle\) 
Austria  61.39  54.21  59.05  50.74  22.92  17,191.50  7.36  318.66  \(\blacktriangle\) 
Slovenia  57.10  49.43  56.50  47.98  22.23  16,669.13  7.13  308.98  \(\blacktriangle\) 
Hungary  79.65  84.35  77.74  87.68  20.32  15,242.15  6.52  282.53  \(\blacktriangle\) 
Germany  71.93  66.12  70.09  62.12  19.05  14,287.74  6.11  264.84  \(\blacktriangle\) 
Singapore  58.35  53.59  56.26  50.03  12.90  9674.75  4.14  179.33  \(\blacktriangle \triangle\) 
Italy  51.74  45.44  50.29  44.61  12.43  9322.94  3.99  172.81  \(\blacktriangle \triangle\) 
Azerbaijan  42.30  45.40  43.83  49.96  10.96  8218.86  3.52  152.35  \(\blacktriangle \triangle\) 
Lithuania  48.05  43.30  46.25  40.65  10.65  7988.31  3.42  148.07  \(\blacktriangle \triangle\) 
Slovak Republic  53.89  49.75  54.01  48.02  9.04  6781.11  2.90  125.70  \(\blacktriangle \triangle \bullet\) 
Oman  53.24  57.78  52.77  53.87  5.23  3925.58  1.68  72.76  \(\blacktriangle \triangle \bullet \circ\) 
Spain  47.14  43.82  44.89  41.87  4.83  3622.23  1.55  67.14  \(\blacktriangle \triangle \bullet \circ\) 
Ireland  66.86  64.70  65.21  61.93  4.20  3148.12  1.35  58.35  \(\blacktriangle \triangle \bullet \circ \blacktriangleleft\) 
Georgia  47.75  44.71  49.56  48.01  4.12  3086.43  1.32  57.21  \(\blacktriangle \triangle \bullet \circ \blacktriangleleft\) 
Saudi Arabia  32.06  32.66  35.73  34.77  3.00  2253.27  0.96  41.77  \(\blacktriangle \triangle \bullet \circ \blacktriangleleft \triangleleft\) 
Australia  99.67  96.50  99.84  97.20  2.90  2176.08  0.93  40.34  \(\blacktriangle \triangle \bullet \circ \blacktriangleleft \triangleleft\) 
Poland  49.70  46.89  47.99  46.38  2.16  1619.93  0.69  30.03  \(\blacktriangle \triangle \bullet \circ \blacktriangleleft \triangleleft\) 
Chinese Taipei  50.96  48.85  49.95  48.21  1.47  1102.95  0.47  20.44  \(\blacktriangle \triangle \bullet \circ \blacktriangleleft \triangleleft\) 
Sweden  56.36  58.75  56.76  57.87  1.17  880.25  0.38  16.32  \(\blacktriangle \triangle \bullet \circ \blacktriangleleft \triangleleft\) 
Finland  38.91  38.08  38.31  36.58  0.98  736.06  0.31  13.64  \(\blacktriangle \triangle \bullet \circ \blacktriangleleft \triangleleft\) 
\(s^2_{\hat{\beta }_{4gz.z}}\)  577.60  511.62  518.80  597.73  
\(CI_l\)  361.05  319.80  324.29  373.63  
\(CI_u\)  1070.11  947.88  961.18  1107.41 
Distribution of random effects \(\varvec{G}\) across scaling models and countries (Part I)
Country  Effect  Model 1 \(\hat{\alpha }_{0}\)  Model 1 \(\hat{\alpha }_{1}\)  Model 2 \(\hat{\alpha }_{0}\)  Model 2 \(\hat{\alpha }_{1}\)  Model 3 \(\hat{\alpha }_{0}\)  Model 3 \(\hat{\alpha }_{1}\)  Model 4 \(\hat{\alpha }_{0}\)  Model 4 \(\hat{\alpha }_{1}\)  
Azerbaijan  \(\hat{\alpha }_{0}\)  5248.25  153.86  5244.57  162.60  5226.53  74.15  5248.85  291.87 
\(\hat{\alpha }_{1}\)  153.86  443.62  162.60  549.16  74.15  417.92  291.87  650.26  
Australia  \(\hat{\alpha }_{0}\)  590.11  24.08  588.47  25.68  592.42  −4.31  587.72  10.40 
\(\hat{\alpha }_{1}\)  24.08  475.07  25.68  592.42  −4.31  428.68  10.40  415.55  
Austria  \(\hat{\alpha }_{0}\)  407.73  −6.63  402.97  −7.05  423.97  −3.26  415.20  −7.77 
\(\hat{\alpha }_{1}\)  −6.63  51.65  −7.05  35.44  −3.26  47.70  −7.77  30.53  
Chinese Taipei  \(\hat{\alpha }_{0}\)  176.44  −47.49  176.16  −46.11  171.42  −47.08  166.76  −42.88 
\(\hat{\alpha }_{1}\)  −47.49  74.59  −46.11  69.47  −47.08  77.28  −42.88  71.36  
Croatia  \(\hat{\alpha }_{0}\)  212.37  −3.58  210.72  −3.33  211.57  −5.62  211.39  −6.56 
\(\hat{\alpha }_{1}\)  −3.58  55.29  −3.33  43.16  −5.62  55.03  −6.56  40.84  
Czech Republic  \(\hat{\alpha }_{0}\)  244.49  −74.17  244.84  −63.68  244.47  −82.20  251.91  −86.89 
\(\hat{\alpha }_{1}\)  −74.17  344.00  −63.68  260.64  −82.20  361.47  −86.89  280.36  
Finland  \(\hat{\alpha }_{0}\)  350.47  −49.36  350.52  −48.31  347.46  −41.26  344.18  −34.64 
\(\hat{\alpha }_{1}\)  −49.36  66.14  −48.31  63.19  −41.26  62.60  −34.64  64.96  
Georgia  \(\hat{\alpha }_{0}\)  2429.86  −279.28  2432.33  −259.49  2428.40  −270.16  2420.54  −264.22 
\(\hat{\alpha }_{1}\)  −279.28  376.21  −259.49  329.27  −270.16  370.75  −264.22  312.49  
Germany  \(\hat{\alpha }_{0}\)  416.96  47.33  414.62  43.03  423.58  36.03  475.08  15.70 
\(\hat{\alpha }_{1}\)  47.33  73.62  43.03  60.61  36.03  68.99  15.70  45.10  
Honduras  \(\hat{\alpha }_{0}\)  2131.60  293.87  2181.93  413.57  2096.88  239.06  2164.52  261.29 
\(\hat{\alpha }_{1}\)  293.87  372.70  413.57  852.45  239.06  297.90  261.29  716.51  
Hungary  \(\hat{\alpha }_{0}\)  578.70  −190.88  578.38  −198.93  584.08  −184.64  672.32  −168.77 
\(\hat{\alpha }_{1}\)  −190.88  183.69  −198.93  206.38  −184.64  174.82  −168.77  257.79  
Iran, Islamic Rep. of  \(\hat{\alpha }_{0}\)  1536.11  73.79  1559.85  109.15  1564.47  47.99  1572.80  37.30 
\(\hat{\alpha }_{1}\)  73.79  275.83  109.15  514.43  47.99  242.12  37.30  446.03  
Ireland  \(\hat{\alpha }_{0}\)  643.26  −56.05  643.50  −55.20  650.27  −55.71  664.92  −50.11 
\(\hat{\alpha }_{1}\)  −56.05  51.56  −55.20  48.60  −55.71  47.17  −50.11  45.44  
Italy  \(\hat{\alpha }_{0}\)  1346.42  −112.52  1349.74  −98.70  1339.77  −114.57  1339.30  −108.21 
\(\hat{\alpha }_{1}\)  −112.52  267.16  −98.70  206.35  −114.57  245.99  −108.21  170.05  
Lithuania  \(\hat{\alpha }_{0}\)  334.13  −63.92  333.83  −56.78  337.52  −69.82  346.11  −72.32 
\(\hat{\alpha }_{1}\)  −63.92  278.92  −56.78  226.04  −69.82  277.76  −72.32  222.45  
Malta  \(\hat{\alpha }_{0}\)  468.51  41.21  472.58  30.92  470.84  34.23  518.49  2.89 
\(\hat{\alpha }_{1}\)  41.21  29.89  30.92  19.98  34.23  30.08  2.89  8.49 
Distribution of random effects \(\varvec{G}\) across scaling models and countries (Part II)
Country  Effect  Model 1 \(\hat{\alpha }_{0}\)  Model 1 \(\hat{\alpha }_{1}\)  Model 2 \(\hat{\alpha }_{0}\)  Model 2 \(\hat{\alpha }_{1}\)  Model 3 \(\hat{\alpha }_{0}\)  Model 3 \(\hat{\alpha }_{1}\)  Model 4 \(\hat{\alpha }_{0}\)  Model 4 \(\hat{\alpha }_{1}\)  
Morocco  \(\hat{\alpha }_{0}\)  4358.44  85.63  4433.22  104.05  4314.43  53.49  4311.60  −97.25 
\(\hat{\alpha }_{1}\)  85.63  432.09  104.05  959.00  53.49  392.12  −97.25  825.05  
Oman  \(\hat{\alpha }_{0}\)  2187.73  17.59  2186.26  15.53  2176.12  −38.90  2217.85  −108.31 
\(\hat{\alpha }_{1}\)  17.59  319.17  15.53  377.71  −38.90  294.01  −108.31  349.03  
Poland  \(\hat{\alpha }_{0}\)  322.04  −40.50  322.36  −37.82  321.52  −44.02  317.37  −36.29 
\(\hat{\alpha }_{1}\)  −40.50  42.59  −37.82  39.27  −44.02  40.80  −36.29  37.58  
Qatar  \(\hat{\alpha }_{0}\)  1823.03  124.61  1824.27  107.88  1843.10  124.99  2290.76  154.27 
\(\hat{\alpha }_{1}\)  124.61  217.05  107.88  169.92  124.99  186.00  154.27  181.57  
Romania  \(\hat{\alpha }_{0}\)  3132.69  −574.41  3133.08  −652.34  3119.28  −558.05  3138.69  −577.43 
\(\hat{\alpha }_{1}\)  −574.41  863.19  −652.34  1159.58  −558.05  818.38  −577.43  1098.26  
Saudi Arabia  \(\hat{\alpha }_{0}\)  3929.55  −146.25  3930.06  −147.93  3891.32  −174.19  3900.55  −180.56 
\(\hat{\alpha }_{1}\)  −146.25  379.72  −147.93  392.95  −174.19  340.87  −180.56  290.76  
Singapore  \(\hat{\alpha }_{0}\)  275.71  −9.82  273.85  −10.80  275.53  −16.42  281.66  −19.49 
\(\hat{\alpha }_{1}\)  −9.82  134.89  −10.80  112.66  −16.42  123.45  −19.49  86.69  
Slovak Republic  \(\hat{\alpha }_{0}\)  1359.48  −314.15  1353.27  −287.64  1355.48  −318.11  1381.23  −298.18 
\(\hat{\alpha }_{1}\)  −314.15  308.61  −287.64  256.59  −318.11  304.41  −298.18  250.48  
Slovenia  \(\hat{\alpha }_{0}\)  234.97  −33.21  234.17  −32.55  232.33  −43.19  234.77  −61.34 
\(\hat{\alpha }_{1}\)  −33.21  123.54  −32.55  91.74  −43.19  126.89  −61.34  98.47  
Spain  \(\hat{\alpha }_{0}\)  454.26  −8.45  453.99  −7.85  459.55  −9.76  494.53  −8.72 
\(\hat{\alpha }_{1}\)  −8.45  94.17  −7.85  80.19  −9.76  81.59  −8.72  74.55  
Sweden  \(\hat{\alpha }_{0}\)  134.78  −11.58  134.80  −11.66  135.76  −14.16  139.87  −15.65 
\(\hat{\alpha }_{1}\)  −11.58  159.41  −11.66  172.39  −14.16  167.74  −15.65  163.90  
Abu Dhabi, UAE  \(\hat{\alpha }_{0}\)  2138.29  46.38  2136.88  43.58  2006.65  28.57  1795.27  −51.16 
\(\hat{\alpha }_{1}\)  46.38  210.85  43.58  170.91  28.57  219.04  −51.16  221.58 
Fit values for equality test of \(\varvec{G}_{\varvec{gz}}\) across scaling models z given country g
Country  \(\chi ^2\)  df  p  SRMR  GFI  RMSEA  \(CI_l\)  \(CI_u\)  
Azerbaijan  323.41  9  <0.0001  0.11  0.98  0.08  0.08  0.09 
Australia  42.21  9  <0.0001  0.03  1.00  0.02  0.02  0.03 
Austria  426.86  9  <0.0001  0.13  0.98  0.10  0.09  0.11 
Chinese Taipei  21.22  9  0.01  0.03  1.00  0.02  0.01  0.03 
Croatia  177.11  9  <0.0001  0.08  0.99  0.06  0.06  0.07 
Czech Republic  195.84  9  <0.0001  0.08  0.99  0.07  0.06  0.08 
Finland  37.48  9  <0.0001  0.03  1.00  0.03  0.02  0.04 
Georgia  64.84  9  <0.0001  0.05  1.00  0.04  0.03  0.04 
Germany  355.66  9  <0.0001  0.13  0.98  0.10  0.10  0.11 
Honduras  1561.82  9  <0.0001  0.32  0.90  0.21  0.20  0.22 
Hungary  817.51  9  <0.0001  0.09  0.96  0.13  0.12  0.14 
Iran, Islamic Rep. of  1157.06  9  <0.0001  0.21  0.95  0.15  0.14  0.16 
Ireland  24.14  9  0.004  0.03  1.00  0.02  0.01  0.03 
Italy  259.85  9  <0.0001  0.11  0.98  0.08  0.07  0.09 
Lithuania  125.05  9  <0.0001  0.07  0.99  0.05  0.05  0.06 
Malta  1701.07  9  <0.0001  0.49  0.87  0.23  0.22  0.24 
Morocco  2346.54  9  <0.0001  0.27  0.92  0.18  0.18  0.19 
Oman  324.51  9  <0.0001  0.06  0.99  0.06  0.05  0.06 
Poland  31.48  9  0.0002  0.03  1.00  0.02  0.01  0.03 
Qatar  153.41  9  <0.0001  0.08  0.99  0.06  0.05  0.07 
Romania  237.43  9  <0.0001  0.09  0.99  0.07  0.06  0.08 
Saudi Arabia  136.58  9  <0.0001  0.07  0.99  0.05  0.04  0.06 
Singapore  357.01  9  <0.0001  0.11  0.99  0.08  0.07  0.09 
Slovak Republic  128.30  9  <0.0001  0.06  0.99  0.05  0.04  0.06 
Slovenia  354.37  9  <0.0001  0.09  0.98  0.09  0.08  0.10 
Spain  70.70  9  <0.0001  0.05  1.00  0.04  0.03  0.05 
Sweden  11.68  9  0.23  0.02  1.00  0.01  0.00  0.02 
Abu Dhabi, UAE  192.68  9  <0.0001  0.09  0.99  0.07  0.06  0.08 
Discussion
This paper investigated the relationships between different procedures for scaling the “home resources for learning index” (HRL) and the prediction accuracy of this index in explaining the mathematics achievement of the fourth-grade students who participated in IEA’s combined PIRLS/TIMSS survey of 2011. As work by Lüdtke et al. (2011) and van den Heuvel-Panhuizen et al. (2009) has shown, scaling social background indicators into a latent variable enhances the validity of large-scale educational assessment studies. The content validity and the reliability of such an index are usually much higher than those of single indicators. Because both aspects are particularly important within the context of cross-national comparative studies of educational achievement, using a scaled index for PIRLS/TIMSS home environment (social background) variables provides a framework that enables meaningful cross-national comparisons.
While the scaling of the social background indicators into a latent variable is without dispute, and probably without a reasonable alternative, the assumption of measurement invariance evident in scaling the HRL index needs to be challenged. As prior research on the scaling of social background indicators into latent indices in large-scale assessments has shown, assuming a measurement invariance model across countries results in latent variables that are less reliable than those obtained when assuming measurement non-invariance (Caro and Sandoval-Hernandez 2012; Hansson and Gustafsson 2013; Lakin 2012). In our study, rescaling the HRL index with four different measurement models with different degrees of assumed measurement invariance also showed that the measurement non-invariance model fitted the data best. Thus, with respect to our first research question, we can assume that measurement invariance across participating countries for the HRL index does not hold for the Grade 4 students assessed in PIRLS/TIMSS 2011.
From a methodological perspective, we were not surprised to find that our less restrictive model (the measurement non-invariance model) was superior to our more restrictive model (the measurement invariance model) in terms of fit indices. Everything else being equal, a model where the parameters can take on any value will always fit at least as well as a model where some of the parameters are fixed to some value or are subject to constraints. It could be argued that the measurement invariance assumption is merely a practical matter because it makes cross-national comparative studies of educational achievement possible through use of the model that most parsimoniously describes the data yet also describes the data sufficiently well to explain any observed achievement differences. However, viewing this matter from the perspective of predictive validity challenges this argument. Given the general inconsistency of measurement invariance and predictive invariance that Millsap (1995, 1997, 1998, 2007) found, we could expect that the most parsimonious model (the measurement invariance model) for latent variables would affect the ability to compare the prediction coefficients of this latent variable across countries. Accordingly, with regard to the HRL index, we need to establish whether the hierarchical linear model applied by Martin et al. (2013) was sensitive to the assumption of measurement invariance.
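The point about nested models can be illustrated with a deliberately simple stand-in. The sketch below does not use the IRT scaling models of this study; it swaps in a no-intercept least-squares regression with made-up data, and compares a slope fixed to 1.0 (the constrained model) with a freely estimated slope (the unconstrained model).

```python
def sse(xs, ys, slope):
    """Sum of squared errors for the no-intercept model y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys))

def free_slope(xs, ys):
    """Least-squares slope when the parameter is estimated freely."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Made-up illustration data, not taken from PIRLS/TIMSS
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.4, 2.1, 4.6, 5.2]

constrained_sse = sse(xs, ys, 1.0)                   # slope fixed in advance
unconstrained_sse = sse(xs, ys, free_slope(xs, ys))  # slope estimated from data

print(constrained_sse, unconstrained_sse)
```

Because the fixed slope is one of the values the free slope could have taken, the unconstrained error can never exceed the constrained one. A better fit alone therefore says little about whether the additional parameters are needed.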
To investigate that question, we rescaled the HRL index four times, with each scaling allowing a different degree of measurement invariance. We then introduced these indices as predictors in a generalized linear mixed model (GLMM) with mathematics achievement as the dependent variable. Overall, we observed a strong influence of the scaling model on the prediction outcomes of the GLMM. Assuming country-specific measurement models for the HRL index decreased the cross-national variance of the individual effect of the HRL index on student mathematics achievement. The variance across countries of this effect was \(s^2_{\beta _{3gz.z}}=135.82\) for the measurement invariance model. However, this variance dropped to \(s^2_{\beta _{3gz.z}}=102.71\) for the measurement non-invariance model. Accordingly, the cross-national differences of this effect, expressed in terms of the cross-national variance of \(\hat{\beta }_3\), can be reduced by approximately 25% when a measurement non-invariance model is assumed for the HRL index. This finding implies that those countries classified as unequal with respect to this effect when the measurement-invariance assumption applied, that is, Iran (Islamic Rep. of) and Slovenia, would be categorized as equal under the assumption of measurement non-invariance.
The results for the school-level effect of the HRL index were not as conclusive. Although we observed only a small difference in the cross-national variance of this effect when we compared the measurement invariance model with the country-specific and item-specific measurement model (Model 1 vs. Model 4), we found the reduction in variance was substantial when a country-specific (but not an item-specific) measurement model was assumed (Model 2), or when an item-specific (but not a country-specific) measurement model was assumed (Model 3). In both cases, the cross-national variance of the school-level effect of the HRL index was reduced by about 11%. One explanation for these somewhat unpredictable results could be that the four HRL indices were scaled in the same way as in the study by Martin et al. (2013), that is, without taking the multilevel structure of the data into account. Loosely speaking, this possibility implies that the applied scaling procedure “ignored” the between-school part of the HRL index. Further research directed toward differentiating between a level-one measurement invariance assumption and a level-two measurement invariance assumption is needed. Nevertheless, application of the scaling procedure that Martin et al. used will result in school-level prediction effects of the HRL index that are obviously sensitive to the assumed degree of measurement invariance.
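As a check on these percentages, the relative change in the cross-national variance of \(\hat{\beta }_4\) can be worked out from the bottom rows of the \(\hat{\beta }_4\) table, taking Model 1 as the baseline. The symbol \(\Delta_z\) is introduced here only for illustration and is not part of the original analysis.

```latex
\begin{aligned}
\Delta_z &= \frac{s^2_{\hat{\beta}_{4g1.1}} - s^2_{\hat{\beta}_{4gz.z}}}{s^2_{\hat{\beta}_{4g1.1}}}, \\
\Delta_2 &= \frac{577.60 - 511.62}{577.60} \approx 0.114, \\
\Delta_3 &= \frac{577.60 - 518.80}{577.60} \approx 0.102, \\
\Delta_4 &= \frac{577.60 - 597.73}{577.60} \approx -0.035 .
\end{aligned}
```

Models 2 and 3 thus lower the variance by roughly 10 to 11%, whereas Model 4 raises it slightly (by about 3.5%) relative to Model 1.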
Although the effect of the measurement invariance assumption on cross-national comparisons of the fixed effects of the GLMM was the main focus of the present study, we also investigated country-specific differences in the effect of the measurement invariance assumption on the prediction coefficients. We were not surprised to find this effect was not constant across countries. For example, the influence of the measurement model on both the individual and the school-level HRL coefficients was relatively strong in Iran (Islamic Rep. of), Malta, Czech Republic, Abu Dhabi (UAE), Qatar, and Romania, but was relatively weak in Australia, Saudi Arabia, Chinese Taipei, Finland, and Sweden. We can express this point in another way by stating that the regression coefficients for Finland, for example, were relatively robust with respect to the different assumptions about measurement invariance, while the coefficients for Iran (Islamic Rep. of) were very sensitive with respect to the assumed scaling model. The implication of this finding is that even when only the country-specific regression coefficients are of interest, we need to take the assumed degree of measurement invariance into account when interpreting the coefficients.
We were also able to observe the country-specific effects of the measurement invariance assumption on the prediction validity of the GLMM’s random slope coefficients. In most countries, the random variance of this coefficient decreased when a non-invariance model was assumed. Because we can interpret the random coefficient as a measure of the school-specific effect on the relationship between the individual HRL index and mathematics achievement, this decrease implies that, under the non-invariance model, differences between schools are a less suitable way of explaining the relationship between the HRL index and mathematics achievement. Accordingly, under the non-invariance assumption, we can expect that this relationship would be nearly the same in all schools of most of the participating countries, while under the measurement invariance model the relationship between the HRL index and mathematics achievement would vary across these schools. In short, researchers and others may draw completely different conclusions with respect to this effect because the nature of the effect will depend solely on the assumed measurement model.
The important point here is that the results of the hierarchical linear model that Martin et al. (2013) applied are very sensitive to the assumed degree of measurement invariance. According to Millsap’s (1995, 1997, 1998, 2007) findings, this degree of sensitivity can be expected. However, if researchers agree that using latent variables in educational research is sound practice, and if assuming measurement invariance is a necessary requirement for cross-national comparisons of latent variables, it is vital to consider the question of how researchers engaged in large-scale assessment studies can control for these effects or take them into account.
While a comprehensive answer to this question will rely on further research and on more expertise, and although the research agenda of the IEA-ETS Research Institute calls for “a more scientific approach to the development, use and interpretability of background questionnaires” (http://ierinstitute.org/researchagenda.html, Accessed 04 May 2016), we can still offer some general ideas. For example, according to Brennan’s (2001) generalizability theory, the variance in the GLMM coefficients that can be traced back to different assumptions about measurement invariance should be added to the standard errors of these coefficients. With regard to the results of the present study, this advice implies that, for example, the variance of \(s^2_{\hat{\beta }_{3gz.g}}=19.96\) for Iran (Islamic Rep. of) (see Table 8) should be added to the standard error of \(\hat{\beta }_3\). Of course, more reliable estimates of this component are possible if we undertake a more exhaustive analysis where we implement a broader range of possible measurement models and also account for the random sample of students (by, for example, using bootstrapping methods).
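A minimal sketch of this idea follows. It first recomputes the across-model variance of \(\hat{\beta }_3\) for Iran (Islamic Rep. of) from the four tabulated estimates, and then combines it with a sampling standard error. Two assumptions are made here that go beyond the text: the combination is done on the variance scale (squared standard error plus scaling variance), and the sampling standard error of 3.0 is a purely hypothetical placeholder, since the article does not report one in this section.

```python
import math
import statistics

# beta_3 estimates for Iran (Islamic Rep. of) under scaling Models 1 to 4
beta3_iran = [20.77, 26.94, 22.35, 30.60]

# Sample variance (divisor n - 1) across the four scaling models
scaling_variance = statistics.variance(beta3_iran)

def combined_se(sampling_se, extra_variance):
    """Standard error after adding an extra variance component.
    Assumption: components are independent and add on the variance scale."""
    return math.sqrt(sampling_se ** 2 + extra_variance)

hypothetical_se = 3.0  # placeholder value, not reported in the source
total_se = combined_se(hypothetical_se, scaling_variance)

print(round(scaling_variance, 2), round(total_se, 2))
```

The recomputed variance is close to the tabulated 19.96; the small difference arises because the four coefficients are rounded to two decimals. Whatever the true sampling error is, the combined value can only be larger, which is the practical consequence of treating the choice of scaling model as an additional source of uncertainty.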
Another approach that we could use to capture the dependency between measurement invariance and predictive invariance in large-scale assessment studies is the assumption of partial measurement invariance. This approach implies, for example, that measurement invariance across countries can be assumed for only some of the HRL index items and that the parameters of the other items will be left to vary freely across countries. This linking or equating procedure means that the latent variable may still be compared across countries, while the dependency between the measurement invariance and the predictive invariance will decrease (if not vanish). Again, taking the present study as an example, the parameters of the HRL indicators “highest level of education of either parent” and “number of home study supports” would need to vary freely across countries, because these indicators are the ones that exhibit the highest variance in the discrimination parameter across countries (see Table 6). However, as we stated above, more exhaustive analyses are necessary before decisions as concrete as this one can be made. One requirement that would need to be in place before this degree of analysis could be implemented for the HRL is surely that of defining the item sampling space for the HRL. Achieving this requirement, in turn, implies the need to develop a theoretical framework for the HRL index that is coherent, valid and reliable cross-nationally, but whether this aim can be credibly achieved is a moot point.
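The selection step described above can be sketched as follows, using the tabulated variances of the discrimination parameter across countries. The rule of freeing a fixed number of items with the largest variance is an illustrative assumption, not a procedure proposed in the study, and the function name is hypothetical.

```python
# Variance of the discrimination parameter across countries, per item (from the table)
item_variance = {
    "ASDHED": 1.75,
    "ASDG05": 0.63,
    "ASBH15": 0.58,
    "ASBG04": 0.44,
    "ASDHOC": 0.22,
}

def split_items(variances, n_free):
    """Rank items by cross-country variance and free the top n_free.
    Returns (items left free across countries, items held invariant)."""
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_free], ranked[n_free:]

free_items, invariant_items = split_items(item_variance, n_free=2)

print("free across countries:", free_items)
print("held invariant:", invariant_items)
```

With two items freed, the split singles out ASDHED and ASDG05, the two indicators with the largest tabulated variances. The remaining three items would then serve as the common anchor for comparing the latent variable across countries.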
Limitations of the present study
Although our study is the first to provide a deeper insight into the relationship between measurement invariance and predictive invariance in large-scale assessment studies and thus contributes, for example, to the research agenda of the IEA-ETS Research Institute, it has some limitations. The first is the index that we used. While it made sense for us to focus on the HRL index, it could be interpreted as a formative variable. As such, studying the relationship between measurement invariance and predictive invariance with the more reflective indices that are also part of, for example, TIMSS and PIRLS, seems advisable. In addition, the applied measurement model could be more exhaustive if it took into account the multilevel structure of the data and gave consideration to scaling models that have more parameters (or dimensions). More generally, we did not know the true parameters of the models (both the scaling model and the prediction model) when we conducted our study. This lack of knowledge meant that we were unable to estimate the unbiased effect of the scaling model on the prediction coefficients. This consideration calls for the implementation of another design, such as that used in simulation studies. Despite these limitations, we consider that the general inconsistency of measurement invariance and predictive invariance found in this study will remain valid even when these limitations have been satisfactorily resolved. We therefore think it safe to state that assuming measurement invariance of background indicators in cross-national studies of educational achievement is a challenge that needs to be addressed by anyone endeavoring to interpret cross-national differences in achievement.
The data sets are freely available under http://timss.bc.edu/timsspirls2011/internationaldatabase.html.
These data sets contained all necessary variables for the analysis. For a detailed description of the data sets, see Foy (2013).
Copyright © 2002–2012 SAS Institute Inc. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc., Cary, NC, USA.
Due to iteration problems, the GLMM could not be fitted to nine countries: Botswana, Dubai (UAE), Hong Kong SAR, Northern Ireland, Norway, Quebec (Canada), Russian Federation, and United Arab Emirates. The student samples from these countries were therefore not used in this study.
Note that the newly created HRL indices were not, as was the case with the original HRL index, transformed to an \(N\sim (10.03, 1.82)\) metric. Instead, we left the scaling metric \(N\sim (0,1)\) unchanged. We chose to do this because the transformation that Martin et al. (2013) applied made sense when the latent variable was measured on the same scale, that is, when measurement invariance between countries was assumed. When country-specific models were assumed for the HRL index, some equating procedures between the country-specific distributions of the HRL index first had to be applied to make the transformation of these values meaningful. However, analyzing the influence of different equating procedures on the HRL index and thus on the GLMM results was beyond the scope of this paper.
Abbreviations
AIC: Akaike’s information criterion
BIC: Bayesian information criterion
GLMM: generalized linear mixed model
ET: early literacy tasks/early numeracy tasks
HRL: home resources for learning
IEA: International Association for the Evaluation of Educational Achievement
MAP: maximum a posteriori probability
PIRLS: Progress in International Reading Literacy Study
TIMSS: Trends in International Mathematics and Science Study
WLE: weighted likelihood estimate
Declarations
Authors’ contributions
All authors made substantial contributions to the conception and the design of the study. In addition, HW provided the data sets for the analysis and DK conducted the analysis. DK drafted the manuscript. All authors made substantial contributions to the interpretation of the results. All authors read and approved the final manuscript.
Acknowledgements
The authors acknowledge the PIRLS/TIMSS International Study Center and Boston College for providing the technical documentation that allowed the replication of the key reference models published in Martin et al. (2013). The authors further acknowledge Wilfried Bos and the anonymous reviewers for the attention and expertise they generously shared to support the production of this paper. We finally thank Daniel Scott Smith and Paula Wagemaker for pre-submission English editing support.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans Autom Control, 19, 716–723.
 Bourdieu, P. (1986). The forms of capital. In J. Richardson (Ed.), Handbook of theory and research for the sociology of education (pp. 241–258). New York: Greenwood.
 Bos, W., Wendt, H., Köller, O., & Selter, C. (2012). TIMSS 2011. Mathematische und naturwissenschaftliche Kompetenzen von Grundschulkindern in Deutschland im internationalen Vergleich. Münster: Waxmann.
 Brennan, R. L. (2001). Generalizability theory. New York: Springer.
 Caro, D., Sandoval-Hernandez, A., & Lüdtke, O. (2014). Cultural, social and economic capital constructs: An evaluation using exploratory structural equation modeling. Sch Eff Sch Improv, 25, 433–450.
 Caro, D., & Sandoval-Hernandez, A. (2012). An exploratory structural equation modeling approach to evaluate sociological theories in international large-scale assessment studies. Paper presented at the annual meeting of the American Educational Research Association 2012.
 Çetin, B. (2010). Cross-cultural structural parameter invariance on PISA 2006 student questionnaire. Eurasian J Educ Res, 38, 71–89.
 Coleman, J. S. (1988). Social capital in the creation of human capital. Am J Sociol, 94, 95–120.
 Fischer, G. H., & Molenaar, I. W. (1995). Rasch models. Foundations, recent developments, and applications. New York: Springer.
 Foy, P. (2013). TIMSS and PIRLS 2011 user guide for the fourth grade combined international database. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College and International Association for the Evaluation of Educational Achievement (IEA).
 Hansson, Å., & Gustafsson, J.-E. (2013). Measurement invariance of socioeconomic status across migrational background. Scand J Educ Res, 57, 148–166.
 Karim, M. R., & Zeger, S. L. (1992). Generalized linear models with random effects: Salamander mating revisited. Biometrics, 48, 631–644.
 Kasper, D. (2017). Multiple group comparisons of the fixed effects from the generalized linear mixed model. (In preparation)
 Lakin, J. M. (2012). Multidimensional ability tests and culturally and linguistically diverse students: Evidence of measurement invariance. Learn Individ Differ, 22, 397–403.
 Lüdtke, O., Marsh, H. W., Robitzsch, A., & Trautwein, U. (2011). A 2 \(\times\) 2 taxonomy of multilevel latent contextual models: Accuracy-bias tradeoffs in full and partial error correction models. Psychol Methods, 16, 444–467.
 Martin, M. O., & Mullis, I. V. S. (2012). Methods and procedures in TIMSS and PIRLS 2011. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College. http://timss.bc.edu/methods/index.html. Accessed 20 Feb 2017.
 Martin, M. O., & Mullis, I. V. S. (2013). TIMSS and PIRLS 2011: Relationships among reading, mathematics, and science achievement at the fourth grade—implications for early learning. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College and International Association for the Evaluation of Educational Achievement (IEA).
 Martin, M. O., Mullis, I. V. S., Foy, P., Olson, J. F., Erberber, E., & Preuschoff, C. (2008). TIMSS 2007 international science report: Findings from IEA’s trends in international mathematics and science study at the fourth and eighth grades. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
 Martin, M., Mullis, I. V. S., Foy, P., & Stanco, G. M. (2012). TIMSS 2011 international results in science. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
 Martin, M. O., Foy, P., Mullis, I. V. S., & O’Dwyer, L. M. (2013). Effective schools in reading, mathematics, and science at the fourth grade. In M. O. Martin & I. V. S. Mullis (Eds.), TIMSS and PIRLS 2011: Relationships among reading, mathematics, and science achievement at the fourth grade—implications for early learning. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College and International.
 Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
 McCulloch, C. E., & Searle, S. R. (2001). Generalized, linear, and mixed models. New York: Wiley.
 Millsap, R. E. (1995). Measurement invariance, predictive invariance, and the duality paradox. Multivar Behav Res, 30, 577–605.
 Millsap, R. E. (1997). Invariance in measurement and prediction: Their relationship in the single-factor case. Psychol Methods, 2, 248–260.
 Millsap, R. E. (1998). Group differences in regression intercepts: Implications for factorial invariance. Multivar Behav Res, 33, 403–424.
 Millsap, R. E. (2007). Invariance in measurement and prediction revisited. Psychometrika, 72, 461–473.
 Mullis, I. V. S., Martin, M. O., Kennedy, A. M., & Foy, P. (2007). PIRLS 2006 international report. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
 Mullis, I. V. S., Martin, M. O., Foy, P., Olson, J. F., Preuschoff, C., Erberber, E., et al. (2008). TIMSS 2007 international mathematics report: Findings from IEA’s trends in international mathematics and science study at the fourth and eighth grades. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
 Mullis, I. V. S., Martin, M. O., Foy, P., & Drucker, K. T. (2012a). PIRLS 2011 international results in reading. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
 Mullis, I. V. S., Martin, M. O., Foy, P., & Arora, A. (2012b). TIMSS 2011 international results in mathematics. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
 Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Appl Psychol Meas, 16, 159–176.
 Nagengast, B., & Marsh, H. W. (2013). Motivation and engagement in science around the globe: Testing measurement invariance with multigroup structural equation models across 57 countries using PISA 2006. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment. Background, technical issues, and methods of data analysis, Chap. 15 (pp. 318–344). Boca Raton: Chapman and Hall/CRC.
 OECD. (2014a). PISA 2012 results: What students know and can do—student performance in mathematics, reading and science (Vol. I, Revised edition, February 2014). Paris: PISA OECD Publishing.
 OECD. (2014b). PISA 2012: Technical report. Paris: PISA, OECD Publishing.
 Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models. Applications and data analysis methods. London: Sage Publications.
 Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
 Schulte, K., Nonte, S., & Schwippert, K. (2013). Die Überprüfung von Messinvarianz in international vergleichenden Schulleistungsstudien am Beispiel der Studie PIRLS [Testing measurement invariance in international large scale assessments using the example of PIRLS data]. Zeitschrift für Bildungsforschung, 3, 99–118.
 Schulz, W. (2005). Testing parameter invariance for questionnaire indices using confirmatory factor analysis and item response theory. Paper prepared for the Annual Meetings of the American Educational Research Association in San Francisco. http://files.eric.ed.gov/fulltext/ED493509.pdf. Accessed 20 Feb 2017.
 Schwarz, G. (1978). Estimating the dimension of a model. Ann Stat, 6(6), 461–464.
 Segeritz, M., & Pant, H. A. (2013). Do they feel the same way about math? Testing measurement invariance of the PISA “students’ approaches to learning” instrument across immigrant groups within Germany. Educ Psychol Meas, 73, 601–630.
 Smith, D. S., Wendt, H., & Kasper, D. (2016). Social reproduction and sex in German primary schools. Compare J Comp Int Educ. doi:10.1080/03057925.2016.1158643.
 van den Heuvel-Panhuizen, M., Robitzsch, A., Treffers, A., & Köller, O. (2009). Large-scale assessment of change in student achievement: Dutch primary school students’ results on written division in 1997 and 2004 as an example. Psychometrika, 74, 351–365.
 Wang, S., & Wang, T. (2001). Precision of Warm’s weighted likelihood estimates for a polytomous model in computerized adaptive testing. Appl Psychol Meas, 25, 317–331.
 Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.
 Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
 Zeger, S. L., & Karim, M. R. (1991). Generalized linear models with random effects; A Gibbs sampling approach. J Am Stat Assoc, 86, 79–86.