Assuming measurement invariance of background indicators in international comparative educational achievement studies: a challenge for the interpretation of achievement differences
Large-scale Assessments in Education, volume 5, Article number: 10 (2017)
Abstract
Background
Large-scale cross-national studies designed to measure student achievement use different social, cultural, economic and other background variables to explain observed differences in that achievement. Prior to their inclusion into a prediction model, these variables are commonly scaled into latent background indices. To allow cross-national comparisons of the latent indices, measurement invariance is assumed. However, it is unclear whether the assumption of measurement invariance has some influence on the results of the prediction model, thus challenging the reliability and validity of cross-national comparisons of predicted results.
Methods
To establish the effect size attributed to different degrees of measurement invariance, we rescaled the ‘home resource for learning index’ (HRL) for the 37 countries (\(n=166,709\) students) that participated in the IEA’s combined ‘Progress in International Reading Literacy Study’ (PIRLS) and ‘Trends in International Mathematics and Science Study’ (TIMSS) assessments of 2011. We used (a) two different measurement models [one-parameter model (1PL) and two-parameter model (2PL)] with (b) two different degrees of measurement invariance, resulting in four different models. We introduced the different HRL indices as predictors in a generalized linear mixed model (GLMM) with mathematics achievement as the dependent variable. We then compared three outcomes across countries and by scaling model: (1) the differing fit values of the measurement models, (2) the estimated discrimination parameters, and (3) the estimated regression coefficients.
Results
The least restrictive measurement model fitted the data best, and the degree of assumed measurement invariance of the HRL indices influenced the random effects of the GLMM in all but one country. For one-third of the countries, the fixed effects of the GLMM also related to the degree of assumed measurement invariance.
Conclusion
The results support the use of country-specific measurement models for scaling the HRL index. In general, equating procedures could be used for cross-national comparisons of the latent indices when country-specific measurement models are fitted. Cross-national comparisons of the coefficients of the GLMM should take into account the applied measurement model for scaling the HRL indices. This process could be achieved by, for example, adjusting the standard errors of the coefficients.
Background
Introduction
In order to report international trends in educational achievement over time and to compare achievement results across countries, the International Association for the Evaluation of Educational Achievement (IEA) conducts, among other studies, regular iterations of the Progress in International Reading Literacy Study (PIRLS) and the Trends in International Mathematics and Science Study (TIMSS). PIRLS has assessed the reading comprehension achievement of fourth-grade students every 5 years since 2001 (Mullis et al. 2012a), while TIMSS has assessed the mathematics and science achievement of fourth- and eighth-grade students every 4 years since 1995 (Martin et al. 2012). In 2011, IEA conducted both studies jointly for the first time. Thirty-four countries and three benchmark participants collected data on Grade 4 students’ educational achievement in three competence domains: reading comprehension, mathematics, and science (Martin and Mullis 2013).
In their efforts to explain observed achievement differences in the data from large-scale assessment studies, researchers have increasingly combined different background indicators (Bos et al. 2012; Martin et al. 2008; Mullis et al. 2007, 2008; OECD 2014a) by scaling them into latent background variables. Scaling these variables usually requires application of an item response theory (IRT) model (Martin and Mullis 2012; OECD 2014b). The approach has several advantages, among which is the ability to control the measurement errors in the manifest variables. Controlling for measurement error is especially important in educational research studies because the multilevel prediction models commonly used in this area are very sensitive to these errors (Lüdtke et al. 2011).
Although using IRT models to scale latent background variables before including them in a prediction model works very well in large-scale assessment studies, the method presents several challenges (van den Heuvel-Panhuizen et al. 2009). First, researchers wanting to use latent indices instead of manifest indicators need to develop a coherent theoretical framework for the construct they intend to measure. Second, they need to define the assessment’s desired target population and the sampling procedure. Third, they need to choose not only a suitable measurement model for the construct but also a statistical model that will allow them to scale the latent indices according to this model. Finally, they must specify a useful and appropriate prediction model.
These tasks also need to be considered within the context of two central challenges that researchers face when conducting cross-national studies of educational achievement. The first centers on the need to ensure that the indices used for international comparison are comparable across the countries participating in each study (Nagengast and Marsh 2013), and the second concerns the need to ensure that the latent variables are comparable across the participating countries. Researchers conducting these large-scale assessment studies usually endeavor to meet these challenges by assuming measurement invariance across countries when they scale the latent indices. However, as work by Millsap (1995, 1997, 1998, 2007) shows, measurement invariance and predictive invariance are generally inconsistent with each other. Thus, when researchers assume that there will be measurement invariance across countries and then, during data analysis, use the scaled latent indices as predictors in the country-specific prediction models, the prediction coefficients across countries will only be the same under very restricted conditions. However, researchers are unlikely to deem these conditions reasonable in practice. What is obvious here is that the different decisions that those designing large-scale assessment studies must make before latent indices can be used will influence the results of these studies. Generalizability theory calls these sources of influence facets or dimensions, and emphasizes that researchers must take the variance in those research results that can be traced back to these dimensions into account before they attempt to generalize the results (Brennan 2001).
The aim of the study presented in this paper was to investigate the extent to which the assumption of cross-national measurement invariance of latent background variables affected the results of prediction models that use these indices as predictors in large-scale assessment studies. To achieve this aim, we reanalyzed the PIRLS/TIMSS 2011 data that Martin et al. (2013) used in their study on effective school environment. We considered this study especially useful for the desired purpose because Martin and colleagues used latent indices scaled under the assumption of cross-national measurement invariance as predictors in their country-specific hierarchical linear models and then compared the results of these models across the countries. We considered that reanalyzing these data sets by allowing different degrees of cross-national measurement invariance could help to answer the question of whether this assumption has an influence on (1) the cross-national comparisons performed by Martin et al. (2013) in particular, and (2) the results of large-scale assessment studies that use a design comparable to the one Martin and his colleagues employed in general. We begin by providing a summary of the study by Martin and his colleagues (2013). We then describe how we conducted our study, before presenting the results from that study and a discussion of those findings.
Assessment of Martin et al.’s study
Overview
Martin et al. (2013) performed a “school effectiveness” analysis of data from the 37 countries that participated in PIRLS/TIMSS 2011. According to Martin and his colleagues (2013, p. 111), “School effectiveness analyses seek to improve educational practice by studying what makes for a successful school beyond having a student body where most of the students are from advantaged socioeconomic backgrounds.” In their analysis, Martin et al. used five school effectiveness variables and two student home background variables as predictors in the country-specific hierarchical linear models. They used students’ achievement scores (reading comprehension, mathematics achievement, science achievement) as dependent variables. Because the goal of the study was to “present an analytic framework that could provide an overview of how these relationships vary across countries” (Martin et al. 2013, p. 110), the results from the hierarchical linear modeling could be assumed to be comparable across the participating countries.
One of the major findings of the study by Martin et al. (2013) was that the strength of the relationships between the school effectiveness variables and the student achievement scores decreased substantially in nearly all 37 countries when Martin et al. included the home background control variables in their models; country-specific effects were also apparent. For example, in 15 countries, only one out of the five effectiveness indicators still presented a statistically significant prediction coefficient after Martin and his colleagues had controlled for students’ home background. In four countries, three prediction coefficients remained significant. If the results of these analyses were, in fact, comparable across countries, in most countries the strength of the relationships between school effectiveness variables and student achievement should be relatively weak after controlling for student home background.
However, by scaling the school effectiveness variables and the home background variables as latent variables, Martin et al. (2013) assumed measurement invariance across countries (see the next section). Thus, it is also possible that the cross-national variation of the prediction coefficients of the school effectiveness variables and the home background variables was at least partially a methodological artifact due to the general inconsistency of measurement invariance and predictive invariance. Studying the relationship between assumed measurement invariance and the observed prediction coefficients more closely therefore seems worthwhile. We accordingly decided that reanalyzing one of the data sets that Martin et al. (2013) used would be a useful exercise. We determined we could rescale one of the home background control variables (the “home resources for learning scale”, hereafter HRL) while assuming different degrees of cross-national measurement invariance. We could then, in an effort to explain students’ mathematics achievement, introduce the rescaled variable as a predictor in a generalized linear mixed model (GLMM).
We considered that restricting our reanalysis to only one independent variable out of the eight and one dependent variable out of the three that Martin et al. (2013) used would lead to a valuable reduction of complexity, particularly given that no other study has yet analyzed the relationship between measurement invariance and predictive invariance in large-scale assessment study data. Therefore, nothing is known about possible interaction or compensatory effects in situations where the relationship between measurement invariance and predictive invariance affects more than one latent variable. We believed a reduced model would consequently increase the likelihood of finding such effects in the PIRLS/TIMSS data sets.
Also, because the selection of the HRL indices is somewhat arbitrary, we decided it would make sense to concentrate on the HRL variable. Many large-scale assessment studies have shown that the cross-national assumption of measurement invariance is unlikely to hold for social background variables (see, for example, Caro and Sandoval-Hernandez 2012; Hansson and Gustafsson 2013; Lakin 2012). Therefore, rescaling the HRL in a way that assumes measurement noninvariance would be consistent with the findings of this prior research. In addition, it is plausible to assume that indicators of the HRL indices will show country-specific characteristics. For example, the indicator “students have own room at home” could, in some countries, be a very important indicator with respect to differentiating students with many home resources from students with only a few home resources. However, in most of the countries participating in PIRLS/TIMSS 2011, this indicator was unlikely to be a strong one because nearly all of the students had their own room at home. In terms of the IRT approach, this indicator should therefore show cross-national variation in the discrimination parameter.
It is useful at this point to outline the procedures on which the study of Martin et al. (2013) was based, especially those used to scale the HRL indices. This explanation may seem unnecessary given the wealth of literature on IRT models, but we consider it necessary for two reasons. First, Martin et al. did not explicitly use the term measurement invariance in their report. We are therefore left with the notion that they simply assumed there was measurement invariance. Second, a clear description of the scaling model they used is required to illustrate why we deemed it necessary to use a modified version of this model in our reanalysis. We also considered it necessary to introduce the prediction model.
Scaling procedures used to develop the HRL index
Martin et al. (2013) used, as indicators for the HRL index, three items from the PIRLS 2011 home questionnaire (the “Learning to Read Survey”) given to the parents of the students who participated in the study, and two items from the PIRLS 2011 student questionnaire. The home questionnaire items were “number of children’s books in the home,” “highest level of education of either parent,” and “highest level of occupation of either parent.” The student questionnaire items were “number of books in the home” and “number of home study supports” (see Table 1). The PIRLS and TIMSS studies use these items as indicators of the economic and cultural capital of students’ families (Mullis et al. 2012b). The positive association between these indicators and student achievement is evident in many of the reported findings from large-scale studies of educational achievement (see, for example, Martin et al. 2008, 2012; Mullis et al. 2007, 2008, 2012a; OECD 2014a). In line with Bourdieu’s (1986) work on cultural capital, the HRL index can thus be interpreted as a measure of students’ socioeconomic and cultural home learning environments (Smith et al. 2016).
Determining an appropriate scaling procedure for the HRL index presented Martin et al. (2013) with a statistical challenge. They decided to use the partial credit model (Masters 1982; Wright and Masters 1982) to derive the index. That is, assuming \(i=1, \ldots , p\) are p items with \(k_i=0, \ldots , m_i\) response levels, then
\[\text{Pr}_g(x_{gij}=k_i \mid \theta _{gj}, \varvec{\xi }_{\varvec{gi}}) = \frac{\exp \left( \sum _{t=0}^{k_i}(\theta _{gj}-\tau _{gti})\right) }{\sum _{h=0}^{m_i}\exp \left( \sum _{t=0}^{h}(\theta _{gj}-\tau _{gti})\right) }\]
gives the probability of a response in category \(k_i\) of item i for a person j (\(j=1, \ldots , n_g\)) in group g (\(g=1, \ldots , G\)) with the latent value \(\theta _{gj}\) and an item i with a group-specific item parameter vector \(\varvec{\xi '}_{\varvec{gi}}=\begin{pmatrix} \tau _{g0i}&\ldots&\tau _{gm_ii}\end{pmatrix}\), where \(\tau _{gti}\) is the tth threshold location of item i in group g on a latent continuum. For identification purposes, it is usually assumed that \(\tau _{g0i}=0\) for all g and i. In addition, applications of the partial credit model frequently assume local independence. Accordingly, given the value of \(\theta _{gj}\), the item responses should be conditionally independent. This means that
\[\text{Pr}_g(\varvec{x'}_{\varvec{gj}} \mid \theta _{gj}, \varvec{\Xi }_{\varvec{g}}) = \prod _{i=1}^{p}\text{Pr}_g(x_{gij} \mid \theta _{gj}, \varvec{\xi }_{\varvec{gi}}),\]
where \({\varvec{x'}_{\varvec{gj}}}=\begin{pmatrix} x_{g1j}&\cdots&x_{gpj}\end{pmatrix}\) is the response vector of person j in group g and \(\varvec{\Xi }_{\varvec{g}}\) is a \(i \times p\) block-diagonal matrix, with the item parameter vectors \({\varvec{\xi '}_{\varvec{g1}}} \cdots {\varvec{\xi '}_{\varvec{gp}}}\) in the diagonal.
Different procedures exist for the estimation of \(\varvec{\Xi }_{\varvec{g}}\) and \(\theta _{gj}\), given the observed data \({\varvec{X}}_{\varvec{g}}={(x_{ij})}_g\) (Fischer and Molenaar 1995). In order to estimate the item parameters (a procedure also known as item calibration), Martin et al. (2013) used the marginal maximum likelihood approach. According to this approach, the marginal likelihood of \(\varvec{X}\) in group g is
\[L(\varvec{X}_{\varvec{g}} \mid \varvec{\Xi }_{\varvec{g}}) = \prod _{j=1}^{n_g}\int \text{Pr}_g(\varvec{x'}_{\varvec{gj}} \mid \theta , \varvec{\Xi }_{\varvec{g}})\,\phi _g(\theta )\,d\theta .\]
This likelihood is maximized with respect to \({{\varvec{\Xi }}_{{g}}}\), where \(\phi _g(\theta )\) is the population density function for \(\theta\) in group g (in the case of the HRL index, it was assumed that \(\theta \sim \text{ N }_g(0,1)\)).
For the calibration of the item parameters, Martin et al. (2013) used the combined data from the 37 countries participating in both TIMSS and PIRLS 2011, with each country contributing equally to the calibration. This was achieved by weighting each country’s student data to sum up to 500. The item parameters across groups were therefore fixed \({\varvec{\Xi }_\mathbf{{1}}}=\cdots = {\varvec{\Xi }_{\mathbf{37}}}\), which also meant that \(\text{Pr}_1({\varvec{x'}} \mid \theta )=\cdots =\text{Pr}_{37}({\varvec{x'}} \mid \theta ).\) Hence, Martin et al. assumed, with respect to the HRL index, measurement invariance across participating countries. If we assume that the item responses are conditionally independent across groups, then the marginal likelihood of \(\varvec{X}\) would be
\[L(\varvec{X} \mid \varvec{\Xi }) = \prod _{g=1}^{G}\prod _{j=1}^{n_g}\left[ \int \text{Pr}(\varvec{x'}_{\varvec{gj}} \mid \theta , \varvec{\Xi })\,\phi (\theta )\,d\theta \right] ^{w_j}, \tag{1}\]
where \(w_j\) is a person-specific weighting factor so that each country’s student data sums up to 500.
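The equal-contribution weighting described above can be illustrated with a short Python sketch. It assumes that each student already carries a positive sampling weight and simply rescales the weights within each country so that they sum to 500. The function name `rescale_country_weights`, the country labels, and the example weights are hypothetical and are not taken from the PIRLS/TIMSS data files.

```python
def rescale_country_weights(weights_by_country, target=500.0):
    """Rescale each country's weights so that they sum to `target`.

    `weights_by_country` maps a country label to a list of positive
    student-level weights (hypothetical input format).
    """
    rescaled = {}
    for country, weights in weights_by_country.items():
        total = sum(weights)
        # Every weight is multiplied by the same country-specific factor,
        # so relative weights within a country are preserved.
        rescaled[country] = [w * target / total for w in weights]
    return rescaled


# Two made-up countries with very different sample sizes and weight totals.
example = {"A": [10.0, 30.0], "B": [1.0, 1.0, 2.0]}
weights = rescale_country_weights(example)
```

After rescaling, both hypothetical countries contribute a total weight of 500 to the pooled calibration, regardless of how many students were sampled in each.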
Once the items have been calibrated by maximizing Eq. (1) with respect to \({\varvec{\Xi }}\), estimators of \(\theta\) can be obtained by maximizing the weighted likelihood function,
\[g(\theta )\,L({\varvec{x'}} \mid \theta ,{\varvec{\Xi }}),\]
where \(g(\theta )\) is a function of the first and second partial derivatives of \(L({\varvec{x'}} \mid \theta ,{\varvec{\Xi }})\) with respect to \(\theta\). This approach is known as weighted likelihood estimation, and the resulting estimator is called Warm’s weighted likelihood estimator (WLE; Warm 1989), which has been shown to produce less bias than the unweighted maximum likelihood estimator of \(\theta\).
The prediction model used to explain student achievement
To explain the achievement differences among the fourth-grade students participating in PIRLS/TIMSS 2011, Martin et al. (2013) used the WLE estimators \({\hat{\varvec{\theta '}}_{\varvec{g}}} = \begin{pmatrix} \hat{\theta }_{g1}&\cdots&\hat{\theta }_{gn_{g}} \end{pmatrix}\) of the HRL index and the average of two other indices (“early literacy tasks” and “early numeracy tasks”) as predictors in their country-specific hierarchical linear models (which they called the Home Background Control Model). For example, for a given country g, let \(y_{us}\) be the achievement value of student u in school s (\(s=1, \cdots , N_g\)), \(\hat{\theta }_{Hus}\) be the corresponding value on the WLE estimate of the HRL index, and \(\hat{\theta }_{Eus}\) be the average value of the early literacy tasks and the early numeracy tasks indices. The combined model for explaining achievement is therefore
\[y_{us} = \gamma _{00} + \gamma _{01}\hat{\bar{\theta }}_{Hs} + \gamma _{02}\hat{\bar{\theta }}_{Es} + \gamma _{10}\hat{\theta }^*_{Hus} + \gamma _{20}\hat{\theta }^*_{Eus} + \alpha _{1s}\hat{\theta }^*_{Hus} + \alpha _{2s}\hat{\theta }^*_{Eus} + u_{0s} + r_{us}.\]
In this equation, \(\gamma\) represents the intercept and the fixed effects of the predictors, \(\alpha\) are random effects representing variation in the fixed effects across schools, \(\hat{\theta }^*\) are the school mean-centered WLEs, and \(\hat{\bar{\theta }}\) are the school averages of the respective WLEs. Note that the u and r are error terms associated with the school and the individual. Note also the assumption that y and r are normally distributed (Raudenbush and Bryk 2002).
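The two versions of each predictor that enter this model, the school mean-centered value and the school average, can be sketched in a few lines of Python. The function name `center_within_schools` and the toy scores and school labels are hypothetical; the sketch only illustrates the centering step and is not the procedure used to prepare the PIRLS/TIMSS data.

```python
from collections import defaultdict


def center_within_schools(values, school_ids):
    """Return (school-mean-centered values, school mean for each student)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, school in zip(values, school_ids):
        sums[school] += value
        counts[school] += 1
    means = {school: sums[school] / counts[school] for school in sums}
    # Centered score: deviation of each student from his or her school mean.
    centered = [v - means[s] for v, s in zip(values, school_ids)]
    # School average: the same value repeated for every student in a school.
    school_avg = [means[s] for s in school_ids]
    return centered, school_avg


# Three made-up students in two made-up schools.
centered, school_avg = center_within_schools([1.0, 3.0, 4.0], ["s1", "s1", "s2"])
```

Separating the within-school deviation from the school average in this way lets the model estimate a student-level and a school-level coefficient for the same index.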
We should mention, however, that Martin et al. (2013) only included the random effects when there was significant variation in the relationship between the WLEs and achievement across schools and only when they could estimate this relationship reliably. Furthermore, they usually used the variance components \(\sigma ^2_{\alpha }=\text{var}(\alpha )\) and not the coefficients for \(\alpha\) to estimate these effects. In addition, because Martin et al. used plausible values for y, they performed all analyses five times and averaged the results according to Rubin’s formulas (Rubin 1987).
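Rubin’s combination rules for repeated analyses can be written down compactly. The following Python sketch pools a coefficient estimated once per plausible value: the pooled estimate is the mean of the separate estimates, and the total variance adds the average sampling variance to the between-analysis variance inflated by \(1+1/m\). The function name `combine_rubin` and the five example estimates are hypothetical and serve only as an illustration.

```python
import math


def combine_rubin(estimates, std_errors):
    """Pool m estimates and their standard errors with Rubin's rules."""
    m = len(estimates)
    pooled = sum(estimates) / m
    # Within-analysis variance: average of the squared standard errors.
    within = sum(se ** 2 for se in std_errors) / m
    # Between-analysis variance of the m point estimates.
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Made-up coefficients from five analyses, one per plausible value.
pooled_estimate, pooled_se = combine_rubin(
    [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0, 1.0]
)
```

In this toy example the pooled standard error exceeds each of the five individual standard errors because the estimates differ across the plausible values.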
Comments on Martin and colleagues’ procedures
In order to address the challenges identified above, the construct underlying the HRL index needed to be based on a coherent and robust theoretical framework. Such a framework can indeed be derived by drawing on various conceptualizations of capital (Bourdieu 1986; Coleman 1988). However, because the HRL index drew on only five indicators (from the many available), it was very narrowly defined. We consider that the index would have particularly benefited from inclusion of the more reliable and valid indicators of social reproduction (Caro et al. 2014). Martin et al.’s (2013) assumption of measurement invariance also merits consideration for two reasons. First, cross-national and comparative research in various disciplines challenges the validity of this assumption (Çetin 2010; Caro et al. 2014; Hansson and Gustafsson 2013; Schulte et al. 2013; Schulz 2005; Segeritz and Pant 2013). We assumed that at least some of the HRL indicators would show differential item functioning across the participating countries. For example, having an internet connection and/or a room of one’s own may be more discriminating indicators of social status among students in southern or eastern European countries than among students in central European countries. Also, it seems prudent to conceptualize highest level of occupation of either parent in terms of the characteristics of each country. For example, small business ownership might represent high social status in some countries but denote a broader category representing both lower and middle social status in other countries. These considerations suggest that the apparent lack of research studies on the invariance of the HRL index across countries needs to be remedied.
The second reason why critiquing the assumption of measurement invariance is critical relates to the general inconsistency of measurement invariance and predictive invariance shown in the work by Millsap (1995, 1997, 1998, 2007). If we assume that the HRL index presents no measurement invariance across countries, then the implication is that the variance of the coefficients of the hierarchical linear model across countries is a purely methodological artifact. In addition, where this methodological variance does exist, then, according to generalizability theory (Brennan 2001), it should be added to the actual variance of the coefficients across countries (by, for example, increasing the standard errors of the coefficients). However, enacting this proviso is difficult because the size of the effect between measurement invariance and predictive invariance is presently unclear. The same can be said of the relationships between different degrees of measurement invariance, different measurement models, and other (more general) prediction models.
The concerns we have expressed here led to the following research questions:

1. To what extent can measurement invariance across participating countries be assumed for the HRL index of Grade 4 students assessed in the combined PIRLS and TIMSS studies of 2011?
2. If the assumption of measurement invariance does not hold, to what extent do country-specific measurement models differ?
3. Is there an effect of different degrees of measurement invariance on the parameter estimates of the prediction model?
4. If there is an effect, how large is it?
In an effort to answer these questions, and as already indicated, we reanalyzed some of the data that Martin et al. (2013) used in their school effectiveness study.
We began by addressing the first research question. Here, we fitted two different measurement models with two different degrees of measurement invariance to the combined data and then used well-established fit criteria to compare the resulting models. To answer the second research question, we compared the discrimination parameters of the measurement models across countries, a procedure that allowed us to derive the country-specific measurement validity of the indicators. In order to answer the third research question, we introduced the different HRL indices as predictors in generalized linear mixed models (GLMMs) where mathematics achievement was the dependent variable. By comparing the regression coefficients across countries and across different measurement models, we were able to observe both the overall effect of different degrees of measurement invariance on the prediction coefficients and the country-specific effect on the coefficients. We also analyzed the variance component, that is, the random part of the hierarchical linear model, in the same manner as we analyzed the regression coefficients. A fuller explanation of how we conducted our analyses follows.
Methods
Data
We used the combined international data sets for all countries participating in PIRLS/TIMSS 2011.^{Footnote 1} We then drew from these data sets the country-specific data files named ASG***B1 and ASH***B1, where *** stands for a country-specific code, ASG denotes the fourth-grade student background data sets, and ASH denotes the corresponding home background data sets.^{Footnote 2} Our next step was to merge the different data sets, first according to countries and then according to data resources. This process resulted in a data set that included the student background data and the home background data for \(n=166,709\) Grade 4 students across 37 participating countries.
Scaling procedure
We used Muraki’s (1992) generalized partial credit model to scale the HRL index. We decided to apply this model instead of the partial credit model used by Martin et al. (2013) because it allows for modeling the different discrimination parameters of the indicators. The opportunity to model different discrimination parameters seemed to us especially important given the number of studies that show that the discrimination parameters of different social indicators vary across countries (see the section “Comments on Martin and colleagues’ procedures”). According to the generalized partial credit model, the probability of a response in category \(k_i\) (\(k_i=0, \ldots , m_i\)) of item i (\(i=1, \ldots , p\)) for a person j (\(j=1, \ldots , n_g\)) of group g (\(g=1, \ldots , G\)) is
\[\text{Pr}_g(x_{gij}=k_i \mid \theta _{gj}, \varvec{\xi }_{\varvec{gi}}) = \frac{\exp \left( \sum _{t=0}^{k_i}\alpha _{gi}(\theta _{gj}-\tau _{gti})\right) }{\sum _{h=0}^{m_i}\exp \left( \sum _{t=0}^{h}\alpha _{gi}(\theta _{gj}-\tau _{gti})\right) } \tag{3}\]
with the latent value \(\theta _{gj}\) and the item parameter vector \({\varvec{\xi '}_{\varvec{gi}}}=\begin{pmatrix} \alpha _{gi}&\tau _{g0i}&\ldots&\tau _{gm_ii}\end{pmatrix}\), where \(\tau _{gti}\) is the tth threshold location of item i in group g on a latent continuum and \(\alpha _{gi}\) is the group-specific discrimination parameter of item i. For identification purposes, it is usually assumed that \(\tau _{g0i}=0\) for all g and i. The partial credit model that Martin et al. used can be seen as a special case of the generalized partial credit model, with \(\alpha _{gi}=c\) for all i and g (normally \(c=1\)). However, the generalized partial credit model we used allowed different discrimination parameters between the items i and between the groups g.
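To make the role of the discrimination parameter concrete, the category probabilities of the generalized partial credit model can be computed as in the following Python sketch. The function name `gpcm_probs` and all parameter values are hypothetical; setting `alpha` to 1 gives the partial credit model as a special case.

```python
import math


def gpcm_probs(theta, alpha, taus):
    """Category probabilities of one item under the generalized partial credit model.

    `taus[t]` is the t-th threshold, with `taus[0]` fixed at 0 for identification.
    """
    # The log-numerator of category k is the cumulative sum of
    # alpha * (theta - tau_t) over t = 0, ..., k.
    log_num = []
    running = 0.0
    for tau in taus:
        running += alpha * (theta - tau)
        log_num.append(running)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_num)
    num = [math.exp(value - peak) for value in log_num]
    total = sum(num)
    return [value / total for value in num]


# Partial credit special case (alpha = 1) for a made-up three-category item.
pcm_example = gpcm_probs(0.0, 1.0, [0.0, -1.0, 1.0])

# Same made-up two-category item with low and high discrimination.
flat = gpcm_probs(1.0, 1.0, [0.0, 0.0])
sharp = gpcm_probs(1.0, 2.0, [0.0, 0.0])
```

For a person located above the threshold, the larger discrimination value moves more probability into the upper category, which is the kind of item-level difference that a country-specific \(\alpha _{gi}\) is meant to capture.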
We used the item response function (3) to estimate four different measurement models, each with different degrees of measurement invariance for the HRL index.

1. Model 1: In this model, all discrimination parameters \(\alpha _{gi}=c\) were held constant both between the items and across the countries, whereas the threshold parameters \(\tau _{gki}=\tau _{ki}\) were allowed to vary between items but were held constant across countries. This model was the same as the one used by Martin et al. (2013).
2. Model 2: In contrast to Model 1, the discrimination parameters \(\alpha _{gi}=c_g\) were held constant between the items but allowed to vary across the countries. However, the assumptions for the threshold structure were the same as those for Model 1.
3. Model 3: Here, discrimination parameters \(\alpha _{gi}=c_i\) were allowed to vary between the items but were held constant across the countries. Again, the threshold structure remained unchanged.
4. Model 4: All discrimination parameters \(\alpha _{gi}=c_{gi}\) were allowed to vary both between the items and across the countries. As before, the threshold structure remained unchanged.
According to this design, Model 1 was the most restrictive model because it assumed strict measurement invariance across the countries. Model 4 was the least restrictive model because it allowed for country-specific measurement models (at least with respect to the item discrimination parameters \(\alpha _{gi}\)).
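Table 1 depicts the items i, with their corresponding names in the international data sets, that were used for scaling the HRL index. We used the marginal maximum likelihood approach to calibrate the item parameters of the four models. After estimating the parameters, we used the maximum a posteriori (MAP) estimate to generate the scores for \(\theta\). The following formula describes the corresponding posterior distribution of \(\theta\):
\[p(\theta \mid \varvec{x'}_{\varvec{gj}}, \varvec{\Xi }_{\varvec{g}}) = \frac{\text{Pr}_g(\varvec{x'}_{\varvec{gj}} \mid \theta , \varvec{\Xi }_{\varvec{g}})\,\phi _g(\theta )}{\int \text{Pr}_g(\varvec{x'}_{\varvec{gj}} \mid \theta , \varvec{\Xi }_{\varvec{g}})\,\phi _g(\theta )\,d\theta }.\]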
Generally, this procedure results in more efficient estimates of \(\theta\) than the WLE approach, especially when there are only a few items to scale (\(p\le 10\); Wang and Wang 2001). However, the MAP bias seems slightly greater than the bias of the WLEs (at least under some circumstances). Overall, this procedure made it possible to derive four estimates of \(\theta\) for every student.
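A MAP score is the value of \(\theta\) that maximizes the posterior, that is, the likelihood of a response vector multiplied by the prior density. The following Python sketch finds it by a simple grid search under a standard normal prior and the generalized partial credit model. The grid search, the function names, and the two made-up items are illustrative assumptions only; they are not the estimation routine used in the study.

```python
import math


def gpcm_log_prob(theta, alpha, taus, k):
    """Log-probability of category k of one item (generalized partial credit model)."""
    log_num, running = [], 0.0
    for tau in taus:
        running += alpha * (theta - tau)
        log_num.append(running)
    peak = max(log_num)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in log_num))
    return log_num[k] - log_norm


def map_estimate(items, responses, grid_step=0.01, bound=6.0):
    """Grid-search MAP estimate of theta with a N(0, 1) prior.

    `items` is a list of (alpha, taus) pairs; `responses` holds the
    observed category of each item.
    """
    best_theta, best_value = 0.0, -math.inf
    steps = int(round(2 * bound / grid_step))
    for n in range(steps + 1):
        theta = -bound + n * grid_step
        # Log of the standard normal prior, up to an additive constant.
        log_post = -0.5 * theta ** 2
        # Local independence: item log-probabilities add up.
        for (alpha, taus), k in zip(items, responses):
            log_post += gpcm_log_prob(theta, alpha, taus, k)
        if log_post > best_value:
            best_theta, best_value = theta, log_post
    return best_theta


# Two identical made-up items with two categories each.
items = [(1.0, [0.0, 0.0]), (1.0, [0.0, 0.0])]
high = map_estimate(items, [1, 1])
low = map_estimate(items, [0, 0])
mixed = map_estimate(items, [1, 0])
```

With so few items the prior pulls the estimates toward zero, which illustrates why MAP scores tend to be less variable but somewhat more biased than WLEs.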
Prediction model
We used the generalized linear mixed model (GLMM) as the prediction model (Zeger and Karim 1991; Karim and Zeger 1992). We chose the GLMM as the framework rather than the hierarchical linear model applied by Martin et al. (2013) because cross-national comparisons of the fixed effects from the GLMM require use of a test statistic. However, the statistic we needed was not yet available, so we developed one as part of this study. Provision of the mathematical proof of this statistic, which we based on the GLMM, is beyond the scope of this paper. We have therefore covered this matter in a separate paper (see Kasper 2017). We also selected the GLMM because the hierarchical linear model is a special case of it, which means that nothing is lost when this framework is used. Use of the GLMM framework furthermore makes it easier for readers to follow the development and proof of the test statistic in Kasper (2017), and thus check the validity of our application of this test statistic in our current study. In order to use this very general prediction model [for a detailed description of it, see McCulloch and Searle (2001)] in our study, we needed to simplify some aspects of it. For example, because we used the plausible values of the Grade 4 students’ mathematics achievement as the dependent variable and assumed the random effects were normally distributed, we could also assume that the dependent variable \(y_g\) was approximately normally distributed in accordance with the assumptions made during generation of these plausible values (Martin and Mullis 2012). This approach led to a GLMM with identity link function \(g(\cdot )\), which meant that \(\varvec{\eta }_{\varvec{g}}={g}({\text{E}}(\varvec{y}_{g}))=\text{ E }({\varvec{y}}_{{g}})\) and
\[\varvec{y}_{\varvec{g}} = \varvec{X}_{\varvec{g}}\varvec{\beta }_{\varvec{g}} + \varvec{Z}_{\varvec{g}}\varvec{\alpha }_{\varvec{g}} + \varvec{e}_{\varvec{g}}.\]
Here, \(\varvec{y}_{\varvec{g}}\) is an \(n_g \times 1\) vector with the plausible values on mathematics achievement as the dependent variable; \(\varvec{X}_{\varvec{g}}\) is an \(n_g\times 5\) matrix with the school mean-centered values and the school average values of \(\theta _{Hg}\) and \(\theta _{Eg}\) in the columns (plus a constant vector of 1s for the intercept); \(\varvec{\beta }_{\varvec{g}}\) is a \(5 \times 1\) vector with the corresponding fixed effects; \(\varvec{Z}_{\varvec{g}}\) is an \(n_g \times 2s\) block matrix with two block-diagonal matrices each of size \(n_g \times s\) in the columns representing the random predictors; \(\varvec{\alpha }_{\varvec{g}}\) is a \(2s \times 1\) vector with the corresponding random effects; and \(\varvec{e}_{\varvec{g}}\) is an \(n_g \times 1\) vector of residuals.
Estimation of the coefficients of this model requires use of the pseudo-likelihood approach. However, due to the distributional assumptions about the dependent variable, \(\varvec{y}_{\varvec{g}}\) can be used in the pseudo-likelihood approach instead of the working variate \(\varvec{t}_{\varvec{g}}\). This alternative use results in a real objective function \(l({\varvec{\theta} }_{{g}},{\varvec{y}}_{{g}})\). The derived pseudo-likelihood estimates in our study were therefore formally equivalent to the restricted likelihood estimates of the fixed and random effects that Martin et al. (2013) derived in their study. Also, because we wanted to analyze the influence of different scaling procedures for the HRL index on the GLMM results, we introduced only the intercept and the slope of the HRL in the model as random effects. This meant that, unlike the study by Martin and colleagues, our study did not include a random slope for the early literacy/numeracy task indicator. However, the random effects could still be correlated and, given the random effects, it could then be assumed that the schools were independent, resulting in
\[\text{Var}(\varvec{\alpha }_{\varvec{g}})=\varvec{D}_{\varvec{g}}=\varvec{G}_{\varvec{g}} \otimes \varvec{I}_{\varvec{gs}},\]
where \(\otimes\) is the Kronecker product and \(\varvec{I}_{{gs}}\) is an identity matrix of order \({s}\).
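To make the Kronecker structure of the random-effect covariance concrete, the following minimal sketch builds \(\varvec{G} \otimes \varvec{I}_{s}\) for a small example. The values in `G` and the number of schools `s` are hypothetical and are not estimates from this study; the helper functions `kron` and `identity` are written here for illustration only. The sketch assumes that the random effects are stacked as all school intercepts followed by all school HRL slopes.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def identity(n):
    """Identity matrix of order n."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical 2 x 2 covariance matrix of (random intercept, random HRL slope).
G = [[400.0, -30.0],
     [-30.0, 25.0]]
s = 3  # hypothetical number of schools

# Covariance matrix of the 2s stacked random effects: schools are independent,
# but the intercept and slope of the same school may be correlated.
D = kron(G, identity(s))

for row in D:
    print(row)
```

In the resulting \(2s \times 2s\) matrix, the only non-zero off-diagonal entries link the intercept and the slope of the same school, which mirrors the assumption of correlated random effects within independent schools.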
Outcomes
Scaling models
In order to compare the scaling models, we calculated the log-likelihood, the Bayesian information criterion (BIC; Schwarz 1978) and Akaike’s information criterion (AIC; Akaike 1974) for each of the four models. We also calculated the variance of \(c_g\) across countries, the variance of \(c_i\) across items, the variance of \(c_{gi}\) across items (given country g), and the variance of \(c_{gi}\) across countries (given item i):
To test the hypotheses that these variances would be equal to zero, we used the \(\chi ^2\)-test. We also calculated the asymmetric confidence intervals for the different variance estimates. Thus, if \(\text{H}_{0}:\, \sigma ^2_k=t\) and \(s^2_k\) is an estimate of \(\sigma ^2_k\), then
\[\chi ^2_v = \frac{v\,s^2_k}{t}\]
with v degrees of freedom. However, because \(t=0\) is not a testable assumption, it was necessary to choose small values \(t>0\) for the respective \(\chi ^2\) calculations.
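The variance calculations and the \(\chi ^2\) statistic described above can be sketched as follows. This is a minimal illustration, not the SAS implementation used in the study: the discrimination parameters in `c`, the country labels, and the function names are hypothetical, and the statistic is taken to have the form \(v s^2_k/t\) that is stated for the fixed effects later in this section.

```python
from statistics import variance

# Hypothetical discrimination parameters c_gi: one list of item values per country.
c = {
    "A": [1.2, 1.5, 0.4, 2.1, 1.0],
    "B": [1.1, 1.6, 0.2, 3.0, 0.9],
    "C": [1.3, 1.4, 0.5, 1.8, 1.1],
}


def var_across_items(c):
    """Sample variance of c_gi across items, given country g."""
    return {g: variance(values) for g, values in c.items()}


def var_across_countries(c):
    """Sample variance of c_gi across countries, given item i."""
    n_items = len(next(iter(c.values())))
    return [variance([values[i] for values in c.values()]) for i in range(n_items)]


def chi_square_statistic(s2, v, t):
    """Statistic v * s2 / t for H0: sigma^2 = t, where t must be a small positive value."""
    if t <= 0:
        raise ValueError("t = 0 is not testable; choose a small t > 0")
    return v * s2 / t


print(var_across_items(c))
print(var_across_countries(c))
print(chi_square_statistic(s2=0.31, v=36, t=0.01))
```

The p-values and the asymmetric confidence limits additionally require quantiles of the \(\chi ^2\) distribution with v degrees of freedom, which could be obtained from, for example, `scipy.stats.chi2`.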
Comparison of the conditional variances \(s^2_{c_{gig}}\) and \(s^2_{c_{gii}}\) required use of two further approaches. The first involved calculation of the overall variances
and then (by using the above-mentioned \(\chi ^2\)-test and confidence intervals) testing of the hypothesis \(\text{H}_{0} : \sigma ^2_{c_{gi.g}}=\sigma ^2_{c_{gi.i}}=0\). The second approach, used whenever the results of these overall tests were significant, required multiple comparisons of \(s^2_{c_{gig}}\) across countries and of \(s^2_{c_{gii}}\) across items. We performed these comparisons by using \(\left[ G!/(G-2)!\,2!\right]\) times and \(\left[ p!/(p-2)!\,2!\right]\) times the F-ratio:
with \(K:=\left\{ 1, \ldots , G\right\}\) and \(L:=\left\{ 1, \ldots , p\right\}\), assuming that the variances are ordered by decreasing size.
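The multiple-comparison step can be illustrated with a short sketch. It assumes that each F-ratio divides the larger of two variances by the smaller one, which is one reading of "ordered by decreasing size"; the variance values and the function name `pairwise_f_ratios` are hypothetical.

```python
from itertools import combinations
from math import comb


def pairwise_f_ratios(variances):
    """Return F-ratios (larger variance / smaller variance) for all pairs.

    `variances` maps a label (country or item) to a positive variance estimate.
    """
    ordered = sorted(variances.items(), key=lambda kv: kv[1], reverse=True)
    return {
        (k, l): s2_k / s2_l
        for (k, s2_k), (l, s2_l) in combinations(ordered, 2)
    }


# Hypothetical conditional variances for three countries.
variances = {"A": 0.4, "B": 0.1, "C": 0.2}
ratios = pairwise_f_ratios(variances)

# The number of comparisons equals G! / ((G - 2)! 2!).
assert len(ratios) == comb(len(variances), 2)
for pair, f_value in ratios.items():
    print(pair, round(f_value, 3))
```

Because the variances are sorted before the pairs are formed, every ratio is at least 1, so only the upper tail of the F distribution is relevant for each comparison.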
Prediction model
To obtain an indication of the effect that the different scaling models had on the fixed and random effect coefficients of the GLMM, we performed different analyses. We based the analyses for the fixed effects on F and \(\chi ^2\) tests. Thus, if \(\hat{\varvec{\beta }}_{\varvec{gz}}\) are the estimated fixed effects for country g and scaling model \(z (z=1,\ldots , 4)\), then the hypothesis that a linear combination of the difference of the fixed effects between two scaling models w and \(q\ (w\not =q)\) equals a constant value \(\varvec{m}\), that is \(\text{H}_{0}: \varvec{L}_{\varvec{g}}(\varvec{\beta }_{\varvec{gw}}-\varvec{\beta }_{\varvec{gq}})={\varvec{m}}\), can be tested with
where \(\hat{\varvec{\beta} }_{\varvec{diff}}=\hat{\varvec{\beta} }_{\varvec{gw}}-\hat{\varvec{\beta} }_{gq}\) and \(\hat{\sigma _g}^2 = (\hat{\sigma _g}^2_w+\hat{\sigma _g}^2_q)/2\) is the pooled residual variance estimate for the separate GLMM models w and q. Under the null hypothesis, the test statistic is noncentral F-distributed with \(\text{ r }(\varvec{L}_{\varvec{g}})\) and \(n_g\) degrees of freedom [the proof is given in Kasper (2017)]. The F-statistic is calculated for each country separately under the assumption that the difference of the fixed effects between each non-redundant pair of scaling models is zero, that is, \(\varvec{L}=\varvec{I}\) and \(\varvec{m}=\varvec{0}\).
In addition to analyzing the global tests of significant difference between the fixed effects, we analyzed the variances of the respective fixed effects across scaling models (given a country) and the variance of the fixed effects across countries (given a fixed effect). Thus, if \(\hat{\beta }_{jgz}\) is the estimated fixed effect for predictor \(j\ (j=1, \ldots , 5)\) in country g given scaling model z, then the variance
is calculated for every combination of j and g. The hypotheses \(\text{H}_{0}: \sigma ^2_{\hat{\beta }_{jgz.g}}=t\) are then tested with \(\chi ^2_v = vs^2_{\hat{\beta }_{jgz.g}}/t\), where \(v=4-1\) are the respective degrees of freedom for this test. We next calculated the variance of the fixed effects across countries (given scaling model z). Here, the variances
are separately calculated for every combination of j and z, and then the hypotheses \(\text{H}_{0}:\sigma ^2_{\hat{\beta }_{jgz.z}}=t\) are tested with \(\chi ^2_v = vs^2_{\hat{\beta }_{jgz.z}}/t\), where \(v=G-1\) are the respective degrees of freedom for this test.
As with the analysis of the slope coefficients, whenever significant results emerged from these overall tests, we performed multiple comparisons of \(s^2_{\hat{\beta }_{jgz.g}}\) across countries and of \(s^2_{\hat{\beta }_{jgz.z}}\) across fixed effects by using \(\left[ G!/(G-2)!\,2!\right]\) times and \(\left[ j!/(j-2)!\,2!\right]\) times the F-ratio
with \(K:=\left\{ 1, \ldots , G\right\}\) and \(L:=\left\{ 1, \ldots , p\right\}\), assuming that the variances are ordered by decreasing size.
We used structural equation models to analyze the random effect coefficients. Here, the hypothesis that the covariance matrices of the random effect coefficients \(\varvec{D}_{\varvec{g}}=\varvec{G}_{\varvec{g}} \otimes \varvec{I}_{\varvec{gs}}\), given a country g, are equal across scaling models, that is, \(\text{H}_{0}: \varvec{G}_{\varvec{g1}}= \cdots = \varvec{G}_{\varvec{g4}}\), can be tested by calculating the overall discrepancy function value
with the restriction \(\boldsymbol{\varSigma} _{\varvec{g1}}= \cdots = \boldsymbol{\varSigma} _{\varvec{g4}}\) and \(t_{gz}=(n_g-1)/(4n_g-4)\). Under the null hypothesis, the overall discrepancy function value is approximately chi-square distributed \({\chi ^2_F} \approx {v} {F_g}(\varvec{\theta} )\) with v degrees of freedom. Significant \({\chi ^2_F}\) statistics therefore lead to rejection of the hypothesis that the scaling procedure has no influence on the random effect coefficients.
Dealing with missing values, weighting and software
We used a Markov chain Monte Carlo (MCMC) method to impute missing values in the indicators of the HRL indices. The imputation model included all indicators of the HRL indices and the plausible values of mathematics achievement, and so produced five complete data sets. Of course, a fully nested imputation strategy would have resulted in 25 imputed data sets (i.e., five imputed data sets for each plausible value). However, because Martin et al. (2013) applied only a single imputation strategy (which seemed to us an inaccurate approach to conducting an analysis that involves the analysis of variances), an increase from 1 to 25 imputations would have made it impossible to compare the results of this current paper with Martin and colleagues’ results. Every analysis in our study was performed once for every completed data set, and then the results were combined according to Rubin’s (1987) formula. Senwgt was used as the weighting variable for the scaling models. Senwgt summed up to a total sample size of students \(n_g=500\) for every country and so led to the equal weighting of the countries in the scaling process. The GLMM analysis, however, used houwgt, which sums up to the observed sample size of students for every country. Unless we state otherwise in this paper, all the analyses in our study were generated by way of Statistical Analysis System (SAS) software, Version 9.4 (TS1M1) of the SAS System for Windows.^{Footnote 3} We used the procedure MI to carry out the multiple imputations, the procedure IRT to scale the HRL index, the procedure GLIMMIX for the GLMM analysis, and the procedure CALIS for the structural equation models. We used the IML module insight of SAS to implement the derived test statistics.
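The combination of results across completed data sets can be sketched with Rubin's rules, under which the pooled estimate is the mean of the per-data-set estimates and the total variance is the mean within-imputation variance plus \((1+1/m)\) times the between-imputation variance. The sketch below is illustrative only: the input numbers are hypothetical, the function name `pool_rubin` is not part of any package, and the study itself used SAS rather than Python.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool m point estimates and their sampling variances with Rubin's rules."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(within_variances)   # average sampling variance
    between = variance(estimates)     # variance across the m estimates (divisor m - 1)
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance


# Hypothetical regression coefficients from five completed data sets
# and their squared standard errors.
estimates = [10.0, 12.0, 11.0, 13.0, 9.0]
within_variances = [4.0, 4.0, 4.0, 4.0, 4.0]

pooled, total = pool_rubin(estimates, within_variances)
print(pooled, total, total ** 0.5)
```

The square root of the total variance is the pooled standard error, which is larger than the average per-data-set standard error whenever the estimates differ across completed data sets.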
Results
Descriptive statistics
Table 1 shows the percentage of yes responses on the HRL scale items for the total sample of Grade 4 students \(n=138,103\).^{Footnote 4} Overall, the responses of the students were equiproportionally distributed across the response categories of the items. However, a highly skewed distribution was evident for the indicator “number of home study supports”: over 50% of the students had both an internet connection and their own room at home. Thus, for the majority of the students, this indicator provided no useful information. Also noteworthy is the relatively low percentage (7.9%) of parents who had completed only some primary or lower-secondary education or who had not attended school.
In order to verify that we had correctly implemented the scaling models, we replicated the original HRL index that Martin et al. (2013) used. Table 2 presents the descriptive statistics for these replicated values together with the newly created HRL indices, average student mathematics achievement scores, student sample sizes, and school sample sizes. The correlation between the original HRL index and the replicated HRL index (RP) was \(r=0.97\), suggesting that the scaling models were correctly implemented in this study (Table 3 shows the correlations between the other indices).^{Footnote 5}
When we compared the average values on the different HRL indices across the scaling models, we observed, on average, only small changes between the different indices per country. However, some noteworthy exceptions were apparent. These included changes of around 0.3 points for Germany, Honduras, Hungary, and Poland. Hence, for these countries, the influence of the scaling model on the average HRL indices was approximately one-third of a standard deviation of this index. For Malta, the influence of the scaling model on the average HRL indices was even more pronounced, at approximately two-thirds of a standard deviation of the HRL index.
Scaling models
We based our assessment of the accuracy of the four different measurement models used to scale the HRL index on three criteria: the log-likelihood (the higher the value, the better the fit), the AIC, and the BIC (the smaller the value, the better the fit). According to these criteria, the model that best fitted the data was the least restrictive scaling model, Model 4 (Table 4). We observed virtually no difference between Models 2 and 3. Model 1 (strict measurement invariance across the countries) had the worst fit. The analyses therefore support the assumption of country-specific scaling models for the HRL index and challenge the assumption of cross-national invariance of the HRL index.
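The two information criteria trade off fit against the number of estimated parameters: \(\text{AIC}=2k-2\ln L\) and \(\text{BIC}=k\ln (n)-2\ln L\), where k is the number of parameters, n the sample size, and \(\ln L\) the log-likelihood. The following sketch shows the comparison logic; the log-likelihoods, parameter counts, and sample size are hypothetical and do not reproduce the values in Table 4.

```python
from math import log


def aic(loglik, k):
    """Akaike's information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return k * log(n) - 2 * loglik


# Hypothetical (log-likelihood, number of parameters) per scaling model.
models = {
    "Model 1": (-1000.0, 10),
    "Model 4": (-800.0, 60),
}
n = 500  # hypothetical sample size

aic_values = {name: aic(ll, k) for name, (ll, k) in models.items()}
bic_values = {name: bic(ll, k, n) for name, (ll, k) in models.items()}

best_by_aic = min(aic_values, key=aic_values.get)
best_by_bic = min(bic_values, key=bic_values.get)
print(best_by_aic, best_by_bic)
```

Because the BIC penalizes additional parameters more heavily than the AIC for all but very small samples, agreement between the two criteria, as reported here, is a stronger indication than either criterion alone.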
With respect to the differential estimation of the fit of the four models, Table 5 shows the distribution of the varying discrimination parameters \(c_g\), \(c_i\) and \(c_{gi}\). When strict measurement invariance was assumed (Model 1), the estimated discrimination parameter was \(c=1.55\). When the discrimination parameter was allowed to vary across countries but was still constant between items (Model 2), cross-country variance in this parameter (\(c_g\)) was observed (\(s^2_{c_g}=0.31; CI_l=0.19, CI_u=0.57\)). In some countries (e.g., Australia, Ireland, Morocco, Romania), the HRL index measured the underlying construct with a higher degree of separation when a more country-specific scaling model was used. In other countries (e.g., Czech Republic, Georgia, Germany, Malta, Qatar, Slovenia), the differentiation became less distinct. Hence, in the first instance, the original HRL index underestimated the difference in HRL for Grade 4 students whereas in the second instance the original HRL index overestimated this difference.
With regard to the assumption that the contribution of the HRL items to the HRL index would vary while the influence of the items remained constant across countries (\(c_{i}\)), we found that the indicator “number of home study supports” was least informative with respect to the measured construct. This result supports the findings from the descriptive statistics: having a connection to the internet and/or one’s own room at home seems to have been the standard and not the exception for the fourth-grade students both within and across the countries participating in PIRLS/TIMSS 2011. The educational status of the students’ parents best explained the differences in the HRL index. The contrast between parents’ educational status and number of home study supports increased when the country-specific measurement models (\(c_{gi}\)) were assumed (Model 4). In this case, parents’ highest educational level contributed to the HRL index in most countries approximately two to four times more than the number of home study supports did. This finding suggests that the original HRL index overestimated the influence of all indicators, with the exception of “highest level of education of either parent” (the influence of which, in turn, was underestimated).
However, if we take a closer look at the distribution of the item-specific discrimination parameters across countries, that is, the variance of \(c_{gi}\) given item i, then it becomes obvious that the strong discriminating effect of parental highest educational level was not constant across countries (Table 6). The discrimination parameter was exceptionally high for Australia, Iran (Islamic Rep. of), Ireland, Malta, Morocco, Oman, Qatar, Saudi Arabia, Spain, and Abu Dhabi (United Arab Emirates; UAE) and lowest for Chinese Taipei and Honduras. Despite this indicator working very well for most (if not all) countries, it worked better in some of these countries than in others. The reverse was also observable for the low discriminating power of the number of home study supports: overall, this indicator differentiated poorly among Grade 4 students. Nonetheless, we could still observe a slight discrimination capacity in some countries (i.e., Australia, Chinese Taipei, Ireland, Morocco, Oman), although virtually no discriminating capacity in several other countries [i.e., Georgia, Germany, Hungary, Malta, Qatar, Singapore, Slovenia, Spain, Abu Dhabi (UAE)]. The indicator “highest level of education of either parent” exhibited the strongest discriminating capacity across most countries. These findings can perhaps be attributed to challenges to the cross-national validity of these indicators.
Finer-grained detail about the country-specific discriminating power of the HRL indicator became evident when we inspected the variance of the discrimination parameter \(c_{gi}\) across items given country g (Table 7). We observed highly differential discrimination parameters for the items for Qatar, Australia, Iran (Islamic Rep. of), Malta, Abu Dhabi (UAE), Spain, Poland, and Morocco. In these countries, parental highest educational level had the strongest influence on the HRL index. However, in most of the remaining countries (around two-thirds), the variance across the estimated item discrimination parameters was moderate or even low, indicating that the assumption of a one-dimensional construct for the HRL index was acceptable for these countries. Nevertheless, the observed significant difference in \(s^2_{gig}\) across the countries participating in PIRLS/TIMSS 2011 again confirms the assumption of measurement non-invariance of the HRL index, with that non-invariance apparently mostly attributable to the indicators of the highest level of education of either parent and the number of home study supports.
Prediction model
Figure 1 shows the distributions of the estimated fixed effects across countries for the different scaling models. Noticeably, there were no differences in the distribution for the fixed effects \(\hat{\beta }_0\), \(\hat{\beta }_1\) and \(\hat{\beta }_2\). It seems that the different scaling procedures used for the HRL index left untouched all the fixed effects that were not associated with the HRL index. However, the effects of the scaling model on the distribution of the fixed effects across countries could be observed for those coefficients associated with the HRL index, either on an individual level (\(\hat{\beta }_3\)) or on the school level (\(\hat{\beta }_4\)). The scaling models thus affected both the mean and the variance of the distribution.
When conducting a statistical comparison of the distribution, we used a global F-type statistic in the first step. However, none of the \(G\times z!/2!(z-2)!=168\) derived F values were statistically significant. Thus, the overall hypotheses \(\text {H}_{\mathbf{0}}: \varvec{L}_{\varvec{g}}(\varvec{\beta }_{\varvec{gw}}-\varvec{\beta }_{\varvec{gq}})={\mathbf{0}}\) cannot be rejected in any of the cases. This finding corresponds with the invariance of the observed distribution of the fixed effects \(\hat{\beta }_0\), \(\hat{\beta }_1\) and \(\hat{\beta }_2\) across scaling models: when three out of five fixed effects are virtually unaffected by the scaling procedure, no overall effects (as measured by the F-type statistic) can be expected. When we took a closer look at the results emerging from the use of the variance of the different estimated fixed effects across scaling models given the country, that is \(s^2_{\hat{\beta }_{jgz.g}}\), we found virtually no variation across the models for the estimated fixed effects \(\hat{\beta }_0\), \(\hat{\beta }_1\), and \(\hat{\beta }_2\). We can therefore assume that this lack of variation explains the results of the F-type statistic.
However, for those fixed effects that were associated with the HRL index (\(\hat{\beta }_3\) for the individual effect of the HRL index on mathematics achievement and \(\hat{\beta }_4\) for the school-level effect of HRL), we found that the scaling procedure had a strong influence. Table 8 shows the variance of the estimated fixed effect \(\hat{\beta }_3\) across scaling models calculated for each country separately. As can be seen, for each country, the measurement model used to scale the HRL index did influence the size of the estimated fixed effect. The effect was remarkably high for Iran (Islamic Rep. of), Malta, Slovenia, Czech Republic, Abu Dhabi (UAE), Qatar, and Romania: the estimated fixed effects changed by up to 10 points when we used a country-specific measurement model to scale the HRL index. The direction of this change was not always the same, however. For some countries (Malta, Slovenia, Czech Republic), the estimated fixed effects decreased from measurement Model 1 to measurement Model 4; for others [Iran (Islamic Rep. of), Abu Dhabi (UAE), Qatar, Romania], the fixed effects increased.
Overall, the variance in the estimated fixed effect \(\hat{\beta }_3\) across countries (with the scaling model held constant) decreased from \(s^2_{\hat{\beta }_{3g1.1}}=135.82\) to \(s^2_{\hat{\beta }_{3g4.4}}=102.71\) when we used the country-specific measurement models for the HRL index instead of the measurement invariance model. The differences across the countries in the observed association between the HRL index on the individual level and mathematics achievement reduced by approximately 25% when non-invariance models were used to scale the HRL index. However, for some countries (Chinese Taipei, Finland, Sweden), the influence of the scaling model on the estimated fixed effects \(\hat{\beta }_3\) was very low. This finding was not surprising because the country-specific measurement model for these countries strongly agreed with the measurement invariance model (with the exception of the indicator “number of home study supports”). As such, no variation between the fixed effects should have been observed.
Table 9 displays the distribution of the school-level effects of the HRL index \(\hat{\beta }_4\) across the scaling models. As observed for the individual effect of the HRL index, the scaling model influenced the size of the GLMM coefficients for all countries. The effect was largest for Morocco, Honduras, Iran (Islamic Rep. of), Qatar, Malta, Czech Republic, Romania, and Abu Dhabi (UAE). For these countries, the scaling model had an impact on \(\hat{\beta }_3\) and \(\hat{\beta }_4\). In addition, the effects followed the same pattern. For example, when the estimated coefficient of \(\hat{\beta }_3\) decreased from scaling Model 1 to scaling Model 4, the coefficient of \(\hat{\beta }_4\) also decreased from Model 1 to Model 4. However, the variance across countries in the estimated slope parameter \(\hat{\beta }_4\) increased slightly from scaling Model 1 to scaling Model 4 (\(s^2_{\hat{\beta }_{4g1.1}}=577.60\) to \(s^2_{\hat{\beta }_{4g4.4}}=597.73\)). Again, for those countries for which Model 4 strongly corresponded with Model 1 (i.e., Chinese Taipei, Sweden, Finland) virtually no variation between the fixed effects could be observed.
Our final step involved an analysis of the impact of the scaling procedure on the random effects of the GLMM. Tables 10 and 11 depict the distribution of the \(\varvec{G}\) matrices across the countries and scaling models. Table 12 presents the fit values of the applied structural equation models. With the exception of Sweden, the applied scaling model affected the random coefficients of the GLMM in every country. The impacts were highest for Morocco, Malta, Honduras, and Iran (Islamic Rep. of), and lowest for Australia, Chinese Taipei, Finland, Ireland, Poland, and Sweden. Hence, there seems to be a weak relationship between the influence of the scaling model on the fixed effects and the random effects, in the sense that small impacts on the fixed effects (e.g., for Chinese Taipei, Finland, Poland, Sweden) correlated slightly with small impacts on the random components of the GLMM. Nevertheless, the impact of the scaling model on the random effects, and thus on the institutional variation of the estimated relationship between the HRL index and mathematics achievement on the student level, was remarkably high.
Discussion
This paper investigated the relationships between different procedures for scaling the “home resources for learning index” (HRL) and the prediction accuracy of this index in explaining the mathematics achievement of the fourth-grade students who participated in IEA’s combined PIRLS/TIMSS survey of 2011. As work by Lüdtke et al. (2011) and van den Heuvel-Panhuizen et al. (2009) has shown, scaling social background indicators into a latent variable enhances the validity of large-scale educational assessment studies. The content validity and the reliability of such an index are usually much higher than those of single indicators. Because both aspects are particularly important within the context of cross-national comparative studies of educational achievement, using a scaled index for PIRLS/TIMSS home environment (social background) variables provided a framework that enabled meaningful cross-national comparisons.
While the scaling of the social background indicators into a latent variable is without dispute, and probably without a reasonable alternative, the assumption of measurement invariance evident in scaling the HRL index needs to be challenged. As prior research on the scaling of social background indicators into latent indices in large-scale assessments has shown, assuming a measurement invariance model across countries results in latent variables that are less reliable than those that occur when assuming measurement non-invariance (Caro and Sandoval-Hernandez 2012; Hansson and Gustafsson 2013; Lakin 2012). In our study, rescaling the HRL index with four different measurement models with different degrees of assumed measurement invariance also showed that the measurement non-invariance model fitted the data best. Thus, with respect to our first research question, we can assume that measurement invariance across participating countries for the HRL index would not hold for the Grade 4 students assessed in PIRLS/TIMSS 2011.
From a methodological perspective, we were not surprised to find that our less restrictive model (the measurement non-invariance model) was superior to our more restrictive model (the measurement invariance model) in terms of fit indices. Everything else being equal, a model where the parameters can take on any value will always fit at least as well as a model where some of the parameters are fixed to some value or are subject to constraints. It could be argued that the measurement invariance assumption is merely a practical matter because it makes cross-national comparative studies of educational achievement possible through use of a model that most parsimoniously describes the data yet also describes the data sufficiently well to explain any observed achievement differences. However, viewing this matter from the perspective of predictive validity challenges this argument. Given the general inconsistency of measurement invariance and predictive invariance that Millsap (1995, 1997, 1998, 2007) found, we could expect that the most parsimonious model (the measurement invariance model) for latent variables would affect the ability to compare the prediction coefficients of this latent variable across countries. Accordingly, with regard to the HRL index, we need to establish whether the hierarchical linear model applied by Martin et al. (2013) was sensitive to the assumption of measurement invariance.
To investigate that question, we rescaled the HRL index four times, with each scaling allowing a different degree of measurement invariance. We then introduced these indices as predictors in a generalized linear mixed model (GLMM) with mathematics achievement as the dependent variable. Overall, we observed a strong influence of the scaling model on the prediction outcomes of the GLMM. Assuming country-specific measurement models for the HRL index decreased the cross-national variance of the individual effect of the HRL index on student mathematics achievement. The variance across countries of this effect was \(s^2_{\beta _{3gz.z}}=135.82\) for the measurement invariance model. However, this variance dropped to \(s^2_{\beta _{3gz.z}}=102.71\) for the measurement non-invariance model. Accordingly, the cross-national differences of this effect, expressed in terms of the cross-national variance of \(\hat{\beta }_3\), can be reduced by approximately 25% when a measurement non-invariance model is assumed for the HRL index. This finding implies that those countries classified as unequal with respect to this effect when the measurement-invariance assumption applied, that is, Iran (Islamic Rep. of) and Slovenia, would be categorized as equal under the assumption of measurement non-invariance.
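The size of this reduction follows directly from the two cross-national variances of \(\hat{\beta }_3\) reported for Model 1 and Model 4, taking the variance under the measurement invariance model as the baseline:

\[\frac{135.82-102.71}{135.82}=\frac{33.11}{135.82}\approx 0.244.\]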
The results for the school-level effect of the HRL index were not as conclusive. Although we observed only a small difference in the cross-national variance of this effect when we compared the measurement invariance model with the country-specific and item-specific measurement model (Model 1 vs. Model 4), we found the reduction in variance was substantial when a country-specific (but not an item-specific) measurement model was assumed (Model 2), or when an item-specific (but not a country-specific) measurement model was assumed (Model 3). In both cases, the cross-national variance of the school-level effect of the HRL index reduced by about 11%. One explanation for these somewhat unpredictable results could be that the four HRL indices were scaled in the same way as in the study by Martin et al. (2013), that is, without taking the multilevel structure of the data into account. Loosely speaking, this possibility implies that the applied scaling procedure “ignored” the between-school part of the HRL index. Further research directed toward differentiating between a level-one measurement invariance assumption and a level-two measurement invariance assumption is needed. Nevertheless, application of the scaling procedure that Martin et al. used will result in school-level prediction effects of the HRL index that are obviously sensitive to the assumed degree of measurement invariance.
Although the effect of the measurement invariance assumption on cross-national comparisons of the fixed effects of the GLMM was the main focus of the present study, we also investigated country-specific differences in the effect of the measurement invariance assumption on the prediction coefficients. We were not surprised to find this effect was not constant across countries. For example, the influence of the measurement model on both the individual and the school-level HRL coefficients was relatively strong in Iran (Islamic Rep. of), Malta, Czech Republic, Abu Dhabi (UAE), Qatar, and Romania, but was relatively weak in Australia, Saudi Arabia, Chinese Taipei, Finland, and Sweden. We can express this point in another way by stating that the regression coefficients for Finland, for example, were relatively robust with respect to the different assumptions about measurement invariance, while the coefficients for Iran (Islamic Rep. of) were very sensitive with respect to the assumed scaling model. The implication of this finding is that even when only the country-specific regression coefficients are of interest, we need to take the assumed degree of measurement invariance into account when interpreting the coefficients.
We were also able to observe the country-specific effects of the measurement invariance assumption on the prediction validity of the GLMM’s random slope coefficients. In most countries, the random variance of this coefficient decreased when a non-invariance model was assumed. The fact that we can interpret the random coefficient as a measure of the school-specific effect on the relationship between the individual HRL index and mathematics achievement implies that, under the non-invariance model, differences between schools are a less suitable way of explaining the relationship between the HRL index and mathematics achievement. Accordingly, under the non-invariance assumption, we can expect that this relationship would be nearly the same in all schools of most of the participating countries, while under the measurement invariance model the relationship between the HRL index and mathematics achievement would vary across these schools. In short, researchers and others may draw completely different conclusions with respect to this effect because the nature of the effect will depend solely on the assumed measurement model.
The important point here is that the results of the hierarchical linear model that Martin et al. (2013) applied are very sensitive to the assumed degree of measurement invariance. According to Millsap’s (1995, 1997, 1998, 2007) findings, this degree of sensitivity can be expected. However, if researchers agree that using latent variables in educational research is sound practice, and if assuming measurement invariance is a necessary requirement for cross-national comparisons of latent variables, it is vital to consider the question of how researchers engaged in large-scale assessment studies can control for these effects or take them into account.
While a comprehensive answer to this question will rely on further research and on more expertise, and although the research agenda of the IEA-ETS Research Institute calls for “a more scientific approach to the development, use and interpretability of background questionnaires” (http://ierinstitute.org/researchagenda.html, Accessed 04 May 2016), we can still offer some general ideas. For example, according to Brennan’s (2001) generalizability theory, the variance in the GLMM coefficients that can be traced back to different assumptions about measurement invariance should be added to the standard errors of these coefficients. In regard to the results of the present study, this advice implies that, for example, the variance \(s^2_{\hat{\beta }_{3gz.g}}=19.96\) for Iran (Islamic Rep. of) (see Table 8) should be added to the standard error of \(\hat{\beta }_3\). Of course, more reliable estimates of this component are possible if we undertake a more exhaustive analysis where we implement a broader range of possible measurement models and also account for the random sample of students (by, for example, using bootstrapping methods).
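One way to read this advice is sketched below, assuming that the scaling-model variance is added on the variance scale, that is, to the squared standard error, before the square root is taken again. The value 19.96 is the variance reported for Iran (Islamic Rep. of); the standard error of 3.0 and the function name `inflate_standard_error` are hypothetical and serve only as an illustration.

```python
from math import sqrt


def inflate_standard_error(standard_error, scaling_model_variance):
    """Combine sampling error and scaling-model variance on the variance scale."""
    return sqrt(standard_error ** 2 + scaling_model_variance)


se_beta3 = 3.0             # hypothetical standard error of the HRL coefficient
s2_scaling_models = 19.96  # variance of the coefficient across the four scaling models

se_total = inflate_standard_error(se_beta3, s2_scaling_models)
print(round(se_total, 2))
```

Under these assumptions, the uncertainty attributable to the choice of scaling model would dominate the sampling error, which illustrates why the assumed degree of measurement invariance matters for the interpretation of the coefficient.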
Another approach that we could use to capture the dependency between measurement invariance and predictive invariance in large-scale assessment studies is the assumption of partial measurement invariance. This approach implies, for example, that measurement invariance across countries can be assumed for only some of the HRL index items and that the parameters of the other items will be left to vary freely across countries. Under this linking or equating procedure, the latent variable may still be compared across countries, while the dependency between measurement invariance and predictive invariance can be expected to decrease (if not vanish). Again, taking the present study as an example, the parameters of the HRL indicators “highest level of education of either parent” and “number of home study supports” would need to vary freely across countries, because these indicators are the ones that exhibit the highest variance in the discrimination parameter across countries (see Table 6). However, as we stated above, more exhaustive analyses are necessary before a decision as concrete as this one can be made. One requirement that would need to be in place before this degree of analysis could be implemented for the HRL is surely that of defining the item sampling space for the HRL. Achieving this requirement, in turn, implies the need to develop a theoretical framework for the HRL index that is coherent, valid, and reliable cross-nationally, but whether this aim can be credibly achieved is a moot point.
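The selection rule described in this paragraph can be sketched as follows: rank the items by the variance of their discrimination parameter across countries and free the parameters of the most variable ones. The function name, the item labels and the parameter values are hypothetical. They are not the estimates reported in Table 6, and the number of items to free is a choice the analyst would still have to justify.

```python
from statistics import variance


def items_to_free(discriminations, n_free):
    """Return the n_free items whose discrimination varies most across countries."""
    spread = {item: variance(values) for item, values in discriminations.items()}
    ranked = sorted(spread, key=spread.get, reverse=True)
    return ranked[:n_free]


# Hypothetical discrimination parameters, one value per country and item.
toy_discriminations = {
    "item_a": [1.0, 1.0, 1.0],
    "item_b": [0.5, 1.5, 1.0],
    "item_c": [0.9, 1.1, 1.0],
}
freed_items = items_to_free(toy_discriminations, n_free=2)
```

The remaining items would keep parameters that are constrained to be equal across countries and would thereby serve as the link between the country-specific scales.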
Limitations of the present study
Although our study is the first to provide a deeper insight into the relationship between measurement invariance and predictive invariance in large-scale assessment studies and thus contributes, for example, to the research agenda of the IEA-ETS Research Institute, it has some limitations. The first is the index that we used. While it made sense for us to focus on the HRL index, it could be interpreted as a formative variable. As such, studying the relationship between measurement invariance and predictive invariance with the more reflective indices that are also part of, for example, TIMSS and PIRLS, seems advisable. In addition, the applied measurement model could be more exhaustive if it took into account the multilevel structure of the data and gave consideration to scaling models that have more parameters (or dimensions). More generally, we did not know the true parameters of the models (both the scaling model and the prediction model) when we conducted our study. This lack of knowledge meant that we were unable to estimate the unbiased effect of the scaling model on the prediction coefficients. This consideration calls for implementation of another design, such as that used in simulation studies. Despite these limitations, we consider that the general inconsistency of measurement invariance and predictive invariance found in this study will remain valid even when these limitations have been satisfactorily resolved. We therefore think it safe to state that assuming measurement invariance of background indicators in cross-national studies of educational achievement is a challenge that needs to be addressed by anyone endeavoring to interpret cross-national differences in achievement.
Notes
 1.
The data sets are freely available under http://timss.bc.edu/timsspirls2011/internationaldatabase.html.
 2.
These data sets contained all necessary variables for the analysis. For a detailed description of the data sets, see Foy (2013).
 3.
Copyright © 2002–2012 SAS Institute Inc. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc., Cary, NC, USA.
 4.
Due to iteration problems, the GLMM could not be fitted to nine countries: Botswana, Dubai (UAE), Hong Kong SAR, Northern Ireland, Norway, Quebec (Canada), Russian Federation, and United Arab Emirates. The student samples from these countries were therefore not used in this study.
 5.
Note that the newly created HRL indices were not, as was the case with the original HRL index, transformed to an \(N\sim (10.03, 1.82)\) metric. Instead, we left the scaling metric \(N\sim (0,1)\) unchanged. We chose to do this because the transformation that Martin et al. (2013) applied made sense when the latent variable was measured on the same scale, that is, when measurement invariance between countries was assumed. When country-specific models were assumed for the HRL index, some equating procedures between the country-specific distributions of the HRL index first had to be applied to make the transformation of these values meaningful. However, analyzing the influence of different equating procedures on the HRL index and thus on the GLMM results was beyond the scope of this paper.
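For orientation, a linear transformation of the kind referred to in this note can be sketched as follows. The sketch assumes that 1.82 denotes the standard deviation (not the variance) of the target metric and that all scores already lie on one common scale. Both points are our assumptions, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def to_reporting_metric(scores, target_mean=10.03, target_sd=1.82):
    """Linearly map scores onto a metric with the given mean and standard deviation."""
    centre, spread = mean(scores), pstdev(scores)
    return [target_mean + target_sd * (s - centre) / spread for s in scores]


# Hypothetical scores on the untransformed scale.
rescaled = to_reporting_metric([-1.0, 0.0, 1.0])
```

Applying such a mapping to scores from country-specific models would implicitly treat them as comparable, which is why an equating step would be needed first.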
Abbreviations
 AIC:

Akaike’s information criterion
 BIC:

Bayesian information criterion
 GLMM:

generalized linear mixed model
 ET:

early literacy tasks/early numeracy tasks
 HRL:

home resources for learning
 IEA:

International Association for the Evaluation of Educational Achievement
 MAP:

maximum a posteriori probability
 PIRLS:

Progress in International Reading Literacy Study
 TIMSS:

Trends in International Mathematics and Science Study
 WLE:

weighted likelihood estimate
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans Autom Control, 19, 716–723.
Bourdieu, P. (1986). The forms of capital. In J. Richardson (Ed.), Handbook of theory and research for the sociology of education (pp. 241–258). New York: Greenwood.
Bos, W., Wendt, H., Köller, O., & Selter, C. (2012). TIMSS 2011. Mathematische und naturwissenschaftliche Kompetenzen von Grundschulkindern in Deutschland im internationalen Vergleich. Münster: Waxmann.
Brennan, R. L. (2001). Generalizability theory. New York: Springer.
Caro, D., Sandoval-Hernandez, A., & Lüdtke, O. (2014). Cultural, social and economic capital constructs: An evaluation using exploratory structural equation modeling. Sch Eff Sch Improv, 25, 433–450.
Caro, D., & Sandoval-Hernandez, A. (2012). An exploratory structural equation modeling approach to evaluate sociological theories in international large-scale assessment studies. Paper presented at the annual meeting of the American Educational Research Association, 2012.
Çetin, B. (2010). Cross-cultural structural parameter invariance on PISA 2006 student questionnaire. Eurasian J Educ Res, 38, 71–89.
Coleman, J. S. (1988). Social capital in the creation of human capital. Am J Sociol, 94, 95–120.
Fischer, G. H., & Molenaar, I. W. (1995). Rasch models. Foundations, recent developments, and applications. New York: Springer.
Foy, P. (2013). TIMSS and PIRLS 2011 user guide for the fourth grade combined international database. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College and International Association for the Evaluation of Educational Achievement (IEA).
Hansson, Å., & Gustafsson, J.-E. (2013). Measurement invariance of socioeconomic status across migrational background. Scand J Educ Res, 57, 148–166.
Karim, M. R., & Zeger, S. L. (1992). Generalized linear models with random effects; salamander mating revisited. Biometrics, 48, 631–644.
Kasper, D. (2017). Multiple group comparisons of the fixed effects from the generalized linear mixed model. (In preparation)
Lakin, J. M. (2012). Multidimensional ability tests and culturally and linguistically diverse students: Evidence of measurement invariance. Learn Individ Differ, 22, 397–403.
Lüdtke, O., Marsh, H. W., Robitzsch, A., & Trautwein, U. (2011). A 2 \(\times\) 2 taxonomy of multilevel latent contextual models: Accuracy-bias tradeoffs in full and partial error correction models. Psychol Methods, 16, 444–467.
Martin, M. O., & Mullis, I. V. S. (2012). Methods and procedures in TIMSS and PIRLS 2011. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College. http://timss.bc.edu/methods/index.html. Accessed 20 Feb 2017.
Martin, M. O., & Mullis, I. V. S. (2013). TIMSS and PIRLS 2011: Relationships among reading, mathematics, and science achievement at the fourth grade—implications for early learning. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College and International Association for the Evaluation of Educational Achievement (IEA).
Martin, M. O., Mullis, I. V. S., Foy, P., Olson, J. F., Erberber, E., & Preuschoff, C. (2008). TIMSS 2007 international science report: Findings from IEA’s trends in international mathematics and science study at the fourth and eighth grades. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Martin, M., Mullis, I. V. S., Foy, P., & Stanco, G. M. (2012). TIMSS 2011 international results in science. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Martin, M. O., Foy, P., Mullis, I. V. S., & O’Dwyer, L. M. (2013). Effective schools in reading, mathematics, and science at the fourth grade. In M. O. Martin & I. V. S. Mullis (Eds.), TIMSS and PIRLS 2011: Relationships among reading, mathematics, and science achievement at the fourth grade—implications for early learning. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College and International.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
McCulloch, C. E., & Searle, S. R. (2001). Generalized, linear, and mixed models. New York: Wiley.
Millsap, R. E. (1995). Measurement invariance, predictive invariance, and the duality paradox. Multivar Behav Res, 30, 577–605.
Millsap, R. E. (1997). Invariance in measurement and prediction: Their relationship in the singlefactor case. Psychol Methods, 2, 248–260.
Millsap, R. E. (1998). Group differences in regression intercepts: Implications for factorial invariance. Multivar Behav Res, 33, 403–424.
Millsap, R. E. (2007). Invariance in measurement and prediction revisited. Psychometrika, 72, 461–473.
Mullis, I. V. S., Martin, M. O., Kennedy, A. M., & Foy, P. (2007). PIRLS 2006 international report. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Mullis, I. V. S., Martin, M. O., Foy, P., Olson, J. F., Preuschoff, C., Erberber, E., et al. (2008). TIMSS 2007 international mathematics report: Findings from IEA’s trends in international mathematics and science study at the fourth and eighth grades. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Mullis, I. V. S., Martin, M. O., Foy, P., & Drucker, K. T. (2012a). PIRLS 2011 international results in reading. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Mullis, I. V. S., Martin, M. O., Foy, P., & Arora, A. (2012b). TIMSS 2011 international results in mathematics. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Appl Psychol Meas, 16, 159–176.
Nagengast, B., & Marsh, H. W. (2013). Motivation and engagement in science around the globe: testing measurement invariance with multigroup structural equation models across 57 countries using PISA 2006. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment. Background, technical issues, and methods of data analysis, Chap. 15 (pp. 318–344). Boca Raton: Chapman and Hall/CRC.
OECD. (2014a). PISA 2012 results: What students know and can do: student performance in mathematics, reading and science (Vol. I, Revised edition, February 2014). Paris: PISA, OECD Publishing.
OECD. (2014b). PISA 2012: Technical report. Paris: PISA, OECD Publishing.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models. Applications and data analysis methods. London: Sage Publications.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Schulte, K., Nonte, S., & Schwippert, K. (2013). Die Überprüfung von Messinvarianz in international vergleichenden Schulleistungsstudien am Beispiel der Studie PIRLS [Testing measurement invariance in international large scale assessments using the example of PIRLS data]. Zeitschrift für Bildungsforschung, 3, 99–118.
Schulz, W. (2005). Testing parameter invariance for questionnaire indices using confirmatory factor analysis and item response theory. Paper prepared for the Annual Meetings of the American Educational Research Association in San Francisco. http://files.eric.ed.gov/fulltext/ED493509.pdf. Accessed 20 Feb 2017.
Schwarz, G. (1978). Estimating the dimension of a model. Ann Stat, 6, 461–464.
Segeritz, M., & Pant, H. A. (2013). Do they feel the same way about math? Testing measurement invariance of the PISA “students’ approaches to learning” instrument across immigrant groups within Germany. Educ Psychol Meas, 73, 601–630.
Smith, D. S., Wendt, H., & Kasper, D. (2016). Social reproduction and sex in German primary schools. Compare J Comp Int Educ. doi:10.1080/03057925.2016.1158643.
van den Heuvel-Panhuizen, M., Robitzsch, A., Treffers, A., & Köller, O. (2009). Large-scale assessment of change in student achievement: Dutch primary school students’ results on written division in 1997 and 2004 as an example. Psychometrika, 74, 351–365.
Wang, S., & Wang, T. (2001). Precision of Warm’s weighted likelihood estimates for a polytomous model in computerized adaptive testing. Appl Psychol Meas, 25, 317–331.
Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Zeger, S. L., & Karim, M. R. (1991). Generalized linear models with random effects; A Gibbs sampling approach. J Am Stat Assoc, 86, 79–86.
Authors’ contributions
All authors made substantial contributions to the conception and the design of the study. In addition, HW provided the data sets for the analysis and DK conducted the analysis. DK drafted the manuscript. All authors made substantial contributions to the interpretation of the results. All authors read and approved the final manuscript.
Acknowledgements
The authors acknowledge the PIRLS/TIMSS International Study Center and Boston College for providing the technical documentation that allowed the replication of the key reference models published in Martin et al. (2013). The authors further acknowledge Wilfried Bos and the anonymous reviewers for the attention and expertise they generously shared to support the production of this paper. We finally thank Daniel Scott Smith and Paula Wagemaker for pre-submission English editing support.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Wendt, H., Kasper, D. & Trendtel, M. Assuming measurement invariance of background indicators in international comparative educational achievement studies: a challenge for the interpretation of achievement differences. Largescale Assess Educ 5, 10 (2017). https://doi.org/10.1186/s4053601700439
Keywords
 PIRLS/TIMSS combined
 Invariance background models
 Measurement and prediction invariance
 Generalized linear mixed model
 Sensitivity analyses for variance components