An IERI – International Educational Research Institute Journal

  • Software Article
  • Open access

Using plausible values when fitting multilevel models with large-scale assessment data using R


The use of large-scale assessments (LSAs) in education has grown in the past decade, though the analysis of LSAs with multilevel models (MLMs) in R has been limited. One reason may be the complexity of incorporating both plausible values and sampling weights in the multilevel analysis of LSA data. We provide additional R functions that extend the functionality of the WeMix (Bailey et al., 2023) package to allow for the automatic pooling of plausible values. In addition, functions for model comparisons using plausible values and for exporting output to different formats (e.g., Word, html) are also provided.

The use of modern large-scale assessments (LSAs) in education has grown dramatically over the years. Based on metadata from the Web of Science database, the annual number of articles published using international LSAs in education has grown from fewer than 10 in 1997 to over 300 articles per year in 2020 (Hernández-Torrano & Courtney, 2021). Commonly-used datasets include PISA (the Programme for International Student Assessment organized by the Organization for Economic Co-operation and Development [OECD]) and TIMSS (the Trends in International Mathematics and Science Study coordinated by the International Association for the Evaluation of Educational Achievement [IEA]) (Hernández-Torrano & Courtney, 2021; Laukaityte & Wiberg, 2018).

Although the LSA public-use datasets are freely and readily available for download from the respective agency websites,Footnote 1 many of the tutorials for the analyses of such data have been limited to the use of commercial software such as Mplus (Yamashita et al., 2021) or SAS (Rutkowski et al., 2010). Recent articles (e.g., Caro & Biecek, 2017; Mirazchiyski, 2021) have focused on the use of the open-source R (R Core Team, 2022) statistical software. However, most of the articles do not focus on how to use multilevel models (Raudenbush & Bryk, 2002) for the analysis of LSAs using R.

Multilevel modeling (MLM or hierarchical linear modeling or mixed effects modeling) is a well-known and highly flexible regression-based approach used for the analysis of clustered or nested data. MLM allows the variance in the outcome variable to be appropriately partitioned within and between clusters—which in itself may be a research focus of interest (e.g., how much variability in the outcome is due to the school or the student?). Research questions focusing on the variance partitioning have had a long history in educational policy research such as those found in the Coleman report (1966) which looked at the unique contributions associated with student- and school-level factors related to academic achievement. MLM can also be used for data with more than two levels, commonly used in cross-national studies, allowing researchers to look at the student, school, and country effects all in one model (e.g., Baysu et al., 2023).

Although MLM has grown in popularity over the years (Huang, 2018), using R specifically for the multilevel analyses of LSAs has been limited. This is likely due to several of the following reasons which are specific to the analyses of LSA data. First, the two commonly-used R MLM packages of lme4 (Bates, 2010) and nlme (Pinheiro et al., 2022) do not allow for the use of sampling weights at the different levels of the model. Second, if the plausible values (PVs) of the outcome measures are to be used when analyzed using either lme4 or nlme, there has been no straightforward way (i.e., simply using a function) to properly pool results, aside from manually doing this through syntax (e.g., Lorah, 2022). Third, although packages such as EdSurvey (Bailey et al., 2020) and BIFIEsurvey (Robitzsch & Oberwimmer, 2022) provide features for R users to conduct multilevel analysis using PVs and weighted data analyses, users may have an additional challenge of learning a new package which requires figuring out how to download ILSA data (which is done using the package) and then filter, select, and recode the data specifically using custom-built package functions.Footnote 2 In a detailed comparison of five R packages for the analysis of LSA data, Ringiene et al. (2022) indicated that proper data preparation using R functions specific to certain packages can be complex and may lead applied researchers to not use R. Software such as Mplus (Muthén & Muthén, 1998–2017) and HLM (Raudenbush & Congdon, 2021) are popular among LSA researchers using MLM due to their ability to accommodate both the use of weights and plausible values (Karakolidis et al., 2022). Note however that software such as Mplus and HLM still require researchers to perform all of the data management necessary to analyze the data using other software.

As R has evolved over the years, so have its data management capabilities using packages such as dplyrFootnote 3 (Wickham et al., 2020) and tidyr (Wickham, 2021), making R much more accessible to applied researchers. Researchers who are already familiar with R, are used to managing their own data, and already know how to fit multilevel models may want to simply fit the models of interest with minimal coding or without having to learn how to use a different package. The R package WeMix (Weighted mixed effect models; Bailey et al., 2023) was specifically designed to allow users to fit multilevel models (both linear and logistic regression models) with weights at different levels (such as those commonly found in LSAs) and uses standard formula notation commonly used in other R functions.Footnote 4 However, the task of fitting multiple models and pooling the output, which is required because of the use of plausible values (Mislevy et al., 1992), is still left to the users. To address this, we provide some R functions in the form of an R wrapper (a function that wraps around another function in R) that extends the functionality of WeMix to allow for the analysis of multiple datasets, the pooling of results, nested model comparisons, and the output of regression tables in a customizable and exportable format. The functions provided are specifically designed for users who already know how to obtain LSA data (i.e., download the data from the appropriate websites), are familiar with managing their data using R, and already have a background in multilevel modeling (there are several primers on the topic), but who simply want to analyze their data properly with multilevel models in R (i.e., WeMix follows the conventional mixed effects notation already used in lme4) without having to program how to pool results. We also compare results to output produced using SAS and the EdSurvey package (see Appendices).

The challenge of analyzing LSA data

Two defining characteristics of LSAs involve the use of sampling weights and plausible values. For practical and statistical purposes, the samples used for LSAs are not simple random samples; they are drawn using multistage sampling for the purpose of making inferences about the population (i.e., population estimates). In addition, when students are assessed in a particular subject area (e.g., math, reading, science), students are only assigned portions of the assessment (i.e., certain blocks or booklets) and not the assessment in its entirety. With plausible value (Mislevy et al., 1992) methodology, as students do not complete the entire assessment, student achievement is treated as missing data which needs to be properly accounted for. Statistical analyses must account for these two design characteristics of weights and plausible values to avoid biasing both the point estimates and standard errors (Laukaityte & Wiberg, 2017; Rutkowski et al., 2010). The use of weights and of plausible values is briefly described below.

Using weights

The specific details of the weighting procedures are explained in the user manuals of the particular LSAs and have been discussed in much detail in several articles (Kim et al., 2013; Meinck, 2015; Rutkowski et al., 2010). The use of sampling weights with survey data though has been “a subject of controversy among theorists” (Pfeffermann, 1993, p. 317) and findings from Monte Carlo simulations (where the true population value is known) have shown different results where the use of weights may (Mang et al., 2021) or may not matter (Laukaityte & Wiberg, 2018). However, the general recommendation in the LSA manuals is to use the weights as the objective of the analyses is to make generalizations to the population and not the sample itself (Fishbein et al., 2021; Herget et al., 2019).

With multilevel models, weights can be formed at different levels. This corresponds to the sampling design where in some assessments, within a country (or locale), schoolsFootnote 5 (level 2) are first selected (with a probability proportional to size) and then students or teachers (level 1) within schools are sampled. Not all multilevel models may require the use of weights but when working with LSAs that have complex sampling designs and inferential statistics are of interest, weights can be used (Sterba, 2009).Footnote 6 To account for the sampling design, in a multilevel framework, using weights at the different levels has been suggested (Rathbun et al., 2021). Another approach would be to use the total student weight (which is a product generally of the school and student weights which also includes some other adjustments) on its own (Zhang et al., 2020). Yet, another alternative and simpler approach when running MLMs is to only use the school-level weight at the second level without the need to specify the level-1 weight (Mang et al., 2021). As the sampling weights account for the sampling design (e.g., any stratification or oversampling) as well as adjustments for nonresponse, the use of weights is recommended (Joncas, 2007). As indicated by Snijders and Bosker, “the reason for using sampling weights is to avoid bias” (2011, p. 221).

When using weights with multilevel models, careful attention must be paid to what type of weights are being used and what the software is actually doing. Most LSAs provide an unconditional student weight for use at level 1; however, some software (e.g., SAS) may require a conditional level-1 weight, obtained by dividing the unconditional student weight by the school weight. For a discussion on and examples of how the different weights are computed, see Rutkowski et al. (2010), who note that researchers “should consult their software documentation for the appropriate application of weights at multiple levels” (p. 144).

Using plausible values

To reduce the test burden on the respondents, students participating in LSAs do not complete the entire battery of assessments. For example, with TIMSS, if a student were to take the entire assessment, this would represent more than 10 h of testing time (Rutkowski et al., 2010). Instead, students are assigned certain test booklets to complete and, because of the administration method, individual testing time is reduced to 90 min.

However, as the students do not complete the entire assessment, this can be treated as a missing data problem where missing values can be imputed (Mislevy et al., 1992). Random draws from an estimated ability distribution are repeatedly taken for every student, and these draws are referred to as plausible values (Rutkowski et al., 2010). The number of plausible values (m) differs across LSAs (e.g., five or ten). Plausible values are appropriate for making population- or subpopulation-level estimates; they are not individual scores. These values represent an ability range for each student. As a result, additional measurement error is introduced into the outcome due to the use of multiple plausible values. Thinking of the plausible values as imputed values (as used in multiple imputation to account for missing data) may be helpful.

Even though each student has m plausible values representing some latent (i.e., unobserved) ability measure, an incorrect way of analyzing the data would be to take the average of all the plausible values, or even to take just one of the values, and then fit a model (Aparicio et al., 2021). Doing so will generally result in underestimated standard errors, which do not account for the variability in results across the m analyses. Instead, models should be fit m times, each with one of the plausible values as the outcome. As a result of the differing values, regression coefficients and standard errors will fluctuate slightly from model to model. The results of the m analyses should then be pooled using Rubin’s (2004) rules so that in the end, only one set of results is reported.
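The "fit m times" step can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data are simulated, the names dat, pv_names, and fits are hypothetical, and lm is used as a simple stand-in for a weighted multilevel model.

```r
# Simulated stand-in data: 200 students, one predictor, five plausible values.
set.seed(123)
n <- 200
dat <- data.frame(escs = rnorm(n))
true_score <- 500 + 30 * dat$escs + rnorm(n, sd = 80)
pv_names <- paste0("pv", 1:5, "math")
for (pv in pv_names) {
  # Each plausible value is the latent score plus its own random draw.
  dat[[pv]] <- true_score + rnorm(n, sd = 30)
}

# Fit the same model once per plausible value (lm is only a stand-in here).
fits <- lapply(pv_names, function(pv) {
  lm(reformulate("escs", response = pv), data = dat)
})

# Collect the m slope estimates and standard errors; they differ slightly.
est <- sapply(fits, function(f) coef(summary(f))["escs", "Estimate"])
se  <- sapply(fits, function(f) coef(summary(f))["escs", "Std. Error"])
round(cbind(est, se), 2)
```

The five rows of the resulting table are the inputs that would then be pooled; reporting any single row, or a model fit to the averaged plausible values, would ignore the variability between rows.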

Rutkowski et al. (2010) provide an example showing how results can differ using only one value, an average set of values, and a properly pooled set of results. It is likely in this stage of the analysis where applied researchers may have some difficulty as even with software such as SAS, data have to be converted to a long format, analyzed multiple times, and then pooled appropriately. Note that the handling of plausible values is only of importance if the assessment measure is used as some analyses may not focus on the ability measures (e.g., focus is on bullying; Smith & López-Castro, 2017).

Pooling results: estimates and standard errors

Rubin’s (2004) method for pooling results has long been used with multiply-imputed data to account for imputation variability. For a regression model, pooling the regression coefficient b is straightforward: the pooled estimate is merely the average of the coefficients (\(\overline{b }\)) from the m analyses. The standard error, which captures the uncertainty of the estimate, takes slightly more work to compute and is not the simple average of the standard errors.

Using formulas adapted from Schafer and Olsen (1998, p. 557), the pooled standard errors are made up of the within (\({\overline{U}}_{b}\)) and between imputation (\({B}_{b}\)) variance for each b coefficient. The within imputation variance of a regression coefficient is \({\overline{U}}_{b}=\frac{\sum_{i=1}^{m}{SE}_{{b}_{i}}^{2}}{m}\), which is the average of the squared standard errors over the m sets of analyses. The between imputation variance is \({B}_{b}=\frac{\sum_{i=1}^{m}{\left({b}_{i}-\overline{b}\right)}^{2}}{m-1}\), which is the variance of the regression coefficients over the m sets of analyses. Combining the two sources of variance results in \({T}_{b}={\overline{U}}_{b}+\left(1+\frac{1}{m}\right){B}_{b}\), and the pooled standard error is \(S{E}_{b}=\sqrt{{T}_{b}}\).

The pooled estimate (\(\overline{b}\)) is then divided by its pooled standard error to obtain the corresponding t-statistic. The degrees of freedom (df) for the b coefficient are computed as: \(df=\left(m-1\right){\left(1+\frac{m{\overline{U} }_{b}}{\left(m+1\right){B}_{b}}\right)}^{2}\). The p-values can then be evaluated using the t-statistic with the corresponding df. The subscript b indicates that this is computed for each b coefficient.
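The pooling formulas in this section can be written as a short base R function. This is a minimal sketch rather than the code used by the functions introduced in this article: the function name pool_rubin and the input values are hypothetical and chosen only for illustration.

```r
# Pool one coefficient across m analyses using the formulas in this section.
# b:  vector of the m coefficient estimates
# se: vector of the m standard errors
pool_rubin <- function(b, se) {
  m     <- length(b)
  b_bar <- mean(b)                    # pooled estimate
  U_bar <- mean(se^2)                 # within-imputation variance
  B     <- var(b)                     # between-imputation variance (divisor m - 1)
  T_b   <- U_bar + (1 + 1 / m) * B    # total variance
  se_pooled <- sqrt(T_b)
  df    <- (m - 1) * (1 + (m * U_bar) / ((m + 1) * B))^2
  t_stat <- b_bar / se_pooled
  p     <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
  c(estimate = b_bar, se = se_pooled, t = t_stat, df = df, p = p)
}

# Made-up estimates and standard errors from m = 5 analyses.
b  <- c(10.2, 9.8, 10.5, 10.1, 9.9)
se <- c(1.1, 1.0, 1.2, 1.1, 1.0)
res <- pool_rubin(b, se)
res
```

With these inputs the pooled estimate is the simple mean of the five coefficients, while the pooled standard error is larger than the square root of the average squared standard error because the between-imputation variance is added.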

Pooling results: likelihood ratio tests

A common method for evaluating improvements in multilevel model fit uses a likelihood ratio test (LRT) that compares two nested models (i.e., a full and a reduced, or restricted, model) with each other. A reduced model is nested within a full model if the reduced model can be obtained by excluding parameters to be estimated from the full model. The difference in deviance statistics (deviance = −2 \(\times\) log-likelihood or −2LL) between the two models (i.e., \(\Delta d=-2L{L}_{\text{REDUCED}}-\left(-2L{L}_{\text{FULL}}\right)\)) is evaluated using a \({\chi }^{2}\) statistic with k degrees of freedom where k represents the difference in the number of parameters estimated in the full and the reduced models. A statistically significant result would indicate a better fit of the full model and a nonstatistically significant result would suggest that the simpler, more parsimonious model would suffice. However, there are several pooling approaches for LRTs to choose from and the computation is not as straightforward as the approach to pooling the estimates and standard errors (see Grund et al., 2023 for a comparison of three approaches).

We show the computation for the pooled statistic proposed by Li et al. (1991), referred to as the D2 statistic by Schafer (1997, Eq. 4.40), which involves pooling the \({\chi }^{2}\) statistics from the m models analyzed using the different plausible values. The D2 statistic is calculated using \({D}_{2}=\frac{\frac{\overline{d}}{k }-\frac{m+1}{m-1}{r}_{m}}{1+{r}_{m}}\), where \(\overline{d }\) is the average Δd statistic from the m models and \({r}_{m}\) is an estimate of the average relative increase in variance as a result of missing (i.e., imputed) values. The formula is \({r}_{m}=\left(1+\frac{1}{m}\right)\left(\frac{\sum_{i=1}^{m}{\left(\sqrt{{d}_{i}}-\overline{\sqrt{d} }\right)}^{2}}{m-1}\right)\), where \({d}_{i}\) is the Δd statistic from the ith model and \(\overline{\sqrt{d} }=\frac{\sum_{i=1}^{m}\sqrt{{d}_{i}}}{m}\) is the average of the square roots of the Δd statistics. Although the second part of the \({r}_{m}\) equation may look complicated, it is merely the variance of the square roots of the Δd statistics over the m models.

The D2 statistic is evaluated using an F distribution with k df for the numerator and \({v}_{2}\) df for the denominator, where \({v}_{2}={k}^{-3/m}\left(m-1\right){\left(1+\frac{1}{{r}_{m}}\right)}^{2}\) (see Schafer, 1997, Eq. 4.41). The F distribution used for the D2 corresponds to the \({\chi }^{2}\) distribution for LRTs using complete data but accounts for the number of imputations (m plausible values) used (Grund et al., 2023). Although the D2 statistic has been found to result in somewhat higher levels of Type I errors, this is an issue with smaller sample sizes (e.g., n = 100) (Grund et al., 2023), which is not the case when analyzing LSAs, which typically have thousands of observations and over a hundred clusters.
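As a rough illustration of the D2 computation, the steps can be collected in a small base R function. The function name pool_d2 and the deviance differences are hypothetical, and the denominator df is assumed to take the form \(k^{-3/m}(m-1)(1+1/r_m)^2\) given by Li et al. (1991); this is a sketch, not the lrtPV implementation.

```r
# Pool m likelihood ratio (deviance difference) statistics with the D2 method.
# d: vector of the m deviance differences (each reduced minus full, so >= 0)
# k: difference in the number of estimated parameters between the two models
pool_d2 <- function(d, k) {
  m     <- length(d)
  d_bar <- mean(d)
  r_m   <- (1 + 1 / m) * var(sqrt(d))   # relative increase in variance
  D2    <- (d_bar / k - ((m + 1) / (m - 1)) * r_m) / (1 + r_m)
  v2    <- k^(-3 / m) * (m - 1) * (1 + 1 / r_m)^2   # assumed Li et al. form
  p     <- pf(D2, df1 = k, df2 = v2, lower.tail = FALSE)
  c(D2 = D2, df1 = k, df2 = v2, r_m = r_m, p = p)
}

# Made-up deviance differences from m = 5 pairs of models, with k = 2.
d <- c(12.1, 14.3, 11.8, 13.5, 12.9)
res <- pool_d2(d, k = 2)
res
```

When the m deviance differences are very similar, \(r_m\) is close to zero and D2 approaches \(\overline{d}/k\); the more the statistics vary across plausible values, the more D2 is shrunk and the smaller the denominator df becomes.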

Some researchers though may want to use information criterion measures, such as the Akaike information criterion (AIC), to assess the quality of competing statistical models, with lower values indicating better model fit. Combining these measures as a result of multiple models, which are predicated on using the same dataset, is not clear (Grund et al., 2016). Some though have suggested using the average of the AIC measures resulting from m datasets or creating and analyzing an averaged dataset using the m complete datasets (Schomaker et al., 2010 as cited in Consentino & Claeskens, 2010, p. 2294). Using a simulation, Consentino and Claeskens showed that different pooling approaches for the AIC performed similarly. Though commonly done, we caution against the use of information criterion measures which may actually not perform well when selecting the best fitting models (e.g., Ferron et al., 2002; Gelman & Rubin, 1994; Vallejo et al., 2008).

The current study

To reduce the complexity in the analysis of LSA data using multilevel models, we provide several freely downloadable functions to be used together with the WeMix package (Bailey et al., 2023) for R. The following functions are provided:

  • mixPV: for the analysis using plausible values using the mix function in WeMix.

  • summary: for generating the pooled output resulting from mixPV.

  • summary_all: for viewing the MLM output for each plausible value.

  • lrtPV: for conducting model comparisons using models fit using plausible values.

  • glance: for viewing summary statistics.

In addition, “helper” functionsFootnote 7 are provided that make the output readily exportable—in formats such as Word or html, using the modelsummary (Arel-Bundock et al., 2022) package.

Data analysis

We extend the example in the WeMix vignetteFootnote 8 that used PISA 2012 data from the United States (USA) but only used the first plausible value for math (pv1math). For the current analyses, five plausible values (i.e., pv1math, pv2math, pv3math, pv4math, pv5math) will be used. The data are publicly available; researchers need to select only the data from the USA and merge the student and school data files using the schoolid variable.Footnote 9 The files are merged as required by standard multilevel modeling software and we do so in this example in order to use both student- and school-level predictors. The predictors used for the current example are shown in Table 1.

Table 1 Descriptive statistics

The student- and school-level weights at level one and level two are w_fstuwt and w_fschwt, respectively. These weights are provided in the PISA dataset and are referred to as unconditional weights which can be used directly (i.e., without alteration) with the mix function when specifying the weights at two levels. To compute conditional student weights (which are used by some software), the total student weight (w_fstuwt) can be divided by the school weight (w_fschwt).Footnote 10 Of the observations without missing data, there were 3,136 students nested in 157 schools.
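The relationship between the unconditional and conditional student weights can be shown with a toy example. The data frame toy and the column cond_wt are hypothetical and the weight values are invented; only the weight variable names follow the PISA file described above.

```r
# Toy data: two schools, with weights named as in the PISA file.
toy <- data.frame(
  schoolid = c(1, 1, 2, 2),
  w_fstuwt = c(120, 150, 80, 100),  # total (unconditional) student weight
  w_fschwt = c(30, 30, 20, 20)      # school weight
)

# Conditional student weight = total student weight / school weight.
toy$cond_wt <- toy$w_fstuwt / toy$w_fschwt
toy
```

The conditional weight can be read as the number of students that each sampled student represents within his or her own school.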

Using composite notation, the random intercepts model can be expressed as: \({Y}_{ij}={\gamma }_{00}+{\gamma }_{01}LM{T}_{j}^{VL}+{\gamma }_{02}LM{T}_{j}^{SE}+{\gamma }_{03}LM{T}_{j}^{AL}+{\gamma }_{10}LF{M}_{ij}^{A}+{\gamma }_{20}L{FM}_{ij}^{D}+{\gamma }_{30}LF{M}_{ij}^{SD}+{\gamma }_{40}MAL{E}_{ij}+{\gamma }_{50}ESC{S}_{ij}+{u}_{0j}+{r}_{ij}\), where \({Y}_{ij}\) is the outcome for student i in school j; \(LM{T}_{j}^{VL-AL}\) represent three dummy codes for the school-level variable for the lack of qualified math teachers; \(LF{M}_{ij}^{A-SD}\) represent three dummy codes for the student-level variable for looking forward to math lessons; \(MAL{E}_{ij}\) is a dummy code for student gender; and \(ESC{S}_{ij}\) is the continuous student-measure of socioeconomic status. The error term \({u}_{0j}\) captures the variability of the outcome between schools and \({r}_{ij}\) is the student-level error term.

The standard method of fitting a random intercepts model using the mix function (without plausible values) in WeMix can be done with the following specification:

figure a
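A minimal sketch of such a specification is given below. The predictor names st29q03, sc14q02, and st04q01 are assumptions taken from the WeMix vignette rather than from this text, and the object names f1 and m_single are hypothetical. The model is only fit if the WeMix and MLMusingR packages are installed.

```r
# Random intercepts model for the first plausible value only.
# Assumed predictor names (WeMix vignette): st29q03 = looks forward to math
# lessons, sc14q02 = lack of qualified math teachers, st04q01 = gender.
f1 <- pv1math ~ st29q03 + sc14q02 + st04q01 + escs + (1 | schoolid)

if (requireNamespace("WeMix", quietly = TRUE) &&
    requireNamespace("MLMusingR", quietly = TRUE)) {
  data(pisa2012, package = "MLMusingR")
  # Weights are given as level 1 (student) first, then level 2 (school).
  m_single <- WeMix::mix(f1, data = pisa2012,
                         weights = c("w_fstuwt", "w_fschwt"))
  print(summary(m_single))
}
```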

However, to use the five plausible values in the analysis, we specify each of the plausible values as dependent variables (on the left-hand side of the equation) in one model using the new mixPV function. There is no need to reshape the data from a wide to long format as may be required by other software (e.g., SAS). To generate results and to have the output properly formatted, the broom package (Robinson et al., 2022) needs to be installed as well using install.packages('broom'). The newly introduced functions can be loaded using the source function:

figure b
figure c
figure d
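A sketch of what such a call might look like is shown below. It assumes that mixPV has already been loaded with source() and that the pisa2012 data frame is in the workspace; the predictor names are again assumptions from the WeMix vignette, and f_pv and pv_names are hypothetical object names.

```r
# All five plausible values appear on the left-hand side, joined with "+".
f_pv <- pv1math + pv2math + pv3math + pv4math + pv5math ~
  st29q03 + sc14q02 + st04q01 + escs + (1 | schoolid)

# The outcomes named on the left-hand side, each of which is fit in turn.
pv_names <- all.vars(f_pv[[2]])
pv_names

# Only runs if mixPV and the data are available in the workspace.
if (exists("mixPV") && exists("pisa2012")) {
  m0 <- mixPV(f_pv, data = pisa2012,
              weights = c("w_fstuwt", "w_fschwt"))
  print(summary(m0))  # pooled results across the five fits
}
```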

Note that the only difference in the specification is the use of multiple values for the dependent variable (i.e., pv1math + pv2math + pv3math + pv4math + pv5math) together with the mixPV function. The output shows both the combined random and fixed effects using the point estimates, standard errors, the t-statistics, degrees of freedom, and the p-values computed and combined using Rubin’s (2004) rules. By default, the standard errors reported are also the robust standard errors (Liang & Zeger, 1986) which account for heterogeneity of variance violations (Huang et al., 2022). For a more detailed discussion on robust standard errors and their computation in the context of mixed models, readers can consult Huang et al. (2022).

Although the mixPV function is run once, the model is fit five times using mix, once for each plausible value specified. If the user wants to see the result of each analysis separately, summary_all(m0) can be used. The glance(m0) function can also be used to view the number of observations, the number of plausible values used, and the average AIC and BIC statistics:

figure e

Following the original WeMix vignette, a random slope for escs can also be specified in the standard manner (as done in lmer and lme) and in this case the variable escs is allowed to randomly vary by school by including (escs|schoolid).

figure f
figure g
figure h

Although the output for the random slope (schoolid.escs) includes the p-value of the associated Wald test (i.e., p < .001), the use of a likelihood ratio test (LRT) is often recommended when testing variance components (Berkhof & Snijders, 2001). To compare models with and without a random slope using a likelihood ratio test, the lrtPV function can be used by specifying the fitted full and reduced models (note that the order has to be the full model first and the reduced model second):

figure i
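The comparison might be set up along the following lines. This sketch assumes that mixPV and lrtPV have been loaded with source() and that pisa2012 is in the workspace; the predictor names are assumptions from the WeMix vignette, and f_reduced and f_full are hypothetical object names.

```r
# Reduced model: random intercept only.
f_reduced <- pv1math + pv2math + pv3math + pv4math + pv5math ~
  st29q03 + sc14q02 + st04q01 + escs + (1 | schoolid)

# Full model: adds a random slope for escs, which adds two parameters
# (the slope variance and the intercept-slope covariance).
f_full <- pv1math + pv2math + pv3math + pv4math + pv5math ~
  st29q03 + sc14q02 + st04q01 + escs + (escs | schoolid)

# Only runs if the functions and the data are available in the workspace.
if (exists("mixPV") && exists("lrtPV") && exists("pisa2012")) {
  m0 <- mixPV(f_reduced, data = pisa2012,
              weights = c("w_fstuwt", "w_fschwt"))
  m1 <- mixPV(f_full, data = pisa2012,
              weights = c("w_fstuwt", "w_fschwt"))
  print(lrtPV(m1, m0))  # full model first, reduced model second
}
```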

The likelihood ratio test, based on the D2 statistic (Li et al., 1991), indicates that the model fits better with the random slope, F(2, 2.65) = 98.3, p < .01.

Finally, several model results can be shown side-by-side using the modelsummary function (from the package of the same name). The results can be shown using:

figure j

Note that the output table in Fig. 1 is shown ‘as-is’ and needs some editing to get it ready for publication (e.g., indicating that these are robust standard errors in parentheses, indicating the reference group for the categorical variables, separating the random effects, adding other notes). The modelsummary function has many useful options for formatting the output (e.g., showing confidence intervals, placing estimates and standard errors beside each other, controlling the number of digits to show), and extensive documentation for the function is available online. If the random effects are to be hidden (so that only fixed effects are shown), the coef_omit = 'schoolid|Residual' option can be added (the patterns within the quotation marks, separated by the pipe character [|], are matched against the term names, and matching terms are hidden). By default, the output is displayed onscreen, but if the user instead wants to save the output to a Word file, the option out = 'results.docx' can be specified (other formats include jpg, html, tex). As a basis for comparison, model results using plausible values and weights analyzed using both SAS proc glimmix and the EdSurvey package are shown in the appendix, and the results are similar.
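How the coef_omit pattern selects terms can be illustrated with base R pattern matching. The term names below are hypothetical examples (only schoolid.escs is mentioned in the text), and the final call is a sketch that runs only if modelsummary is installed and fitted models named m0 and m1 already exist.

```r
# Hypothetical term names as they might appear in the output table.
term_names <- c("(Intercept)", "escs", "schoolid.(Intercept)",
                "schoolid.escs", "Residual")

# Any term matching "schoolid" or "Residual" would be hidden.
omit_pattern <- "schoolid|Residual"
kept <- term_names[!grepl(omit_pattern, term_names)]
kept

# Sketch of the corresponding call, shown onscreen with random effects hidden.
if (requireNamespace("modelsummary", quietly = TRUE) &&
    exists("m0") && exists("m1")) {
  print(modelsummary::modelsummary(
    list("Random intercept" = m0, "Random slope" = m1),
    coef_omit = omit_pattern
  ))
}
```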

Fig. 1 Output using the modelsummary function


The current manuscript demonstrates additional functions that extend the use of the WeMix package to allow for the pooling of MLM results from models using plausible values. Such a feature is required for the proper analysis of LSA data with outcomes that use plausible values. In addition, functions are introduced that allow for model comparisons using likelihood ratio tests and allow results to be exported into other formats for easier editing.

Availability of data and materials

I used PISA public-use data. The data may also be loaded in R by entering: data(pisa2012, package = 'MLMusingR') as long as the MLMusingR package is installed.


  1. For example: or

  2. Obtaining LSA data may be challenging for some and may require having access to either SAS or SPSS. The use of the R packages for obtaining the data helps reduce the burden on the users who may not have access to the necessary commercial software.

  3. As a sign of its popularity, based on results from the packageRank package, as of 2023.11.27, dplyr was the 10th most downloaded package out of 19,625 R packages on CRAN.

  4. Note that the BIFIEsurvey package (Robitzsch & Oberwimmer, 2022) can fit multilevel linear models but only allows for two-level models.

  5. This also depends on the country. For example, in a small country such as Singapore, all schools are selected so the corresponding school weight is 1.0.

  6. For a discussion on model-, design-, and hybrid-based approaches to analysis, readers can consult Sterba (2009).

  7. A function that allows one function to work with other functions.


  9. When importing SPSS files into R, users can use the rio::import() function. Although the haven::read_sav() function may work, WeMix may have issues with the labels used in haven. The variable labels may be removed using the haven::zap_labels() function. The combined R data file can also be accessed using

     > data(pisa2012, package = 'MLMusingR') # from package version 0.3.2

     or

     > pisa2012 <- rio::import("")

  10. By default, this does not have to be done when using the mix function. However, if conditional weights are used, this option can be set by using the mix function and including the option cWeights = TRUE. The conditional weight in the dataset is variable pwt1.


  • Aparicio, J., Cordero, J. M., & Ortiz, L. (2021). Efficiency analysis with educational data: how to deal with plausible values from international large-scale assessments. Mathematics, 9(13), 1579.

  • Arel-Bundock, V., Gassen, J., Eastwood, N., Huntington-Klein, N., Schwarz, M., Elbers, B., McDermott, G., & Wallrich, L. (2022). modelsummary: Summary tables and plots for statistical models and data: Beautiful, customizable, and publication-ready (1.2.0) [Computer software].

  • Bailey, P., Kelley, C., Nguyen, T., & Huo, H. (2023). WeMix: Weighted mixed-effects models using multilevel pseudo maximum likelihood estimation.

  • Bailey, P., Lee, M., Nguyen, T., & Zhang, T. (2020). Using EdSurvey to analyse PIAAC data. In D. B. Maehler & B. Rammstedt (Eds.), Large-scale cognitive assessment (pp. 209–237). Springer International Publishing.

  • Bates, D. M. (2010). lme4: Mixed-effects modeling with R. Springer.

  • Baysu, G., Agirdag, O., & De Leersnyder, J. (2023). The association between perceived discriminatory climate in school and student performance in math and reading: A cross-national analysis using PISA 2018. Journal of Youth and Adolescence, 52(3), 619–636.

  • Berkhof, J., & Snijders, T. A. (2001). Variance component testing in multilevel models. Journal of Educational and Behavioral Statistics, 26(2), 133–152.

  • Caro, D. H., & Biecek, P. (2017). intsvy: An R package for analyzing international large-scale assessment data. Journal of Statistical Software, 81, 1–44.

  • Coleman, J., Campbell, E., Hobson, C., McPartland, J., Mood, A., Weinfield, F., & York, R. (1966). Equality of educational opportunity. Government Printing Office.

  • Consentino, F., & Claeskens, G. (2010). Order selection tests with multiply imputed data. Computational Statistics & Data Analysis, 54(10), 2284–2295.

  • Ferron, J., Dailey, R., & Yi, Q. (2002). Misspecifying the first-level error structure in two-level models of change. Multivariate Behavioral Research, 37(3), 379–403.

  • Fishbein, B., Foy, P., & Yin, L. (2021). TIMSS 2019 user guide for the international database (2nd edn). TIMSS & PIRLS International Study Center.

  • Gelman, A., & Rubin, D. B. (1994). Avoiding model selection in Bayesian social research. Sociological Methodology, 25, 165–173.

  • Grund, S., Lüdtke, O., & Robitzsch, A. (2016). Multiple imputation of multilevel missing data: An introduction to the R package pan. SAGE Open, 6(4), 2158244016668220.

  • Grund, S., Lüdtke, O., & Robitzsch, A. (2023). Pooling methods for likelihood ratio tests in multiply imputed data sets. Psychological Methods.

  • Herget, D., Dalton, B., Kinney, S., Smith, W. Z., Wilson, D., & Rogers, J. (2019). US PIRLS and ePIRLS 2016 technical report and user’s guide. NCES 2019-113. National Center for Education Statistics.

  • Hernández-Torrano, D., & Courtney, M. G. R. (2021). Modern international large-scale assessment in education: An integrative review and mapping of the literature. Large-Scale Assessments in Education, 9(1), 17.

  • Huang, F. L. (2018). Multilevel modeling myths. School Psychology Quarterly, 33(3), 492–499.

  • Huang, F. L., Wiedermann, W., & Zhang, B. (2022). Accounting for heteroskedasticity resulting from between-group differences in multilevel models. Multivariate Behavioral Research.

  • Joncas, M. (2007). PIRLS 2006 sampling weights and participation rates. In M. Martin, I. Mullis, & A. Kennedy (Eds.), PIRLS 2006 Technical report (pp. 105–130). TIMSS & PIRLS International Study Center.

  • Karakolidis, A., Pitsia, V., & Cosgrove, J. (2022). Multilevel modelling of international large-scale assessment data. In M. S. Khine (Ed.), Methodology for multilevel modeling in educational research (pp. 141–159). Springer Singapore.

  • Kim, J.-S., Anderson, C. J., & Keller, B. (2013). Multilevel analysis of assessment data. Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis, 389–425.

  • Laukaityte, I., & Wiberg, M. (2017). Using plausible values in secondary analysis in large-scale assessments. Communications in Statistics - Theory and Methods, 46(22), 11341–11357.

  • Laukaityte, I., & Wiberg, M. (2018). Importance of sampling weights in multilevel modeling of international large-scale assessment data. Communications in Statistics - Theory and Methods, 47(20), 4991–5012.

  • Li, K.-H., Meng, X.-L., Raghunathan, T. E., & Rubin, D. B. (1991). Significance levels from repeated p-values with multiply-imputed data. Statistica Sinica, 65–92.

  • Liang, K.-Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13–22.

  • Lorah, J. (2022). Analyzing large-scale assessment data with multilevel analyses: Demonstration using the Programme for International Student Assessment (PISA) 2018 data. In M. S. Khine (Ed.), Methodology for multilevel modeling in educational research (pp. 121–139). Springer Singapore.

  • Mang, J., Küchenhoff, H., Meinck, S., & Prenzel, M. (2021). Sampling weights in multilevel modelling: An investigation using PISA sampling structures. Large-Scale Assessments in Education, 9(1), 6.

  • Meinck, S. (2015). Computing sampling weights in large-scale assessments in education. Survey Methods: Insights from the Field, 1–13.

  • Mirazchiyski, P. V. (2021). RALSA: The R analyzer for large-scale assessments. Large-Scale Assessments in Education, 9, 1–24.

  • Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29(2), 133–161.

  • Muthén, L., & Muthén, B. (1998). Mplus user’s guide (8th ed.). Muthén & Muthén.

  • Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. International Statistical Review / Revue Internationale de Statistique.

  • Pinheiro, J., Bates, D., & R Core Team. (2022). nlme: Linear and nonlinear mixed effects models.

  • R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing.

  • Rathbun, A., Huang, F., Meinck, S., Park, B., Ikoma, S., & Zhang, Y. (2021). Multilevel modeling with large-scale international datasets. American Educational Research Association, Virtual conference.

  • Raudenbush, S., & Bryk, A. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Sage.

  • Raudenbush, S., & Congdon, R. (2021). HLM 8: Hierarchical linear and nonlinear modeling (Version 8) [Computer software]. Scientific Software International, Inc.

  • Ringienė, L., Žilinskas, J., & Jakaitienė, A. (2022). ILSA data analysis with R packages. Modelling, Computation and Optimization in Information Systems and Management Sciences: Proceedings of the 4th International Conference on Modelling, Computation and Optimization in Information Systems and Management Sciences-MCO 2021 4, 271–282.

  • Robinson, D., Hayes, A., & Couch, S. (2022). broom: Convert statistical objects into tidy tibbles.

  • Robitzsch, A., & Oberwimmer, K. (2022). BIFIEsurvey: Tools for survey statistics in educational assessment.

  • Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys (Vol. 81). Wiley.

  • Rutkowski, L., Gonzalez, E., Joncas, M., & Von Davier, M. (2010). International large-scale assessment data: Issues in secondary analysis and reporting. Educational Researcher, 39(2), 142–151.

  • Schafer, J. L. (1997). Analysis of incomplete multivariate data. CRC Press.

  • Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data problems: A data analyst’s perspective. Multivariate Behavioral Research, 33(4), 545–571.

  • Smith, P. K., & López-Castro, L. (2017). Cross-national data on victims of bullying: How does PISA measure up with other surveys? International Journal of Developmental Science, 11(3–4), 87–92.

  • Snijders, T. A. B., & Bosker, R. J. (2011). Multilevel analysis: An introduction to basic and advanced multilevel modeling. SAGE.

  • Sterba, S. K. (2009). Alternative model-based and design-based frameworks for inference from samples to populations: From polarization to integration. Multivariate Behavioral Research, 44(6), 711–740.

  • Vallejo, G., Ato, M., & Valdés, T. (2008). Consequences of misspecifying the error covariance structure in linear mixed models for longitudinal data. Methodology, 4(1), 10–21.

  • Wickham, H. (2021). tidyr: Tidy messy data.

  • Wickham, H., François, R., Henry, L., & Müller, K. (2020). dplyr: A grammar of data manipulation.

  • Yamashita, T., Smith, T. J., & Cummins, P. A. (2021). A practical guide for analyzing large-scale assessment data using Mplus: A case demonstration using the program for international assessment of adult competencies data. Journal of Educational and Behavioral Statistics, 46(4), 501–518.

  • Zhang, T., Bailey, P., & Lee, M. (2020). Using EdSurvey to analyze TIMSS data.


Acknowledgements

No further known acknowledgements should be stated.


Funding

No funding was received for this work.

Author information

Authors and Affiliations



The author read and approved the final manuscript.

Corresponding author

Correspondence to Francis L. Huang.

Ethics declarations

Ethics approval and consent to participate

The present study worked with previously collected PISA data. Therefore, the source data is already anonymized, free, and publicly available. Consequently, ethics approval for this study was not requested.

Consent for publication

Not applicable.

Competing interests

The author reports no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendix A. Two-level multilevel model results using five plausible values and weights analyzed using SAS proc glimmix (n = 3136)

| Variable | RI | RS |
|---|---|---|
| (row label not recovered) | −11.287+ | −10.533+ |
| (row label not recovered) | −19.526** | −17.441** |
| st29q03Strongly disagree¹ | −39.929*** | −36.910*** |
| sc14q02Very little² | −22.549 | −22.702 |
| sc14q02To some extent² | −17.508 | −16.158 |
| sc14q02A lot² | −27.257*** | −44.538*** |
| \(\overline{AIC}\) | (not recovered) | (not recovered) |
| \(\overline{BIC}\) | (not recovered) | (not recovered) |

Note. RI = random intercepts model; RS = random slope model. Robust standard errors within parentheses. ¹Strongly agree is the reference level. ²Not at all is the reference level. ³Female is the reference level. +p < 0.10, *p < 0.05, **p < 0.01, ***p < 0.001.

Appendix B. Two-level multilevel model results using five plausible values and weights analyzed using the EdSurvey package in R (n = 3136)

| Variable | RI | RS |
|---|---|---|
| (row label not recovered) | −11.286 | −10.536 |
| (row label not recovered) | −19.534 | −17.404 |
| st29q03Strongly disagree¹ | −39.949 | −36.881 |
| sc14q02Very little² | −22.927 | −23.178 |
| sc14q02To some extent² | −17.605 | −13.748 |
| sc14q02A lot² | −30.050 | −35.744 |

Note. RI = random intercepts model; RS = random slope model. Robust standard errors within parentheses. ¹Strongly agree is the reference level. ²Not at all is the reference level. ³Female is the reference level. p-values are not shown when using the EdSurvey package.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Huang, F.L. Using plausible values when fitting multilevel models with large-scale assessment data using R. Large-scale Assess Educ 12, 7 (2024).
