- Research | Open Access
On imputation for planned missing data in context questionnaires using plausible values: a comparison of three designs
- David Kaplan^{1} and
- Dan Su^{1}
https://doi.org/10.1186/s40536-018-0059-9
© The Author(s) 2018
Received: 11 October 2017
Accepted: 11 June 2018
Published: 20 June 2018
Abstract
Background
This paper extends a recent study by Kaplan and Su (J Educ Behav Stat 41: 51–80, 2016) examining the problem of matrix sampling of context questionnaire scales with respect to the generation of plausible values of cognitive outcomes in large-scale assessments.
Methods
Following Weirich et al. (Nested multiple imputation in large-scale assessments. In: Large-scale assessments in education, 2. http://www.largescaleassessmentsineducation.com/content/2/1/9, 2014) we examine single + multiple imputation and multiple + multiple imputation methods using predictive mean matching imputation under three different context questionnaire matrix sampling designs: a two-form design studied by Adams et al. (On the use of rotated context questionnaires in conjunction with multilevel item response models. In: Large-scale assessments in education. http://www.largescaleassessmentsineducation.com/content/1/1/5, 2013), a three-form design implemented in PISA 2012, and a partially balanced incomplete block design studied by Kaplan and Su (J Educ Behav Stat 41: 51–80, 2016).
Results
Our results show that the choice of design has a larger impact on the reduction of bias than the choice of imputation method. Specifically, the three-form design used in PISA 2012 yields considerably less bias compared to the two-form design and the partially balanced incomplete design. We further show that the partially balanced incomplete block design produces less bias than the two-form design despite having the same amount of missing data.
Conclusions
We discuss the results in terms of implications for the design of context questionnaires in large-scale assessments.
Introduction
A recent paper by Kaplan and Su (2016) investigated the problem of matrix sampling of context questionnaires with respect to the generation of the plausible values (PVs) of the so-called “cognitive” tests in large-scale educational assessments. Drawing on earlier work by Adams et al. (2013) based on PISA 2012 (OECD 2014), and motivated by the desire among policy-makers to increase non-cognitive content in national and international large-scale assessments, Kaplan and Su found that matrix sampling of context questionnaire (CQ) material followed by predictive mean matching imputation can quite accurately recover the known marginal distributions of the PVs. However, bias was found in the estimation of correlations between CQ scales and PVs.^{1} Kaplan and Su (2016) speculated that this bias was due to the fact that the plausible values were not part of the missing data imputation model and hence not “congenial” in the sense of Meng (1994).
In this paper, we investigate two approaches to multiple imputation that use PVs in the imputation model for the missing CQ data. To our knowledge, the two approaches discussed in Weirich et al. (2014) have not been studied across different missing data designs, and so an important feature of this paper is that we compare these approaches under three planned missing data designs: a two-form design examined by Adams et al. (2013), a three-form design that was used for PISA 2012, and a partially balanced incomplete block design (PBIB) studied by Kaplan and Su (2016). We carry out our investigation by simulating these designs on data from PISA 2006, allowing a comparison of our findings to the actual empirical data. We evaluate the marginal distributions of the PVs, the correlations among the imputed CQ variables and the PVs as well as the estimates of regression coefficients and their corresponding standard errors, by comparing to the original questionnaire without matrix sampling and imputation.
The organization of this paper is as follows. In the next section, we provide a review of the literature on missing data in large-scale assessments by first suggesting that prior research on the topic can be situated within the framework of congenial missing data problems. This is followed by an overview of the matrix sampling designs used in this study. This is then followed by a description of the simulation design for this paper. Next, we provide the results of our simulation studies focusing on recovery of marginal distributions of PVs, bias in the correlations among the CQ and PVs, and bias in regression coefficients and their standard errors from a regression of the PVs on the CQ scales. The paper closes with a discussion of the results in light of recent calls for increased focus on the policy importance of context questionnaires in large-scale assessments.
Background
As noted earlier, one finding of the Kaplan and Su (2016) paper was that correlations among the PVs and imputed CQ scales were biased. Kaplan and Su speculated that this bias was due, in part, to the fact that the PVs themselves were not included in the imputation models that they explored. Omitting the PVs as part of the imputation process leads to uncongeniality between the imputation model and the analysis model. This is a particular problem for secondary analyses of large-scale assessments insofar as the PVs of the cognitive assessments are, arguably, of primary policy importance.
Congenial missing data problems
The problem of uncongeniality has led to the general principle that one should include as many variables as possible in the imputation model for the missing data (see e.g. Rubin 1996). Considering that the PVs in the various knowledge domains are a central focus of large-scale assessments, one purpose of this paper is to examine how PVs can be generated and used in the imputation of planned missing CQ data under different imputation methods and designs, and how these methods and designs impact the potential biases in secondary analyses.

> “...uncongeniality... essentially means that the analysis procedure does not correspond to the imputation model. The uncongeniality arises when the analyst and the imputer have access to different amounts and sources of information, and have different assessments (e.g., explicit model, implicit judgement) about both responses and non-responses. If the imputer’s assessment is far from reality, then, as Rubin (1995)^{2} wrote, “all methods for handling nonresponse are in trouble” based on such an assessment; all statistical inferences need underlying key assumptions to hold at least approximately. If the imputer’s model is reasonably accurate, then following the multiple-imputation recipe prevents the analyst from producing inferences with serious nonresponse biases.”
Related research
Much of the extant literature addressing the topic of missing data in the CQ has focused on its impact with respect to the model used in generating population and sub-population ability estimates of plausible values (e.g. Mislevy (1991); von Davier et al. (2009); Rutkowski (2011)) and on item or variable non-response (e.g. Aßman et al. (2015)). The general finding is that sub-population estimates of plausible values are relatively stable under conditions of missing-at-random and not-missing-at-random data (Rutkowski 2011). The present paper continues the line of inquiry found in von Davier (2014) and Kaplan and Su (2016), focusing on planned missing data arising from a deliberate matrix sampling of the CQ. As noted earlier, we extend this work by examining alternative planned missing data designs and alternative approaches to imputing missing data in the CQ.
There are many approaches to addressing missing data in the CQ when generating PVs. A rather ad hoc method implemented in PISA is to substitute country means for missing values and create dummy codes to indicate missingness in the CQ. This is then followed by a principal components analysis to reduce dimensionality and ease the computation of the PVs. The difficulty with the dummy coding approach, as pointed out by Aßman et al. (2015), is that it does not incorporate the PVs as part of the CQ imputation. This, in turn, results in an uncongenial missing data model, and it also does not address uncertainty arising from the missing data in the CQ. Explicitly addressing this uncertainty requires the use of multiple imputation methods (Rubin 1987). In principle, non-parametric or parametric methods can be used; however, the question remains as to how the PVs can be incorporated into the imputation of the CQ missing data.
Rather than using dummy variables that are coded to address missing data, a general approach to incorporating PVs into the imputation of the CQ is through the use of multiple imputation (Rubin 1987). A discussion of multiple imputation using PVs for CQ imputation was provided by Weirich et al. (2014), who distinguish between two approaches in this context: single + multiple imputation (SMI) and multiple + multiple imputation (MMI). Following Weirich et al. (2014), four steps are required for the SMI approach. The first step is to calibrate the item response model for the cognitive assessment without the use of the CQ. A simple marginal maximum likelihood approach can be used for this step. The second step involves imputation of the CQ using a proxy for the latent ability \(\theta\). Proxies could include the simple percentage correct, maximum likelihood estimates (MLEs), or Warm weighted likelihood estimates (WLEs; Warm (1989)). However, it should be noted that these proxies are biased estimates of latent ability. Indeed, an important contribution of our paper is that we use the generated PVs directly in the imputation of the CQ. The third step requires estimating the parameters of a latent regression model in which the latent ability variable is regressed on the set of CQ variables, where the CQ is now complete due to the imputation in the second step. This step is required in order to impute plausible values of the latent ability distribution and is standard procedure in large-scale assessments (see, e.g. von Davier (2014)).^{3} The fourth step is the generation of the PVs based on the “completed” CQ.
As shown by Weirich et al. (2014), the SMI approach does reduce bias in the population model when the uncertainty in the CQ depends on \(\theta\). However, as they also note, the SMI approach is still not optimal because there exists uncertainty in the estimation of \(\theta\) due to missingness in the CQ. To fully address this uncertainty, Weirich et al. (2014) advocate for the MMI approach. The MMI approach is based on the notion of nested (or two-stage) multiple imputation developed by Rubin (2003), (see also; Schafer and Graham (2002); Reiter and Raghunathan (2007); Harel (2007)). The basic steps of MMI require that in the second step described for SMI, M imputations of the CQ are created. Step 3, then, must be repeated M times, and then this is followed by Step 4 where, say, K plausible values are drawn from the posterior distribution of latent ability resulting in \(M \times K\) plausible values. As noted by Rubin (2003, p. 6), the usual combining rules under multiple imputation must be modified because nested imputations are correlated.
In an extensive simulation study, Weirich et al. (2014) found that the SMI and MMI approaches provided roughly comparable reduction of bias in the population model. They suggest that one limitation of their study was the use of WLEs as proxies for \(\theta\). As noted by Wu (2005), the problem with using WLEs as proxies for \(\theta\) is that they are biased estimates of the population mean unless the same test items are given to all respondents, which is not the case in international large-scale assessments such as PISA that utilize a balanced incomplete block spiraling design for test booklets. Moreover, as pointed out by Mislevy et al. (1992), WLEs are susceptible to scale unreliability. We address the problem of using WLEs by estimating PVs directly in the imputation process.
The present paper focuses on the development of a complete data base for secondary statistical modeling and expands the work of Weirich et al. (2014) in several ways. First, as noted earlier, our focus is specifically on the problem of planned missing data designs in the CQ as implemented in the 2012 cycle of PISA rather than item/variable missing data. Second, we examine the SMI and MMI approaches to multiple imputation across three different designs that are relevant to large-scale assessments. Finally, our focus is on the perspective of the secondary data-analyst. Specifically, we focus on bias in correlations and regression coefficients derived from secondary studies, rather than bias in item parameters.
Matrix sampling designs for the context questionnaire
A classic study of matrix sampling designs can be found in Shoemaker (1973) who provided procedural guidelines and computational formulas for a variety of matrix sampling designs. More recently, Frey et al. (2009) provided a didactic discussion of matrix sampling designs, carefully outlining theoretical and practical implications for a variety of different designs. Gonzalez and Rutkowski (2010) also outlined a variety of matrix sampling designs and showed the impact of these designs on item and person parameter recovery in a simulation study of a large-scale assessment.
In this section, we first introduce the PISA 2006 student context questionnaire data that we use in our study and then the three matrix sampling designs that we implement. The original context questionnaire of PISA 2006 contains all respondents’ background information. In order to investigate what would happen had we implemented a matrix sampling design on the context questionnaire, we simulate the two-form design, the three-form design, and the PBIB design using the US data of PISA 2006. To simulate a matrix sampling design, parts of respondents’ information are deleted (i.e., set to missing) from the original CQ data. The missing information is then imputed so that end-users have complete data for subsequent analyses.
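The deletion step described above can be sketched compactly. The study's own functions are written in R (Appendix B); purely as a language-agnostic illustration, the following Python sketch assigns each respondent a form at random and blanks out the scales that form does not administer. All names (`apply_design`, the block and form identifiers) are hypothetical.

```python
import random

def apply_design(records, blocks, forms, seed=0):
    """Simulate a planned missing data design: each respondent is randomly
    assigned a form; scales in blocks the form omits are set to None.

    records: list of dicts (one per respondent, scale name -> value)
    blocks:  dict block_name -> list of scale names
    forms:   dict form_id -> list of block names administered on that form
    """
    rng = random.Random(seed)
    masked = []
    for rec in records:
        form = rng.choice(list(forms))
        administered = {s for b in forms[form] for s in blocks[b]}
        masked.append({k: (v if k in administered else None)
                       for k, v in rec.items()})
    return masked
```

In a three-form layout, for example, every form carries the common block plus two of the three rotated blocks, so each respondent is missing exactly one rotated block by design.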
Data
We use the 34 scales in the PISA 2006 context questionnaire as the background variables, which are also used in the Adams et al. (2013) paper. Some of the scales are based on single items, such as GENDER and AGE. Others, such as science self-concept (SCSCIE), are first derived from an IRT scaling of the items constituting the construct. The resulting indices derived from the IRT scaling are then treated as manifest variables in the conditioning model. Based on the PISA 2006 technical report (OECD 2009), dichotomous items were scaled using a one-parameter Rasch model (Rasch 1960), and items with more than two response categories were scaled using the partial credit model (Masters 1982; see also Masters and Wright 1997). Table 1 describes the 34 scales in the two-form matrix sampling design. The US data consist of 5611 respondents. The initial missing data from the respondents are imputed to ensure that the original CQ does not contain any item or scale missing data.
Two-form design (Adams et al. 2013)
Two-form design for PISA 2006 simulation study based on Adams et al. (2013)
Common block (administered in both questionnaire forms)

Scale name | Scale description
---|---
PROGN | Country study program
GRADE | Grade
AGE | Age of the student
GENDER | Gender
BMMJ | Occupation of mother
BFMJ | Occupation of father
BSMJ | Occupation of self at 30
MISCEDN | Educational level of mother
FISCED | Educational level of father
IMMIG | Immigration status
LANG | Language at home
DEFFORT | Difference in effort
CULTPOSS | Classic literature, books of poetry, works of art
HEDRES | Study desk, quiet place to study, computer for school work, educational software, own calculator, books to help with school work, dictionary
WEALTH | Own room, internet link, dishwasher, DVD/VCR, three country-specific wealth items, number of cellphones, TVs, computers, cars
Block 1: Scale name | Block 1: Scale description | Block 2: Scale name | Block 2: Scale description
---|---|---|---
CARINFO | Student information on science-related careers | ENVOPT | Environmental optimism |
CARPREP | School preparation for science-related careers | ENVPERC | Perception of environmental issues |
ENVAWARE | Awareness of environmental issues | GENSCIE | General value of science |
INSTSCIE | Instrumental motivation in science | INTSCIE | General interest in learning science |
JOYSCIE | Enjoyment of science | PERSCIE | Personal value of science
SCIEFUT | Future-oriented science motivation | RESPDEV | Responsibility for sustainable development |
SCINTACT | Science teaching: interaction | SCAPPLY | Science teaching: focus on applications or models |
SCINVEST | Science teaching: student investigations | SCHANDS | Science teaching: hands-on activities |
SCSCIE | Science self-concept | SCIEACT | Science activities |
| | SCIEEFF | Science self-efficacy
Three-form design: PISA 2012
Three-form design based on PISA 2012 (1 = block administered on the form, 0 = omitted)
Form | Common block | Rotation block A | Rotation block B | Rotation block C
---|---|---|---|---
1 | 1 | 1 | 1 | 0 |
2 | 1 | 1 | 0 | 1 |
3 | 1 | 0 | 1 | 1 |
Variable assignment to blocks in the three-form design
Form 1 | Form 2 | Form 3 |
---|---|---|
Common block | Common block | Common block |
Rotation bock 1 | Rotation block 2 | Rotation block 3 |
CARINFO | SCINTACT | PERSCIE |
CARPREP | SCINVEST | RESPDEV |
ENVAWARE | SCSCIE | SCAPPLY |
INSTSCIE | ENVOPT | SCHANDS |
JOYSCIE | ENVPERC | SCIEACT |
SCIEFUT | GENSCIE | SCIEEFF |
INTSCIE |
Partially balanced incomplete block design (Kaplan and Su 2016)
Partially balanced incomplete block design for the 19 questionnaire scales (columns 1–19 index the scales; 1 = scale administered on the form, 0 = omitted)
Form | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
3 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
5 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
6 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
7 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
8 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
9 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
10 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
11 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
12 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
13 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
14 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
15 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
16 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
17 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
18 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
19 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
Simulation procedures
All analyses utilized the R programming environment (R Core Team 2017). Functions for generating the PVs are given in Appendix A and functions to generate the CQ simulations are given in Appendix B. We first impose the matrix sampling designs on the original CQ data. Then, in order to impute the missing data and to generate the PVs, we implement the SMI and MMI approaches of Weirich et al. (2014) with slight modifications. The simulation thus has six conditions in total: three matrix sampling designs by two imputation approaches. Both approaches require us first to specify the item response model and obtain initial PVs; second, to impute the CQ missing data using the initial PVs; and third, to generate the final PVs using the imputed CQ data in the conditioning model. The difference between the two approaches lies in the second step: the SMI or MMI method for the CQ missing data. Under SMI, the final step generates five PVs based on the single imputed CQ. Under MMI with five imputed CQs, the final step generates 25 PVs, five from each of the five imputed CQs. We then evaluate the distributions of the PVs under the six simulation conditions. The multiple PVs are also used in the secondary analysis to explore bias in the correlations between the scales and the PVs and bias in the regression coefficients and their standard errors.
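The logical structure of the two pipelines can be sketched as follows. The study's actual implementation is in R (Appendices A and B); this Python skeleton is illustrative only, and `impute_cq` and `draw_pvs` are hypothetical placeholders for the MICE-based CQ imputation and the latent-regression PV draw.

```python
def smi_pvs(masked_cq, impute_cq, draw_pvs, k=5):
    """Single + multiple imputation: one completed CQ, then K plausible values."""
    cq = impute_cq(masked_cq)                # step 2: single imputation of the CQ
    return [draw_pvs(cq) for _ in range(k)]  # final step: K PVs from one completion

def mmi_pvs(masked_cq, impute_cq, draw_pvs, m=5, k=5):
    """Multiple + multiple (nested) imputation: M completed CQs, K PVs each."""
    nests = []
    for _ in range(m):                # step 2: M imputations of the CQ
        cq = impute_cq(masked_cq)     # one completed CQ per nest
        nests.append([draw_pvs(cq) for _ in range(k)])  # K PVs per nest
    return nests                      # M x K nested plausible values
```

With m = k = 5 this yields the 5 PVs (SMI) and 25 nested PVs (MMI) described above.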
Calibration
In the first step, we specify the item response model. For this purpose, we use the “TAM” package (Kiefer et al. 2014) in the R software environment (R Core Team 2017) to scale the cognitive data. We implement a unidimensional one-parameter partial credit model with the ConQuest parametrization (Adams et al. 2015). The single dimension is science, comprising 102 cognitive items in total. Following the PISA 2006 technical report (OECD 2009), we fix the item parameters at their international values and apply the sampling weights when specifying the item response model. Finally, five normally approximated PVs are generated (Chang and Stout 1993) without the conditioning model (i.e., without conditioning on the background information). In contrast to Weirich et al. (2014), who used weighted likelihood estimates (WLEs) as proxies for individual proficiency when imputing the CQ in the following step, we directly use the initial PVs generated in this step.^{5}
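The exact TAM-based procedure appears in Appendix A. As a rough illustration only of what “normally approximated” PVs means, assuming per-respondent posterior means (EAPs) and posterior standard deviations were available, the draw could be sketched as:

```python
import random

def normal_pvs(eap, post_sd, n_pv=5, seed=0):
    """Draw normally approximated plausible values: for each respondent i,
    n_pv independent draws from N(eap[i], post_sd[i]), the normal
    approximation to the posterior of latent ability.
    Returns a list of n_pv lists, each of length len(eap)."""
    rng = random.Random(seed)
    return [[rng.gauss(m, s) for m, s in zip(eap, post_sd)]
            for _ in range(n_pv)]
```

This is a sketch under stated assumptions, not the TAM implementation; in the study the posterior quantities come from the fitted item response model.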
Imputing questionnaire data
The second step is to impute the CQ missing data. For each matrix sampling design, we implement SMI and MMI for the CQ missing data using predictive mean matching (PMM) via the R package MICE (van Buuren and Groothuis-Oudshoorn 2010). Previous research (Kaplan and Su 2016; Kaplan and McCarty 2013) has found predictive mean matching to perform quite well with respect to meeting the requirements for the validity of statistical matching and imputation set down by Rässler (2002).
Predictive mean matching
Following van Buuren (2012), (see also; Kaplan and Su (2016)), predictive mean matching is implemented through a fully conditional specification approach that uses a univariate regression model consistent with the scale of the variable with missing data to provide predicted values of the missing data given the observed data. Once a variable of interest is filled-in, that variable, along with the variables for which there is complete data, is used in a sequence to fill in another variable. Once the sequence is completed for all variables with missing data, the posterior distributions of the regression parameters are obtained via Gibbs sampling and the process is started again. The algorithm can run these sequences simultaneously M number of times obtaining M imputed data sets.
1. Obtain \(\hat{\beta }\) based on \(X_{obs}\) and let \(\tilde{\sigma }^2\) be a draw based on the deviations \((y_{obs}-X_{obs}\hat{\beta })'(y_{obs}-X_{obs}\hat{\beta })/\tilde{g}\), where \(\tilde{g}\) is a draw from a \(\chi ^2\) distribution.
2. Draw \(\tilde{\beta }= \hat{\beta }+ \tilde{\sigma }\tilde{z}_1V^{1/2}\), where \(V^{1/2}\) is the Cholesky factor of \(V = (X_{obs}'X_{obs})^{-1}\) and \(\tilde{z}_1\) is a p-dimensional vector of N(0, 1) random variates.
3. Calculate \(\tilde{\delta }(i,j) = |X_{obs,[i]}\hat{\beta }- X_{miss,[j]}\tilde{\beta }|\); \(i=1,2,\ldots ,n_1\), \(j=1,2,\ldots ,n_0\).
4. Construct \(n_0\) sets \(W_j\), each containing d candidate donors from \(y_{obs}\), such that \(\sum _{d}\tilde{\delta }(i,j)\) is minimum. Break ties randomly.
5. Randomly draw one donor \(i_j\) from \(W_j\) for \(j=1,2,\ldots ,n_0\).
6. Impute \(\tilde{y}_j = y_{i_j}\), for \(j=1,2,\ldots , n_0\).
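A minimal sketch of the matching idea (steps 3 through 6) follows, in Python for illustration only; the study itself uses the MICE implementation. To keep the sketch short it uses a single predictor and skips the Bayesian parameter draw of steps 1 and 2 (i.e., it sets \(\tilde{\beta }\approx \hat{\beta }\)); the function name `pmm_impute` is ours.

```python
import random

def pmm_impute(y_obs, x_obs, x_miss, d=5, seed=0):
    """Simplified predictive mean matching with one predictor.
    For each missing case: predict its mean, find the d observed cases with
    the closest predicted means (step 4), and randomly draw one donor's
    observed y (steps 5-6). The Bayesian draw of beta (steps 1-2) is
    omitted here for brevity."""
    rng = random.Random(seed)
    n = len(y_obs)
    # Ordinary least squares fit of y on x over the observed cases.
    xbar = sum(x_obs) / n
    ybar = sum(y_obs) / n
    sxx = sum((x - xbar) ** 2 for x in x_obs)
    beta = sum((x - xbar) * (y - ybar) for x, y in zip(x_obs, y_obs)) / sxx
    alpha = ybar - beta * xbar
    pred_obs = [alpha + beta * x for x in x_obs]
    imputed = []
    for xj in x_miss:
        pj = alpha + beta * xj
        donors = sorted(range(n), key=lambda i: abs(pred_obs[i] - pj))[:d]
        imputed.append(y_obs[rng.choice(donors)])
    return imputed
```

A useful property visible in the sketch is that every imputed value is an actually observed value of the target variable, which keeps imputations on the variable's original scale.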
Generating final PVs
In the third step, we generate the final PVs conditioning on the imputed CQ data set. The item response model is the same as the calibration step except for conditioning on all the CQ variables in Table 1. Five normally approximated PVs are generated using one complete CQ data set. Thus for the MMI approach, we generate 25 PVs using the five imputed data sets. The generated PVs are all placed on the PISA scale (OECD 2009, p. 246). The final PVs are then used in the subsequent analyses.
Analysis
In the analysis step, we are interested in (1) how the distributions of the PVs under the six simulation conditions differ from the distributions of the PVs that are generated from the original questionnaire data; (2) how the correlations of CQ variables under the six conditions differ from the original questionnaire data; and (3) how regression coefficients differ across the six conditions compared to regression results from the original data.
To assess the distributions of PVs under the three planned missing data designs and two imputation approaches, we use the PVs conditioned on the original CQ data as the baseline comparison. The procedure for generating the PVs is the same as in the previous step: five PVs are generated by conditioning on the original data with all scales shown in Table 1. We calculate the mean and the standard errors of the PVs. The mean of the PVs is simply the average across the five or 25 PVs. The standard errors under the SMI approach are pooled using Rubin’s (1987) rules. The standard errors under the MMI approach are pooled using the modified combining rules (Rubin 2003). In addition, we conduct Kolmogorov–Smirnov tests to compare the distributions of PVs.
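For the SMI case, Rubin's (1987) rules can be sketched as follows (Python, illustrative only; the study's computations were done in R):

```python
def pool_rubin(estimates, variances):
    """Pool M point estimates and their sampling variances by Rubin's (1987)
    rules: pooled estimate qbar, within-imputation variance W, between-
    imputation variance B, and total variance T = W + (1 + 1/M) * B."""
    m = len(estimates)
    qbar = sum(estimates) / m
    w = sum(variances) / m
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)
    t = w + (1 + 1 / m) * b
    return qbar, t ** 0.5  # pooled estimate and pooled standard error
```

The pooled standard error thus reflects both the average sampling variance and the disagreement across the M imputations (here, across the five PVs).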
Modified combining rules
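Although the full development is given in Rubin (2003) (see also Harel 2007), the modified combining rules for nested multiple imputation can be sketched as follows (notation ours). Let \(\hat{Q}_{mk}\) and \(U_{mk}\) denote the point estimate and its sampling variance from nest \(m = 1,\ldots,M\) and within-nest imputation \(k = 1,\ldots,K\), and let \(\bar{Q}_{m\cdot}\) be the average over the \(K\) imputations within nest \(m\). Then

```latex
\bar{Q} = \frac{1}{MK}\sum_{m=1}^{M}\sum_{k=1}^{K}\hat{Q}_{mk}, \qquad
\bar{U} = \frac{1}{MK}\sum_{m=1}^{M}\sum_{k=1}^{K}U_{mk},
\]
\[
B = \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\bar{Q}_{m\cdot}-\bar{Q}\bigr)^{2}, \qquad
W = \frac{1}{M(K-1)}\sum_{m=1}^{M}\sum_{k=1}^{K}\bigl(\hat{Q}_{mk}-\bar{Q}_{m\cdot}\bigr)^{2},
\]
\[
T = \bar{U} + \Bigl(1+\frac{1}{M}\Bigr)B + \Bigl(1-\frac{1}{K}\Bigr)W,
```

with the pooled standard error taken as \(\sqrt{T}\). Readers should consult Rubin (2003) for the reference distribution and degrees of freedom that accompany these quantities.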
To assess the regression coefficients under the six conditions, we conduct multiple regression analyses by regressing the multiple PVs on the selected scales. Note that we intentionally chose an analytic model which is simpler than the model that is used to generate PVs to reflect the realistic situation in which the researcher may not be aware of the full set of variables that were used in the conditioning model and is instead focusing his/her attention on a small set of theoretically motivated variables.
In the regression analysis, student sampling weights are added to reflect the complex sampling design of PISA 2006 (see OECD (2009), for more details). We then pool the 5 or 25 regression analyses according to Rubin’s rules (1987; 2003, respectively). To have a baseline to compare to, we use the coefficients and standard errors from the regression analysis based on the original data (with the same regression model). We calculate the standardized bias of the coefficient estimates as the difference between the pooled estimates and the counterpart estimates based on the original data, standardized by the standard deviation of the outcome variable. We calculate the ratio of variances as the squared pooled standard errors under each condition over the squared standard errors from the original data. It is expected that the ratio of the variances should be greater than one, since the standard errors under the six conditions must reflect the uncertainty due to the generation of PVs or imputation of the CQ data. The magnitude of the ratio depends on the choice of the design and the uncertainty from the generation of PVs or imputation of the CQ.
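The two evaluation quantities just described can be sketched as follows (Python, illustrative only; function names are ours):

```python
def standardized_bias(pooled_est, original_est, sd_outcome):
    """Standardized bias: the pooled coefficient minus the original-data
    coefficient, scaled by the standard deviation of the outcome variable."""
    return (pooled_est - original_est) / sd_outcome

def variance_ratio(pooled_se, original_se):
    """Ratio of the squared pooled SE to the squared original-data SE.
    Values above 1 reflect the added uncertainty from PV generation and
    imputation of the CQ data."""
    return (pooled_se / original_se) ** 2
```

For example, a pooled SE twice the size of the original-data SE yields a variance ratio of four.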
Results
In this section we first present the results for the marginal distributions of the PVs. This is followed by a presentation of the correlation bias. Finally, we present the results of the regression analysis. We show that the marginal distributions of the PVs under the six simulation conditions do not differ from those generated from the original questionnaire data. The correlations among CQ variables differ across designs but not across imputation methods. The estimates of the regression coefficients differ across both the designs and the methods.
Marginal PV distributions
Pooled mean and standard error of PVs
Design | Approach | Pooled mean | Pooled SE
---|---|---|---
Original design | | 489.004 | 216.964
Two-form design | SMI | 489.111 | 141.671
Two-form design | MMI | 489.105 | 355.791
Three-form design | SMI | 489.110 | 120.215
Three-form design | MMI | 489.253 | 388.075
PBIB | SMI | 489.070 | 182.998
PBIB | MMI | 489.228 | 322.087
Kolmogorov–Smirnov tests
Design | p value |
---|---|
Two-form SMI vs. original | 0.79 |
Two-form MMI vs. original | 0.94 |
Three-form SMI vs. original | 0.80 |
Three-form MMI vs. original | 0.93 |
PBIB SMI vs. original | 0.86 |
PBIB MMI vs. original | 0.98 |
Correlation bias
For the SMI and MMI approaches under each planned missing data design we calculate the pairwise correlations among all CQ scales, including those between CQ variables and PVs. In order to compute the bias in correlations, we use the correlations of the original CQ as the true correlation values. The bias in correlations is calculated as the difference between the averaged correlations across multiple imputed data sets and the true correlations from the original data.
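The computation just described can be sketched as follows (Python, illustrative only; the study's own computations were done in R, and function names are ours):

```python
def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def correlation_bias(imputed_corrs, true_corr):
    """Bias in a correlation: the correlation averaged across the multiple
    imputed data sets minus the correlation in the original data."""
    return sum(imputed_corrs) / len(imputed_corrs) - true_corr
```

Here `imputed_corrs` would hold the same pairwise correlation computed in each imputed data set, and `true_corr` its value in the original CQ.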
Regression analysis
Figure 5 presents the standardized bias in the estimated regression coefficients across the designs for SMI and MMI, respectively. The scales to the left of the vertical line in each plot are the variables in the common block. In Fig. 5 we observe that, for both the SMI and MMI methods, less bias is found for scales in the common block than for scales in the rotated blocks, because variables in the common block are not part of the planned missing data. As with the correlation bias results, the three-form design implemented in PISA 2012 shows the least bias in the regression coefficients. For the SMI method, all variables in the three-form design have standardized biases within 0.08 in absolute value, followed by 93% of the variables in the PBIB design and 87% in the two-form design. For the MMI method, the three-form and PBIB designs have standardized biases within 0.08 in absolute value for all variables, compared to only 90% in the two-form design. In addition, the two-form design produces several more extremely biased coefficients, as can be seen in the scales SCINTACT, SCAPPLY, and SCHANDS.
Figure 6 presents the standard errors of the regression coefficients for the SMI and MMI methods. The plots show the ratio of the squared standard errors of the regression coefficients from the three rotation designs over those from the original design. First, almost all ratios are larger than one, as expected, because the standard errors from the rotation designs must account for the uncertainty due to missing data, whereas the original design contains no missing data and thus yields smaller standard errors. Second, the ratios are much larger for the MMI method than for the SMI method. This is also expected, because multiple imputation results in larger standard errors than single imputation. Third, for the scales in the common block, the ratios are much closer to one than for the scales in the rotation blocks, because there is no missing data in the common block. Finally, across designs, the ratios from the three-form design are smaller than those from the PBIB and two-form designs. For the SMI method, none of the standard errors from the three-form design is 100% larger than that from the original design, whereas 30% of the ratios in the PBIB design and 10% in the two-form design exceed two. For the MMI method, the two-form design produces much larger standard errors than the other two designs, with 50% of the standard errors at least twice as large as in the original design, followed by 47% in the PBIB design and 10% in the three-form design.^{7}
Conclusions
This paper expanded on earlier work by Weirich et al. (2014) in two ways. First, we showed that it is possible to use PVs simply and directly via nested multiple imputation. Consistent with the results of Weirich et al. (2014), we found that nested multiple imputation with PVs provides considerable bias reduction, as expected under the framework of congenial missing data models. Also consistent with Weirich et al. (2014), our findings showed relatively similar results for SMI and MMI. It should be pointed out that it is possible to implement a procedure that combines the PVs and CQ in one algorithm for simultaneous imputation (see Aßman et al. 2015). The approach of Aßman et al. (2015) was studied under general missing data in the CQ but should be compared to the SMI and MMI approaches in the context of planned missing data in future studies. Second, we showed that the three-form design as implemented in PISA 2012 performed better in terms of correlation and regression bias reduction than the PBIB design examined in Kaplan and Su (2016) and the two-form design of Adams et al. (2013). The bias reduction under the partially balanced incomplete block design is still better than that achieved under the two-form design with the same overall amount of missing data. Further studies investigating variations of incomplete block designs for CQs are needed, because there are many other design possibilities (e.g., amount of missing data, missingness on items within scales, etc.) that may be well-suited to large-scale educational assessments.
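For readers implementing the SMI/MMI distinction, the nested (two-stage) combining rules of Rubin (2003) and Harel (2007) generalize Rubin's single-stage rules: with M nests (e.g., sets of plausible values) and N imputations within each nest, the total variance picks up both a between-nest and a within-nest component. The sketch below follows our reading of the standard two-stage formula; the values are illustrative.

```python
import numpy as np

def nested_mi_variance(estimates, within_variances):
    """Two-stage (nested) MI total variance (Rubin 2003; Harel 2007).

    `estimates` is an M x N array of point estimates (M nests,
    N imputations per nest); `within_variances` holds the
    corresponding complete-data variances.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(within_variances, dtype=float)
    m, n = q.shape
    w_bar = u.mean()                      # average complete-data variance
    nest_means = q.mean(axis=1)
    b = nest_means.var(ddof=1)            # between-nest variance
    w = ((q - nest_means[:, None]) ** 2).sum() / (m * (n - 1))  # within-nest
    return w_bar + (1 + 1 / m) * b + (1 - 1 / n) * w
```

When N = 1 the within-nest term drops out and the formula reduces toward the single-stage (SMI-like) case.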
To conclude, the cumulative research on multiple imputation methods (Schafer and Graham 2002; Reiter and Raghunathan 2007; Harel 2007; Rubin 2003) applied to context questionnaires (Aßman et al. 2015; Adams et al. 2013; Kaplan and Su 2016; Weirich et al. 2014) shows relatively minimal impact on the marginal distributions of PVs and on the joint relations of PVs with context questionnaire scales. The present study adds to the literature by comparing three planned missing data designs under two approaches to multiple imputation. Given that a common concern facing most national and international large-scale assessments is the desire to present as much content as possible without over-burdening the participants in the survey, and given increased interest in the so-called “non-cognitive” outcomes of education, we argue that the approach to questionnaire matrix sampling and imputation described in this paper should be given serious consideration.
For this paper, we focus on scales rather than the items that make up the scales. Matrix sampling of items within scales is a topic that is beyond the scope of this paper.
Associate classes are a feature of incomplete block designs and refer to the number of times a pair of scales (or items or variables) appear together. In a balanced incomplete block design, the associate classes are a constant—that is, the number of times a pair of scales appear together is the same for all pairs. For a partially balanced incomplete block design, we have multiple associate classes. The number of times a pair of scales appear together is different across pairs.
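The distinction between balanced and partially balanced designs can be checked mechanically by counting pair concurrences. The sketch below uses a small hypothetical design (not any of the designs studied in the paper): if every pair of scales appears together the same number of times, there is a single associate class (a balanced design); distinct counts indicate multiple associate classes (a partially balanced design).

```python
from collections import Counter
from itertools import combinations

def pair_concurrences(blocks):
    """Count how often each pair of scales appears together across
    the blocks (forms) of an incomplete block design."""
    counts = Counter()
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            counts[pair] += 1
    return counts

# Hypothetical design: 4 scales assigned to 4 forms of 2 scales each.
blocks = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]
counts = pair_concurrences(blocks)
# Distinct concurrence counts = number of associate classes; here some
# pairs co-occur once and others never, so the design is only
# partially balanced.
classes = {counts[p] for p in combinations("ABCD", 2)}
```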
Declarations
Authors’ contributions
DK conceptualized the study, guided its design and the statistical analysis, and contributed to drafting the manuscript. DS contributed to the design and development of the software used to conduct the analysis, carried out the analysis, and contributed to drafting the manuscript. Both authors read and approved the final manuscript.
Acknowledgements
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The datasets supporting the conclusions of this article are available at http://bise.wceruw.org/index.html. The software used to support the conclusions of this article is included within the article and is also available at http://bise.wceruw.org/index.html.
Funding
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
- Adams, R. J., Lietz, P., & Berezner, A. (2013). On the use of rotated context questionnaires in conjunction with multilevel item response models. Large-scale Assessments in Education. Retrieved from http://www.largescaleassessmentsineducation.com/content/1/1/5.
- Adams, R. J., Wu, M. L., & Wilson, M. R. (2015). ACER ConQuest 4.0. Melbourne: ACER.
- Aßman, C., Gaasch, C., Pohl, S., & Carstensen, C. H. (2015). Bayesian estimation in IRT models with missing values in background variables. Psychological Test and Assessment Modeling, 57, 505–618.
- Chang, H.-H., & Stout, W. F. (1993). The asymptotic posterior normality of the latent trait in an IRT model. Psychometrika, 58, 37–52.
- Frey, A., Hartig, J., & Rupp, A. A. (2009). An NCME instructional module on booklet designs in large-scale assessments of student achievement: Theory and practice. Educational Measurement: Issues and Practice, 28, 39–53.
- Gonzalez, E., & Rutkowski, L. (2010). Principles of multiple matrix booklet designs and parameter recovery in large-scale assessments. IEA-ETS Research Institute Monograph, 3, 125–156.
- Graham, J. W., Taylor, B. J., Olchowski, A. E., & Cumsille, P. E. (2006). Planned missing data designs in psychological research. Psychological Methods, 11, 323–343.
- Harel, O. (2007). Inferences on missing information under multiple imputation and two-stage multiple imputation. Statistical Methodology, 4, 75–89.
- Kaplan, D., & McCarty, A. T. (2013). Data fusion with international large scale assessments: A case study using the OECD PISA and TALIS surveys. Large-scale Assessments in Education. Retrieved from http://www.largescaleassessmentsineducation.com/content/1/1/6.
- Kaplan, D., & Su, D. (2016). On matrix sampling and imputation of context questionnaires with implications for the generation of plausible values in large-scale assessments. Journal of Educational and Behavioral Statistics, 41, 51–80.
- Kiefer, T., Robitzsch, A., & Wu, M. (2014). TAM: Test analysis modules (Computer software manual). Retrieved from http://CRAN.R-project.org/package=TAM (R package version 1.0-3.18-1).
- Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
- Masters, G., & Wright, B. (1997). The partial credit model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 101–122). New York: Springer.
- Meng, X. L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 9, 538–558.
- Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177–196.
- Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133–161.
- Montgomery, D. C. (2012). Design and analysis of experiments (8th ed.). New York: Wiley.
- OECD. (2009). PISA 2006 technical report. Paris: OECD.
- OECD. (2014). PISA 2012 technical report. Paris: OECD.
- R Core Team. (2017). R: A language and environment for statistical computing (Computer software manual). Vienna, Austria. Retrieved from https://www.R-project.org/.
- Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Nielsen & Lydiche.
- Rässler, S. (2002). Statistical matching: A frequentist theory, practical applications, and alternative Bayesian approaches. New York: Springer.
- Reiter, J. P., & Raghunathan, T. E. (2007). The multiple adaptations of multiple imputation. Journal of the American Statistical Association, 102, 1462–1471.
- Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Hoboken: Wiley.
- Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473–489.
- Rubin, D. B. (2003). Nested multiple imputation of NMES via partially incompatible MCMC. Statistica Neerlandica, 57, 3–18.
- Rutkowski, L. (2011). The impact of missing background data on sub-population estimation. Journal of Educational Measurement, 48, 293–312.
- Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147–177.
- Shoemaker, D. M. (1973). Principles and procedures of multiple matrix sampling. Oxford: Ballinger.
- van Buuren, S., & Groothuis-Oudshoorn, K. (2010). Multivariate imputation by chained equations, version 2.3. Retrieved from http://www.multiple-imputation.com/.
- van Buuren, S. (2012). Flexible imputation of missing data. New York: Chapman & Hall.
- von Davier, M. (2014). Imputing proficiency data under planned missingness in population models. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis. Boca Raton: Chapman Hall/CRC.
- von Davier, M., Gonzalez, E., & Mislevy, R. (2009). Plausible values: What are they and why do we need them? IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 2, 9–36.
- Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.
- Weirich, S., Haag, N., Hecht, M., Böhme, K., Siegle, T., & Lüdtke, O. (2014). Nested multiple imputation in large-scale assessments. Large-scale Assessments in Education, 2. Retrieved from http://www.largescaleassessmentsineducation.com/content/2/1/9.
- Wu, M. (2005). The role of plausible values in large-scale surveys. Studies in Educational Evaluation, 31, 114–128.