On imputation for planned missing data in context questionnaires using plausible values: a comparison of three designs
Large-scale Assessments in Education, volume 6, Article number: 6 (2018)
Abstract
Background
This paper extends a recent study by Kaplan and Su (J Educ Behav Stat 41: 51–80, 2016) examining the problem of matrix sampling of context questionnaire scales with respect to the generation of plausible values of cognitive outcomes in large-scale assessments.
Methods
Following Weirich et al. (Nested multiple imputation in large-scale assessments. In: Large-scale Assessments in Education, 2. http://www.largescaleassessmentsineducation.com/content/2/1/9, 2014), we examine single + multiple imputation and multiple + multiple imputation methods using predictive mean matching imputation under three different context questionnaire matrix sampling designs: a two-form design studied by Adams et al. (On the use of rotated context questionnaires in conjunction with multilevel item response models. In: Large-scale Assessments in Education. http://www.largescaleassessmentsineducation.com/content/1/1/5, 2013), a three-form design implemented in PISA 2012, and a partially balanced incomplete block design studied by Kaplan and Su (J Educ Behav Stat 41: 51–80, 2016).
Results
Our results show that the choice of design has a larger impact on the reduction of bias than the choice of imputation method. Specifically, the three-form design used in PISA 2012 yields considerably less bias compared to the two-form design and the partially balanced incomplete block design. We further show that the partially balanced incomplete block design produces less bias than the two-form design despite having the same amount of missing data.
Conclusions
We discuss the results in terms of implications for the design of context questionnaires in large-scale assessments.
Introduction
A recent paper by Kaplan and Su (2016) investigated the problem of matrix sampling of context questionnaires with respect to the generation of the plausible values (PVs) of the so-called “cognitive” tests in large-scale educational assessments. Drawing on earlier work by Adams et al. (2013) based on PISA 2012 (OECD 2014) and motivated by the desire among policymakers to increase non-cognitive content in national and international large-scale assessments, Kaplan and Su found that matrix sampling of context questionnaire (CQ) material followed by predictive mean matching imputation can quite accurately recover the known marginal distributions of the PVs. However, bias was found in the estimation of correlations between CQ scales and PVs.^{Footnote 1} Kaplan and Su (2016) speculated that this bias was due to the fact that the plausible values were not part of the missing data imputation model and hence not “congenial” in the sense of Meng (1994).
In this paper, we investigate two approaches to multiple imputation that use PVs in the imputation model for the missing CQ data. To our knowledge, the two approaches discussed in Weirich et al. (2014) have not been studied across different missing data designs, and so an important feature of this paper is that we compare these approaches under three planned missing data designs: a two-form design examined by Adams et al. (2013), a three-form design that was used for PISA 2012, and a partially balanced incomplete block design (PBIB) studied by Kaplan and Su (2016). We carry out our investigation by simulating these designs on data from PISA 2006, allowing a comparison of our findings to the actual empirical data. We evaluate the marginal distributions of the PVs, the correlations among the imputed CQ variables and the PVs, as well as the estimates of regression coefficients and their corresponding standard errors, by comparing them to results from the original questionnaire without matrix sampling and imputation.
The organization of this paper is as follows. In the next section, we provide a review of the literature on missing data in large-scale assessments by first suggesting that prior research on the topic can be situated within the framework of congenial missing data problems. This is followed by an overview of the matrix sampling designs used in this study and then by a description of the simulation design for this paper. Next, we provide the results of our simulation studies focusing on recovery of marginal distributions of PVs, bias in the correlations among the CQ and PVs, and bias in regression coefficients and their standard errors from a regression of the PVs on the CQ scales. The paper closes with a discussion of the results in light of recent calls for increased focus on the policy importance of context questionnaires in large-scale assessments.
Background
As noted earlier, one finding of the Kaplan and Su (2016) paper was that correlations among the PVs and imputed CQ scales were biased. Kaplan and Su speculated that this bias was due, in part, to the fact that the PVs themselves were not included in the imputation models that they explored. Omitting the PVs as part of the imputation process leads to uncongeniality between the imputation model and the analysis model. This is a particular problem for secondary analyses of large-scale assessments insofar as the PVs of the cognitive assessments are, arguably, of primary policy importance.
Congenial missing data problems
We situate our discussion of imputation under planned missing data within the framework of congenial missing data problems. The concept of congeniality in missing data problems was introduced by Meng (1994; see also Rubin 1996). In outlining the steps in conducting a large-scale survey, Meng (1994) pointed out that each step in the construction of a large-scale survey inherits information from the previous step. That is, the data file that a researcher uses is the result of a set of design steps which includes, in important ways, decisions that are made regarding the imputation of missing data. In many cases, as Meng (1994) notes, the individual (or individuals) charged with decisions regarding missing data imputation has little or no contact with the end-user of the data. Thus, if an analyst is interested in conducting some secondary statistical analysis using the data, his/her statistical model may have little in common with the model used to impute the missing data, and this “disconnect” can lead to serious biases. Quoting Meng (1994, p. 539):
“...uncongeniality... essentially means that the analysis procedure does not correspond to the imputation model. The uncongeniality arises when the analyst and the imputer have access to different amounts and sources of information, and have different assessments (e.g., explicit model, implicit judgement) about both responses and nonresponses. If the imputer’s assessment is far from reality, then, as Rubin (1995)^{Footnote 2} wrote, “all methods for handling nonresponse are in trouble” based on such an assessment; all statistical inferences need underlying key assumptions to hold at least approximately. If the imputer’s model is reasonably accurate, then following the multiple-imputation recipe prevents the analyst from producing inferences with serious nonresponse biases.”
The problem of uncongeniality has led to the general principle that one should include as many variables as possible in the imputation model for the missing data (see e.g. Rubin (1996)). Considering that the PVs in the various knowledge domains are a central focus of large-scale assessments, one purpose of this paper is to examine how PVs can be generated and used in the imputation of planned missing CQ data under different imputation methods and different designs and how these methods and designs impact the potential biases in secondary analyses.
Related research
Much of the extant literature addressing the topic of missing data in the CQ has focused on its impact with respect to the model used in generating population and subpopulation ability estimates of plausible values (e.g. Mislevy (1991); von Davier et al. (2009); Rutkowski (2011)) and on item or variable nonresponse (e.g. Aßman et al. (2015)). The general finding is that subpopulation estimates of plausible values are relatively stable under conditions of missing-at-random and not-missing-at-random (Rutkowski 2011). The present paper continues along the line of inquiry found in von Davier (2014) and Kaplan and Su (2016), focusing on planned missing data arising from a deliberate matrix sampling of the CQ. As noted earlier, we extend this current work by examining alternative planned missing data designs and by examining alternative approaches to imputing missing data in the CQ.
There are many approaches to addressing missing data in the CQ when generating PVs. A rather ad hoc method implemented in PISA is to use country means for missing values and create dummy codes to indicate missingness in the CQ. This is then followed by a principal components analysis to reduce dimensionality and ease the computation of the PVs. The difficulty with the dummy coding approach, as pointed out by Aßman et al. (2015), is that it does not incorporate the PVs as part of the CQ imputation. This, in turn, results in an uncongenial missing data model and also does not address uncertainty arising from the missing data in the CQ. To explicitly address uncertainty arising from the missing data requires the use of multiple imputation methods (Rubin 1987). In principle, nonparametric or parametric methods can be used; however, the question remains how the PVs can be incorporated into the imputation of the CQ missing data.
Rather than using dummy variables that are coded to address missing data, a general approach to incorporating PVs into the imputation of the CQ is through the use of multiple imputation (Rubin 1987). A discussion of multiple imputation using PVs for CQ imputation was provided by Weirich et al. (2014). In their paper, Weirich et al. (2014) distinguish between two approaches to multiple imputation in this context: single + multiple imputation (SMI) and multiple + multiple imputation (MMI). Following Weirich et al. (2014), four steps are required for the SMI approach. The first step is to create a calibration of the item response model for the cognitive assessment without the use of the CQ. A simple marginal maximum likelihood approach can be used for this step. The second step involves imputation of the CQ using a proxy for the latent ability \(\theta\). Proxies could include simple percentage correct, maximum likelihood estimates (MLEs), or Warm weighted likelihood estimates (WLEs; Warm (1989)). However, it should be noted that these proxies are biased estimates of latent ability. Indeed, an important contribution of our paper is that we will use the generated PVs directly in the imputation of the CQ. The third step requires estimating the parameters of a latent regression model in which the latent ability variable is regressed on the set of CQ variables, where the CQ is now completed due to the imputation from the second step. This step is required in order to impute plausible values of the latent ability distribution and is standard procedure in large-scale assessments (see, e.g. von Davier (2014)).^{Footnote 3} The fourth step is the generation of the PVs based on a “completed” CQ.
As shown by Weirich et al. (2014), the SMI approach does reduce bias in the population model when the uncertainty in the CQ depends on \(\theta\). However, as they also note, the SMI approach is still not optimal because there exists uncertainty in the estimation of \(\theta\) due to missingness in the CQ. To fully address this uncertainty, Weirich et al. (2014) advocate for the MMI approach. The MMI approach is based on the notion of nested (or two-stage) multiple imputation developed by Rubin (2003; see also Schafer and Graham (2002); Reiter and Raghunathan (2007); Harel (2007)). The basic steps of MMI require that in the second step described for SMI, M imputations of the CQ are created. Step 3, then, must be repeated M times, and then this is followed by Step 4 where, say, K plausible values are drawn from the posterior distribution of latent ability, resulting in \(M \times K\) plausible values. As noted by Rubin (2003, p. 6), the usual combining rules under multiple imputation must be modified because nested imputations are correlated.
In an extensive simulation study, Weirich et al. (2014) found that the SMI and MMI approaches provided roughly comparable reduction of bias in the population model. They go on to suggest that one limitation of their study was the use of the WLEs as proxies for \(\theta\). As noted by Wu (2005), the problem with using WLEs as proxies for \(\theta\) is that they yield a biased estimate of the population mean unless the same test items are given to all respondents, which is not the case with international large-scale assessments such as PISA, which utilize a balanced incomplete-block spiraling design for test booklets. Moreover, as pointed out by Mislevy et al. (1992), WLEs are susceptible to scale unreliability. We address the problem of using WLEs by estimating PVs directly in the imputation process.
The present paper focuses on the development of a complete database for secondary statistical modeling and expands the work of Weirich et al. (2014) in several ways. First, as noted earlier, our focus is specifically on the problem of planned missing data designs in the CQ as implemented in the 2012 cycle of PISA rather than item/variable missing data. Second, we examine the SMI and MMI approaches to multiple imputation across three different designs that are relevant to large-scale assessments. Finally, our focus is on the perspective of the secondary data analyst. Specifically, we focus on bias in correlations and regression coefficients derived from secondary studies, rather than bias in item parameters.
Matrix sampling designs for the context questionnaire
A classic study of matrix sampling designs can be found in Shoemaker (1973), who provided procedural guidelines and computational formulas for a variety of matrix sampling designs. More recently, Frey et al. (2009) provided a didactic discussion of matrix sampling designs, carefully outlining theoretical and practical implications for a variety of different designs. Gonzalez and Rutkowski (2010) also outlined a variety of matrix sampling designs and showed the impact of these designs on item and person parameter recovery in a simulation study of a large-scale assessment.
In this section, we first introduce the PISA 2006 student context questionnaire data that we use in our study and then the three matrix sampling designs that we implement. The original context questionnaire of PISA 2006 contains all respondents’ background information. In order to investigate what would have happened had a matrix sampling design been implemented on the context questionnaire, we simulate the two-form design, the three-form design, and the PBIB design, using the US data of PISA 2006. To simulate a matrix sampling design, parts of respondents’ information are deleted (i.e., set to be missing) from the original CQ data. The missing information is then imputed so that the end-users have the complete data to conduct subsequent analyses.
Data
We use the 34 scales in the PISA 2006 context questionnaire as the background variables, which are also used in the Adams et al. (2013) paper. Some of the scales are based on single items, such as GENDER and AGE. Others, such as science self-concept (SCSCIE), are derived first from an IRT scaling of items constituting the construct. The resulting indices derived from the IRT scaling are then treated as manifest variables in the conditioning model. Based on the PISA 2006 technical report (OECD 2009), dichotomous items were scaled using a one-parameter Rasch model (Rasch 1960), and items with more than two response categories were scaled using the partial-credit model (Masters 1982; see also Masters and Wright 1997). Table 1 describes the 34 scales in the two-form matrix sampling design. The US data consist of 5611 respondents. The initial missing data from the respondents are imputed to make sure the original CQ does not contain any item or scale missing data.
Two-form design (Adams et al. 2013)
We arrange the original questionnaire according to the design in Adams et al. (2013) using a joint conditioning approach with two questionnaire forms. In the two-form design, three mutually exclusive blocks of scales in the questionnaire are created. Table 1 shows the two-form design studied by Adams et al. (2013). The first block, referred to as the common block with 15 scales, is assigned to both questionnaire forms. The remaining two blocks (block 1 containing 9 scales and block 2 containing 10 scales) are assigned to each of the questionnaire forms, respectively. Thus, each questionnaire form contains the common block and one of the two rotated blocks. The scales are allocated to the blocks according to the principle that the average correlation between science performance and the scales from block 1 is similar to the average correlation between science performance and the scales from block 2. We assign the common block to the 5611 respondents, then we randomly assign block 1 to half of the respondents and block 2 to the other half. This implies that respondents who receive block 1 have data deleted in block 2, and vice versa. In this design, the 15 scales in the common block do not have any missing data, while the 19 scales in blocks 1 and 2 have 50% missing data. Because no respondents simultaneously receive blocks 1 and 2, we cannot estimate correlations or the interaction effects of the scales across the blocks using the traditional deletion methods (i.e., listwise or pairwise deletion). Even if we used multiple imputation or full information maximum likelihood methods to deal with missing data, the correlations and regression coefficients would still be biased due to this two-form design.
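The deletion step of the two-form design can be illustrated with a minimal sketch. It is not the authors' Appendix B code: the function name `apply_two_form_design`, the toy data, and the column positions of the blocks are hypothetical, and an even number of toy respondents is used so that each half is exactly 50%.

```python
import numpy as np

def apply_two_form_design(data, block1_cols, block2_cols, rng):
    """Return a copy of `data` with planned missingness for a two-form design.

    data        : (n_respondents, n_scales) array with no missing values
    block1_cols : column indices of rotated block 1 (hypothetical positions)
    block2_cols : column indices of rotated block 2 (hypothetical positions)
    Columns in neither list are treated as the common block and left intact.
    """
    n = data.shape[0]
    masked = data.astype(float).copy()
    # Randomly split respondents into two halves, one per questionnaire form.
    order = rng.permutation(n)
    form1 = order[: n // 2]   # common block + block 1, so block 2 is deleted
    form2 = order[n // 2 :]   # common block + block 2, so block 1 is deleted
    masked[np.ix_(form1, block2_cols)] = np.nan
    masked[np.ix_(form2, block1_cols)] = np.nan
    return masked

rng = np.random.default_rng(0)
toy = rng.normal(size=(1000, 34))   # toy stand-in for 34 complete scales
block1 = list(range(15, 24))        # 9 scales
block2 = list(range(24, 34))        # 10 scales
masked = apply_two_form_design(toy, block1, block2, rng)

missing_rate = np.isnan(masked).mean(axis=0)
# Count respondents observed on both a block-1 scale and a block-2 scale.
jointly_observed = int(
    (~np.isnan(masked[:, block1[0]]) & ~np.isnan(masked[:, block2[0]])).sum()
)
```

In this sketch `jointly_observed` is zero by construction, which is the property that makes cross-block correlations inestimable under listwise or pairwise deletion.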
Three-form design: PISA 2012
The second design we explore is the three-form design (see Graham et al. (2006)). The three-form design was implemented in PISA 2012 and is a focus of attention in this paper insofar as PISA 2012 was the first large-scale educational assessment to implement a CQ matrix sampling design in its main study. In this design, we keep the common block of the questionnaire scales the same as in the Adams et al. (2013) two-form design and then arrange the remaining 19 scales into three mutually exclusive blocks A, B and C (see Table 2). Each of the three questionnaire forms contains two of the three blocks. As Table 2 shows, form 1 contains blocks A and B, form 2 contains blocks A and C, and form 3 contains blocks B and C. In addition to the rotation blocks, each questionnaire form also contains the common block. We randomly assign the three forms to the respondents. The percentage of missing data for the variables in the rotation blocks is 33%, which is 17 percentage points lower than in the two-form design. Because all pairs of scales have jointly observed data, the correlations and interactions of the variables across rotation blocks are estimable. The assignment of the actual scales to the three forms can be seen in Table 3.
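A short sketch can confirm the two properties claimed for the three-form design: roughly one third of the data is missing in each rotated block, and every pair of blocks appears together on some form. The form-to-block mapping follows the description above; the function names and the near-equal allocation of forms are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

# Rotated blocks carried by each form (every form also carries the common block).
FORMS = {1: ("A", "B"), 2: ("A", "C"), 3: ("B", "C")}

def assign_three_forms(n, rng):
    """Assign forms 1-3 to n respondents in near-equal proportions, in random order."""
    forms = np.resize(np.array([1, 2, 3]), n)
    return rng.permutation(forms)

def block_missing_rates(forms):
    """Proportion of respondents who do not receive each rotated block."""
    rates = {}
    for block in "ABC":
        carrying = [f for f, blocks in FORMS.items() if block in blocks]
        rates[block] = 1.0 - np.isin(forms, carrying).mean()
    return rates

def uncovered_block_pairs():
    """Pairs of rotated blocks that never appear together on any form."""
    return [
        pair
        for pair in combinations("ABC", 2)
        if not any(set(pair) <= set(blocks) for blocks in FORMS.values())
    ]

rng = np.random.default_rng(1)
forms = assign_three_forms(5611, rng)
rates = block_missing_rates(forms)
```

Because `uncovered_block_pairs()` returns an empty list, every cross-block pair of scales is jointly observed for about one third of respondents.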
Partially balanced incomplete block design (Kaplan and Su 2016)
The third design that we explore is a partially balanced incomplete block design (PBIB). This design was studied in Kaplan and Su (2016) but not in the context of employing generated PVs as part of the imputation of the missing CQ variables, nor in conjunction with the SMI or MMI approaches for multiple imputation. In this design, we keep the common block of the questionnaire scales the same as in the Adams et al. (2013) design and then arrange the remaining 19 scales according to a PBIB design with three associate classes (Montgomery 2012).^{Footnote 4} The 19 scales are assigned to 19 forms but the missing percentage for each scale is still 50%, making it comparable to the two-form design. In our PBIB design, each cluster contains 9 or 10 scales as shown in Table 4. The scales are arranged in the 19 forms in such a fashion that all pairs of scales appear three, four or five times. For example, we see in Table 4 that scales 1 and 3 appear together 3 times, scales 1 and 2 appear together 4 times, and scales 1 and 7 appear together 5 times. We assign the common block to all 5611 respondents, then we randomly assign one of the 19 forms to each respondent. Respondents who get form 1 have data deleted on scales 1, 4, 7, 9, 13, and 15–19. As with the Adams et al. (2013) design, the 15 scales in the common block do not have any missing data, while the 19 scales have 50% missing data. Unlike the two-form design, the PBIB design ensures that all pairs of scales have observed data.
Simulation procedures
All analyses utilized the R programming environment (R Core Team 2017). Functions for generating the PVs are given in Appendix A and functions to generate CQ simulations are given in Appendix B. We first create the matrix sampling designs on the original CQ data. Then, in order to impute the missing data and to generate the PVs, we implement the SMI and MMI approaches of Weirich et al. (2014) with slight modifications. The simulation thus has six conditions in total: three matrix sampling designs by two approaches. The two approaches require us first to specify the item response model to obtain initial PVs, second to impute the CQ missing data using the initial PVs, and finally to use the imputed CQ data as the conditioning model to impute the final PVs. The difference between these two approaches is in the second step, namely whether SMI or MMI is used for the CQ missing data. If we use SMI, in the final step we generate five PVs based on the single imputed CQ. If we use MMI with five imputed CQs, in the final step 25 PVs will be generated, since five PVs are generated using each of the five imputed CQs. We will then evaluate the distributions of the PVs under the six simulation conditions. The multiple PVs will also be used in the secondary analysis to explore the bias in the correlations between the scales and the PVs and bias in regression coefficients and their standard errors.
Calibration
In the first step, we specify the item response model. For this purpose, we use the “TAM” package (Kiefer et al. 2014) in the R software environment (R Core Team 2017) to scale the cognitive data. We implement a unidimensional one-parameter partial credit model with the ConQuest parametrization (Adams et al. 2015). The dimension is science and contains in total 102 cognitive items. Following the PISA 2006 technical report (OECD 2009), we fix the item parameters at their international values and apply the sampling weights when specifying the item response model. Finally, five normally approximated PVs are generated (Chang and Stout 1993) without the conditioning model (i.e., without conditioning on the background information). In contrast with Weirich et al. (2014), who used the weighted maximum likelihood estimates (WLE) as proxies for individual proficiency scores to impute the CQ in the following step, we directly use the initial PVs that are generated in this step.^{Footnote 5}
Imputing questionnaire data
The second step is to impute the CQ missing data. For each matrix sampling design, we implement SMI and MMI for the CQ missing data using predictive mean matching (PMM) via the R package MICE (van Buuren and Groothuis-Oudshoorn 2010). Previous research (Kaplan and Su 2016; Kaplan and McCarty 2013) has found predictive mean matching to be quite good with respect to meeting the requirements for the validity of statistical matching and imputation set down by Rässler (2002).
Predictive mean matching
Following van Buuren (2012; see also Kaplan and Su 2016), predictive mean matching is implemented through a fully conditional specification approach that uses a univariate regression model consistent with the scale of the variable with missing data to provide predicted values of the missing data given the observed data. Once a variable of interest is filled in, that variable, along with the variables for which there is complete data, is used in a sequence to fill in another variable. Once the sequence is completed for all variables with missing data, the posterior distributions of the regression parameters are obtained via Gibbs sampling and the process is started again. The algorithm can run these sequences simultaneously M number of times, obtaining M imputed data sets.
The PMM algorithm can be outlined as follows. Let \(X_{obs}\) be the predictors with observed data based on \(n_1\) observations (\(i=1,2,\ldots ,n_1\)), and let \(X_{miss}\) be the predictors with missing data on a target variable y based on \(n_0\) observations (\(j=1,2,\ldots , n_0\)).

1. Obtain \(\hat{\beta }\) based on \(X_{obs}\) and let \(\tilde{\sigma }^2\) be a draw based on the deviations \((y_{obs} - X_{obs}\hat{\beta })'(y_{obs} - X_{obs}\hat{\beta })/\tilde{g}\), where \(\tilde{g}\) is a draw from a \(\chi ^2\) distribution.

2. Draw \(\tilde{\beta }= \hat{\beta }+ \tilde{\sigma }\tilde{z}_1V^{1/2}\), where \(V^{1/2}\) is the square root of the Cholesky decomposition of the cross-products matrix \(S = X_{obs}'X_{obs}\), and \(\tilde{z}_1\) is a p-dimensional vector of N(0, 1) random variates.

3. Calculate \(\tilde{\delta }(i,j) = |X_{obs,[i]}\hat{\beta } - X_{miss,[j]}\tilde{\beta }|\); \(i=1,2,\ldots ,n_1\), \(j=1,2,\ldots ,n_0\).

4. Construct \(n_0\) sets \(W_j\), each containing d candidate donors from \(y_{obs}\), such that \(\sum _{d}\tilde{\delta }(i,j)\) is minimum. Break ties randomly.

5. Randomly draw one donor \(i_j\) from \(W_j\) for \(j=1,2,\ldots ,n_0\).

6. Impute \(\tilde{y}_j = y_{i_j}\), for \(j=1,2,\ldots , n_0\).
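The six steps above can be sketched for a single target variable as follows. This is an illustration under stated assumptions, not the routine in the MICE package: the function name `pmm_impute` is hypothetical, the \(\chi ^2\) draw is assumed to have \(n_1 - p\) degrees of freedom, \(V\) is taken to be the inverse of the cross-products matrix as in a standard Bayesian linear regression draw, and d = 5 donors is an arbitrary choice.

```python
import numpy as np

def pmm_impute(X_obs, y_obs, X_miss, rng, d=5):
    """One predictive-mean-matching draw for a single variable (illustrative sketch)."""
    n1, p = X_obs.shape
    n0 = X_miss.shape[0]
    S = X_obs.T @ X_obs
    V = np.linalg.inv(S)                  # assumption: V = S^{-1}
    beta_hat = V @ X_obs.T @ y_obs
    resid = y_obs - X_obs @ beta_hat
    # Step 1: draw sigma^2 from the residual sum of squares and a chi-square variate.
    g = rng.chisquare(n1 - p)             # assumption: n1 - p degrees of freedom
    sigma = np.sqrt(resid @ resid / g)
    # Step 2: draw beta around beta_hat with covariance sigma^2 * V.
    L = np.linalg.cholesky(V)
    beta_tilde = beta_hat + sigma * (L @ rng.standard_normal(p))
    # Step 3: distances between observed-case and missing-case predictions.
    delta = np.abs((X_obs @ beta_hat)[:, None] - (X_miss @ beta_tilde)[None, :])
    # Step 4: the d closest donors for each missing case.
    # argsort resolves ties by position, not randomly; ties are unlikely
    # with continuous predictors.
    donors = np.argsort(delta, axis=0)[:d, :]          # shape (d, n0)
    # Step 5: pick one donor at random from each candidate set.
    pick = rng.integers(0, d, size=n0)
    chosen = donors[pick, np.arange(n0)]
    # Step 6: impute the donor's observed value.
    return y_obs[chosen]

# Toy example: y depends linearly on x; the last 100 cases are treated as missing.
rng = np.random.default_rng(42)
n1, n0 = 300, 100
x = rng.normal(size=n1 + n0)
y = 2.0 * x + rng.normal(scale=0.5, size=n1 + n0)
X = np.column_stack([np.ones_like(x), x])
y_imp = pmm_impute(X[:n1], y[:n1], X[n1:], rng)
```

Every imputed value is an actually observed value of the target variable, which is why PMM preserves the scale and range of the original data.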
The imputation model includes all 34 scales, school ID, cluster ID and the five initial PVs that were generated in the first step. The interaction terms between gender and all the other scales were also included, and passive imputation was used. Passive imputation is a method for imputing functions (e.g. transformations or interactions) of incomplete variables when both the original and transformed variables are needed in the imputation models. For PV generation, both the original variables and their interactions are required, so missing values on either call for passive imputation (see van Buuren (2012) for more information). For the single imputation, we obtain one complete CQ data set, and for the multiple imputations we obtain five complete CQ data sets.
Generating final PVs
In the third step, we generate the final PVs conditioning on the imputed CQ data set. The item response model is the same as the calibration step except for conditioning on all the CQ variables in Table 1. Five normally approximated PVs are generated using one complete CQ data set. Thus for the MMI approach, we generate 25 PVs using the five imputed data sets. The generated PVs are all placed on the PISA scale (OECD 2009, p. 246). The final PVs are then used in the subsequent analyses.
Analysis
In the analysis step, we are interested in (1) how the distributions of the PVs under the six simulation conditions differ from the distributions of the PVs that are generated from the original questionnaire data; (2) how the correlations of CQ variables under the six conditions differ from the original questionnaire data; and (3) how regression coefficients differ across the six conditions compared to regression results from the original data.
To assess the distributions of PVs under the three planned missing data designs and two imputation approaches, we use the PVs conditioning on the original CQ data as the baseline comparison. The procedure for generating the PVs is the same as in the previous step: five PVs are generated by conditioning on the original data with all scales shown in Table 1. We calculate the mean and the standard errors of the PVs. The mean of PVs is simply the average across the five or 25 PVs. The standard error under the SMI approach is pooled using Rubin’s rules (1987). The standard error under the MMI approach is pooled using the modified combining rules (Rubin 2003). In addition, we also conduct Kolmogorov–Smirnov tests to compare the distributions of PVs.
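For the SMI case, pooling across the five PVs can be sketched with the standard form of Rubin's (1987) rules: the pooled estimate is the mean of the per-PV estimates, and the total variance is the average within-PV variance plus \((1 + 1/m)\) times the between-PV variance. The function name `pool_rubin` and the toy numbers are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m estimates and their sampling variances with Rubin's (1987) rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()              # pooled point estimate
    u_bar = u.mean()              # average within-imputation variance
    b = q.var(ddof=1)             # between-imputation variance
    total = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, np.sqrt(total)  # estimate and pooled standard error

# Toy example: three estimates, each with sampling variance 1.
q_bar, se = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

In the toy example the between-imputation variance is 1, so the total variance is 1 + (4/3) × 1 = 7/3.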
Modified combining rules
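Following Rubin (2003), the modified combining rules are as follows. Let Q represent a quantity of interest, let \(\hat{Q}^{(m,n)}\) represent the mean estimate of the \(m\text{th}\) PV \((m = 1,2,\ldots ,M)\) within the \(n\text{th}\) imputation of the CQ \((n = 1,2,\ldots ,N)\), and let \(U^{(m,n)}\) denote the corresponding variance estimate. Then,
\[\bar{Q} = \frac{1}{MN}\sum _{n=1}^{N}\sum _{m=1}^{M}\hat{Q}^{(m,n)}\]
is the overall average PV across imputations and nests. A vector of N mean estimates over the M PVs within each nest can be formed as
\[\bar{Q}_{n} = \frac{1}{M}\sum _{m=1}^{M}\hat{Q}^{(m,n)}, \quad n = 1,2,\ldots ,N.\]
Let \(\bar{U}\) represent the average of the variance estimates over the N nests and M imputations, written as
\[\bar{U} = \frac{1}{MN}\sum _{n=1}^{N}\sum _{m=1}^{M}U^{(m,n)}.\]
Let \(MS^{(b)}\) be the between-nest mean square,
\[MS^{(b)} = \frac{M}{N-1}\sum _{n=1}^{N}\left( \bar{Q}_{n} - \bar{Q}\right) ^2,\]
and let \(MS^{(w)}\) be the within-nest mean square,
\[MS^{(w)} = \frac{1}{N(M-1)}\sum _{n=1}^{N}\sum _{m=1}^{M}\left( \hat{Q}^{(m,n)} - \bar{Q}_{n}\right) ^2.\]
Then, as shown by Rubin (2003), the total imputation variance of \((Q - \bar{Q})\) can be estimated by
\[T = \bar{U} + \frac{1}{M}\left( 1 + \frac{1}{N}\right) MS^{(b)} + \left( 1 - \frac{1}{M}\right) MS^{(w)}.\]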
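A compact sketch of nested combining rules in the spirit of Rubin (2003) is given below, with N nests (CQ imputations) and M PVs per nest. The specific expression used, \(T = \bar{U} + \frac{1}{M}(1 + \frac{1}{N})MS^{(b)} + (1 - \frac{1}{M})MS^{(w)}\), is the standard nested multiple imputation formula as we understand it and should be checked against Rubin (2003) before use; the function name `pool_nested` and the toy numbers are hypothetical.

```python
import numpy as np

def pool_nested(q_hat, u_hat):
    """Pool nested multiple-imputation results.

    q_hat, u_hat : arrays of shape (N, M) holding the estimate and its variance
                   for the m-th PV within the n-th imputed CQ (nest).
    Returns the pooled estimate and the total variance.
    """
    q_hat = np.asarray(q_hat, dtype=float)
    u_hat = np.asarray(u_hat, dtype=float)
    N, M = q_hat.shape
    q_bar = q_hat.mean()                  # overall average
    q_nest = q_hat.mean(axis=1)           # N nest-level averages
    u_bar = u_hat.mean()                  # average variance estimate
    ms_b = M / (N - 1) * np.sum((q_nest - q_bar) ** 2)               # between nests
    ms_w = np.sum((q_hat - q_nest[:, None]) ** 2) / (N * (M - 1))    # within nests
    total = u_bar + (1.0 / M) * (1.0 + 1.0 / N) * ms_b + (1.0 - 1.0 / M) * ms_w
    return q_bar, total

# Toy example with N = 2 nests and M = 2 PVs per nest.
q_bar, total_var = pool_nested([[1.0, 3.0], [5.0, 7.0]], [[1.0, 1.0], [1.0, 1.0]])
```

For the toy input, the nest means are 2 and 6, the between-nest mean square is 16, the within-nest mean square is 2, and the total variance is 1 + 12 + 1 = 14.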
We calculate the pairwise correlations among all CQ scales, including those between CQ variables and PVs. Because there are multiply imputed CQ data sets under the MMI approach, we take the average of the correlations across the data sets. Then, the bias in correlations is calculated as the difference between the averaged correlations and the true correlations from the original data. As a baseline for comparison, we use the correlations from the original CQ data.
To assess the regression coefficients under the six conditions, we conduct multiple regression analyses by regressing the multiple PVs on the selected scales. Note that we intentionally chose an analytic model which is simpler than the model that is used to generate PVs to reflect the realistic situation in which the researcher may not be aware of the full set of variables that were used in the conditioning model and is instead focusing his/her attention on a small set of theoretically motivated variables.
In the regression analysis, student sampling weights are added to reflect the complex sampling design of PISA 2006 (see OECD (2009), for more details). We then pool the 5 or 25 regression analyses according to Rubin’s rules (1987; 2003, respectively). To have a baseline to compare to, we use the coefficients and standard errors from the regression analysis based on the original data (with the same regression model). We calculate the standardized bias of the coefficient estimates as the difference between the pooled estimates and the counterpart estimates based on the original data, standardized by the standard deviation of the outcome variable. We calculate the ratio of variances as the squared pooled standard errors under each condition over the squared standard errors from the original data. It is expected that the ratio of the variances should be greater than one, since the standard errors under the six conditions must reflect the uncertainty due to the generation of PVs or imputation of the CQ data. The magnitude of the ratio depends on the choice of the design and the uncertainty from the generation of PVs or imputation of the CQ.
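The two evaluation criteria described above reduce to simple arithmetic, sketched here. The function names and the numbers in the example are hypothetical and are not results from the study.

```python
import numpy as np

def standardized_bias(pooled_coef, original_coef, outcome_sd):
    """(pooled estimate - original-data estimate) / SD of the outcome variable."""
    return (np.asarray(pooled_coef, float) - np.asarray(original_coef, float)) / outcome_sd

def variance_ratio(pooled_se, original_se):
    """Squared pooled standard error over squared original-data standard error."""
    return (np.asarray(pooled_se, float) / np.asarray(original_se, float)) ** 2

# Hypothetical coefficients and standard errors for two predictors.
bias = standardized_bias([10.0, -4.0], [12.0, -4.0], outcome_sd=100.0)
ratio = variance_ratio([3.0, 2.0], [2.0, 2.0])
```

A ratio above one, as in the first predictor of the example, indicates that the pooled standard error is larger than the standard error obtained from the original data.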
Results
In this section we first present the results of the marginal distributions of the PVs. This is followed by a presentation of the correlation bias. Finally, we present the results of regression analysis. We show that the marginal distributions of the PVs under the six simulation conditions do not differ from those that are generated from the original questionnaire data. The correlations among CQ variables differ across designs but not the methods. The estimates of regression coefficients differ across the design and the methods.
Marginal PV distributions
The means and standard errors of PVs for the original data and under the six conditions (three designs × two imputation approaches) are presented in Table 5 and plots of the densities of PVs are shown in Fig. 1. Table 6 shows the p values of Kolmogorov–Smirnov tests when comparing the first PV under each of the six simulation conditions to the first PV of the baseline condition. We observe that the means of the marginal distributions are virtually identical across the designs and the methods. This result is consistent with Kaplan and Su (2016). However, in this case, we do observe sizably larger pooled standard errors from MMI versus SMI across designs. This is not unexpected insofar as MMI accounts for greater uncertainty in the imputation process. Interestingly, because the single imputation approach does not take full account of the uncertainty of the planned missing data, the standard errors from the SMI method are smaller than those from the original design. We also notice that the pooled standard errors are different across the designs due to the planned missing data patterns. The Kolmogorov–Smirnov test results show that the distributions of the first PVs are not significantly different from the distribution of the first PV of the original data.
Correlation bias
For the SMI and MMI approaches under each planned missing data design we calculate the pairwise correlations among all CQ scales, including those between CQ variables and PVs. In order to compute the bias in correlations, we use the correlations of the original CQ as the true correlation values. The bias in correlations is calculated as the difference between the averaged correlations across multiple imputed data sets and the true correlations from the original data.
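The bias calculation just described can be sketched as follows. This is an illustrative, unweighted version under the assumption that a plain Pearson correlation is averaged across imputed data sets; the function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Unweighted Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_bias(imputed_pairs, original_pair):
    """Average correlation across imputed data sets minus the original one.

    imputed_pairs: list of (x, y) tuples, one per imputed data set.
    original_pair: (x, y) from the complete original data.
    """
    avg_r = sum(pearson(x, y) for x, y in imputed_pairs) / len(imputed_pairs)
    return avg_r - pearson(*original_pair)


# Hypothetical scale scores: complete data and two imputed versions.
original = ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
imputed = [
    ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]),
    ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 8.0, 6.0]),
]
print(correlation_bias(imputed, original))
```

A negative value means that the imputed data understate the association observed in the original data.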
Figures 2, 3, and 4 plot the correlation biases among CQ variables under MMI against the correlation bias for SMI across the two-form, three-form, and PBIB designs, respectively. We observe that the correlation bias is substantially lower for the three-form design implemented in PISA 2012, compared to the two-form and PBIB designs. We find the PBIB design to perform better with respect to correlation bias compared to the two-form design, even though the overall amount of missing data for these two designs is the same. Little difference is found between the SMI and MMI approaches with respect to correlation bias across the planned missing data designs. For the correlations between CQ variables and PVs, we found no difference across the designs and the methods.^{Footnote 6}
Regression analysis
Figure 5 presents the standardized bias in the estimated regression coefficients across the designs for SMI and MMI, respectively. The scales to the left of the vertical lines in each plot are the variables in the common block. In Fig. 5 we observe that, for both the SMI and MMI methods, less bias is found for scales in the common block than for scales in the rotated blocks, because variables in the common block are not part of the planned missing data. As with the correlation bias results, the three-form design implemented in PISA 2012 shows the least amount of bias in the regression coefficients. For the SMI method, all variables in the three-form design have standardized biases within 0.08 in absolute value, followed by 93% of the variables in the PBIB design and 87% of the variables in the two-form design. For the MMI method, the three-form design and the PBIB design have standardized biases within 0.08 in absolute value for all variables, compared with only 90% of the variables in the two-form design. In addition, the two-form design produces several more extremely biased coefficients, as can be seen in the scales SCINTACT, SCAPPLY, and SCHANDS.
Figure 6 presents the results for the standard errors of the regression coefficients for the SMI and MMI methods. The plots show the ratio of the squared standard errors of the regression coefficients from the three rotation designs over those from the original design. First, we observe that almost all the ratios are larger than one. This is expected because the standard errors from the rotation designs have to account for the uncertainty due to missing data, while the original design does not contain any missing data, resulting in smaller standard errors. Second, we observe that the ratios are much larger using the MMI method than the SMI method. This is also expected because multiple imputation results in larger standard errors than single imputation. Third, for the scales in the common block, the ratios are much closer to one than for the scales in the rotated blocks because there is no missing data in the common block scales. Finally, across the designs, we observe that the ratios from the three-form design are smaller than those for the PBIB design and the two-form design. For the SMI method, none of the standard errors from the three-form design is 100% larger than from the original design; however, 30% of the ratios in the PBIB design and 10% in the two-form design are larger than two. For the MMI method, the two-form design produces much larger standard errors than the other two designs, with 50% of the standard errors at least two times larger than in the original design, followed by 47% in the PBIB design and 10% in the three-form design.^{Footnote 7}
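One way to see why the MMI ratios tend to be larger is to look at the variance components in the nested combining rule. The sketch below follows the nested multiple imputation rule as we understand it from Rubin (2003), with M nests and N imputations within each nest; readers should check the exact form against that source. The function name, the layout of the inputs, and the numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pool_nested(estimates, std_errors):
    """Pool nested multiple imputation results.

    estimates[m][n] and std_errors[m][n] hold the estimate and standard
    error for nest m and within-nest imputation n (M nests of size N).
    """
    m_nests = len(estimates)
    n_within = len(estimates[0])
    nest_means = [mean(row) for row in estimates]
    q_bar = mean(nest_means)                                   # pooled estimate
    u_bar = mean([se ** 2 for row in std_errors for se in row])  # average sampling variance
    b = variance(nest_means)                                   # between-nest variance
    w = mean([variance(row) for row in estimates])             # within-nest variance
    total_var = u_bar + (1 + 1 / m_nests) * b + (1 - 1 / n_within) * w
    return q_bar, sqrt(total_var), {"u_bar": u_bar, "between": b, "within": w}


# Hypothetical results for M = 2 nests with N = 2 imputations each.
estimates = [[1.0, 3.0], [5.0, 7.0]]
std_errors = [[1.0, 1.0], [1.0, 1.0]]
pooled_est, pooled_se, parts = pool_nested(estimates, std_errors)
print(pooled_est, pooled_se, parts)
```

Relative to the single-level rule, the nested rule carries two imputation-related variance terms rather than one, which is consistent with the larger pooled standard errors observed for MMI.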
Overall, due to less missing data, the three-form design produces less bias in the regression coefficients and smaller standard errors than the PBIB design. The PBIB design still performs better than the two-form design even though they have the same amount of missing data. This is because the PBIB design allows all pairs of scales to have data and so the correlations among all pairs of scales can be preserved, which is not the case with the two-form design.
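The pair-coverage argument can be checked mechanically for any rotation design. The sketch below counts how often each pair of scales appears on the same form and lists the pairs that are never jointly observed. The two toy designs are hypothetical and much smaller than the designs studied here; the second is simply a rotation in which every pair is covered, not the PBIB design itself.

```python
from collections import Counter
from itertools import combinations


def pair_counts(forms):
    """Count how many forms contain each pair of scales."""
    counts = Counter()
    for form in forms:
        for pair in combinations(sorted(form), 2):
            counts[pair] += 1
    return counts


def uncovered_pairs(forms):
    """Return the pairs of scales that never appear on the same form."""
    all_scales = sorted(set().union(*forms))
    counts = pair_counts(forms)
    return [p for p in combinations(all_scales, 2) if counts[p] == 0]


# Toy two-form design: a common scale C plus one rotated scale per form.
toy_two_form = [{"C", "R1"}, {"C", "R2"}]
# Toy rotated design in which every pair of scales shares at least one form.
toy_rotated = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "C", "D"}]

print(uncovered_pairs(toy_two_form))
print(uncovered_pairs(toy_rotated))
print(sorted(set(pair_counts(toy_rotated).values())))
```

In the toy two-form design the two rotated scales are never observed together, so their correlation rests entirely on the imputation model. In the toy rotated design every pair is observed at least once, and the unequal pair counts illustrate the idea of multiple associate classes.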
Conclusions
This paper expanded on earlier work by Weirich et al. (2014) in two ways. First, we showed that it is possible to use PVs simply and directly via nested multiple imputation. Consistent with the results of Weirich et al. (2014), we found that nested multiple imputation with PVs provides considerable bias reduction, as expected under the framework of congenial missing data models. Also consistent with Weirich et al. (2014), our findings showed relatively similar results for SMI and MMI. It should be pointed out that it is possible to implement a procedure that combines the PVs and CQ in one algorithm for simultaneous imputation (see Aßman et al. (2015)). The approach of Aßman et al. (2015) was studied under general missing data in the CQ but should be compared to the SMI and MMI approaches in the context of planned missing data in future studies. Second, we showed that the three-form design as implemented in PISA 2012 performed better in terms of correlation and regression bias reduction compared to the PBIB design examined in Kaplan and Su (2016) and the two-form design of Adams et al. (2013). The bias reduction in the partially balanced incomplete block design is still better than what is achieved under the two-form design with the same overall amount of missing data. Further studies investigating variations of incomplete block designs for CQs are still needed because there are many other design possibilities (e.g., amount of missing data, missingness on items within scales, etc.) that may be well-suited to large-scale educational assessments.
To conclude, the cumulative research on multiple imputation methods (Schafer and Graham 2002; Reiter and Raghunathan 2007; Harel 2007; Rubin 2003) applied to context questionnaires (Aßman et al. 2015; Adams et al. 2013; Kaplan and Su 2016; Weirich et al. 2014) shows relatively minimal impact on the marginal distributions of PVs and the joint relations of PVs with context questionnaire scales. The present study adds to the literature by comparing three planned missing data designs under two approaches to multiple imputation. Given that a common concern facing most national and international large-scale assessments is the desire to present as much content as possible without overburdening the participants in the survey, and furthermore given increased interest in the so-called “noncognitive” outcomes of education, we argue that the approach to questionnaire matrix sampling and imputation described in this paper should be given serious consideration.
Notes
1. For this paper, we focus on scales rather than the items that make up the scales. Matrix sampling of items within scales is a topic that is beyond the scope of this paper.
2. This paper was eventually published as Rubin (1996).
3. This latent regression is often referred to as the conditioning model or population model.
4. Associate classes are a feature of incomplete block designs and refer to the number of times a pair of scales (or items or variables) appear together. In a balanced incomplete block design, the associate classes are a constant; that is, the number of times a pair of scales appear together is the same for all pairs. For a partially balanced incomplete block design, we have multiple associate classes. The number of times a pair of scales appear together is different across pairs.
5. Preliminary analyses using WLEs show very little substantive difference.
6. Raw data tables of the correlation biases are available on request.
7. Raw data tables of the regression and standard error biases are available on request.
References
Adams, R. J., Lietz, P., & Berezner, A. (2013). On the use of rotated context questionnaires in conjunction with multilevel item response models. Large-scale Assessments in Education. Retrieved from http://www.largescaleassessmentsineducation.com/content/1/1/5.
Adams, R. J., Wu, M. L., & Wilson, M. R. (2015). ACER conquest 4.0. Melbourne: ACER.
Aßman, C., Gaasch, C., Pohl, S., & Carstensen, C. H. (2015). Bayesian estimation in IRT models with missing values in background variables. Psychological Test and Assessment Modeling, 57, 505–618.
Chang, H.-H., & Stout, W. F. (1993). The asymptotic posterior normality of the latent trait in an IRT model. Psychometrika, 58, 37–52.
Frey, A., Hartig, J., & Rupp, A. A. (2009). An NCME instructional module on booklet designs in large-scale assessments of student achievement: Theory and practice. Educational Measurement: Issues and Practice, 28, 39–53.
Gonzalez, E., & Rutkowski, L. (2010). Principles of multiple matrix booklet designs and parameter recovery in large-scale assessments. IEA-ETS Research Institute Monograph, 3, 125–156.
Graham, J. W., Taylor, B. J., Olchowski, A. E., & Cumsille, P. E. (2006). Planned missing data designs in psychological research. Psychological Methods, 11, 323–343.
Harel, O. (2007). Inferences on missing information under multiple imputation and two-stage multiple imputation. Statistical Methodology, 4, 75–89.
Kaplan, D., & McCarty, A. T. (2013). Data fusion with international large scale assessments: A case study using the OECD PISA and TALIS surveys. Large-scale Assessments in Education. Retrieved from http://www.largescaleassessmentsineducation.com/content/1/1/6.
Kaplan, D., & Su, D. (2016). On matrix sampling and imputation of context questionnaires with implications for the generation of plausible values in large-scale assessments. Journal of Educational and Behavioral Statistics, 41, 51–80.
Kiefer, T., Robitzsch, A., & Wu, M. (2014). TAM: Test analysis modules (Computer software manual). Retrieved from http://CRAN.R-project.org/package=TAM (R package version 1.03.181).
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Masters, G., & Wright, B. (1997). The partial credit model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 101–122). New York: Springer.
Meng, X. L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 9, 538–558.
Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177–196.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133–161.
Montgomery, D. C. (2012). Design and analysis of experiments (8th ed.). New York: Wiley.
OECD. (2009). PISA 2006 technical report. Paris: OECD.
OECD. (2014). PISA 2012 technical report. Paris: OECD.
R Core Team. (2017). R: A language and environment for statistical computing (Computer software manual). Vienna, Austria. Retrieved from https://www.R-project.org/.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Nielsen & Lydiche.
Rässler, S. (2002). Statistical matching: A frequentist theory, practical applications, and alternative Bayesian approaches. New York: Springer.
Reiter, J. B., & Raghunathan, T. (2007). The multiple adaptations of multiple imputation. Journal of the American Statistical Association, 102, 1462–1471.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Hoboken: Wiley.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473–489.
Rubin, D. B. (2003). Nested multiple imputation of NMES via partially incompatible MCMC. Statistica Neerlandica, 57, 3–18.
Rutkowski, L. (2011). The impact of missing background data on subpopulation estimation. Journal of Educational Measurement, 48, 293–312.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147–177.
Shoemaker, D. M. (1973). Principles and procedures of multiple matrix sampling. Oxford: Ballinger.
van Buuren, S., & Groothuis-Oudshoorn, K. (2010). Multivariate imputation by chained equations, version 2.3. Retrieved from http://www.multipleimputation.com/.
van Buuren, S. (2012). Flexible imputation of missing data. New York: Chapman & Hall.
von Davier, M. (2014). Imputing proficiency data under planned missingness in population models. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis. Boca Raton: Chapman Hall/CRC.
von Davier, M., Gonzalez, E., & Mislevy, R. (2009). Plausible values: What are they and why do we need them? IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 2, 9–36.
Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.
Weirich, S., Haag, N., Hecht, M., Böhme, K., Siegle, T., & Lüdtke, O. (2014). Nested multiple imputation in large-scale assessments. Large-scale Assessments in Education, 2. Retrieved from http://www.largescaleassessmentsineducation.com/content/2/1/9.
Wu, M. (2005). The role of plausible values in large-scale surveys. Studies in Educational Evaluation, 31, 114–128.
Authors’ contributions
DK conceptualized the study, guided the design of the study and the statistical analysis, and contributed to drafting the manuscript. DS contributed to the design and development of the software to conduct the analysis, carried out the analysis, and contributed to drafting the manuscript. Both authors read and approved the final manuscript.
Acknowledgements
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The datasets supporting the conclusions of this article are available at http://bise.wceruw.org/index.html. The software used to support the conclusions of this article is included within the article and is also available at http://bise.wceruw.org/index.html.
Funding
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.