On the use of rotated context questionnaires in conjunction with multilevel item response models
Large-scale Assessments in Education, volume 1, Article number: 5 (2013)
Abstract
Background
While rotated test booklets have been employed in large-scale assessments to increase the content coverage of the assessments, rotation has not yet been applied to the context questionnaires administered to respondents.
Methods
This paper describes the development of a methodology that uses rotated context questionnaires in conjunction with multilevel item response models and plausible values. In order to examine the impact of this methodology on the continuity of the results, PISA 2006 data for nine heterogeneous countries were rescaled after having been restructured to simulate the outcomes of the use of different rotated context questionnaire designs.
Results
Results revealed negligible differences when means, standard deviations, percentiles, and correlations were estimated using plausible values drawn with multilevel item response models that adopted different approaches to questionnaire rotation.
Conclusions
The results of the analyses support the use of rotated contextual questionnaires for respondents in order to extend the methodology currently used in large-scale sample surveys.
Background
A common goal of sample surveys is to measure a latent variable (proficiency, an aptitude, an attitude, or the like) and then relate that latent variable to other characteristics of the respondents. For example, in an educational context, the relationship that is examined might be the correlation between the latent variable and another characteristic, such as years of schooling, or it might be between-group differences in mean scores for a latent variable. The ultimate aim of the survey is to examine the distribution of the latent variable in the target population and to make inferences concerning the relationships between latent variables and other variables in the target population. In psychometrics, the science of constructing measures of latent variables, it is generally accepted that measures of latent variables are fallible and include random error components that must be taken into consideration when such inferences are being made. Cochran (1968), for example, argues that when measurement error in latent variables is ignored, most statistical tests are vitiated.
The study of statistical models with errors in variables is a well-developed area of statistical and psychometric inquiry. Its extensive body of literature dates back to at least Adcock (1878, cited in Gleser, 1981), and from there to Gleser (1981), Anderson (1984), Mislevy (1985), Fuller (1987), and Adams, Wilson, and Wu (1997). Econometricians were the first to extensively study models with errors in the variables, and their use in econometrics became widespread (Anderson, 1984). In psychological and educational research, the presence of substantial measurement error resulted in the development of linear structural relation (or LISREL) models (see, for example, Jöreskog & Sörbom 1984; Muthén 2002) and latent regression (also known as multilevel item response theory) models (Adams, Wu, & Carstensen, 2007; Fox & Glas 2002).
In the context of large-scale sample survey studies,^{a} multilevel item response theory models have been the method of choice for investigators undertaking appropriate data analysis in the presence of measurement error. There appear to be three primary reasons for this choice. First, the models are scalable; that is, they are methods that have been demonstrated to work well in contexts with many thousands of sampled respondents, many latent variables, and hundreds of manifest variables. Second, they can be integrated with other key components of sample survey methodology, in particular the weighting and sampling variance estimation that is required in structured multistage samples. And, third, they can be broken into discrete steps so that the study developers can construct a database and secondary analysts can then use standard and readily accessible analytic tools to analyze the data in ways that properly deal with the impact of the presence of measurement error (Adams, 2002; Adams, Wu, & Macaskill 2007; Gonzalez, Galia, & Li 2004; Mislevy 1990).
Researchers exploring PISA, NAEP, and TIMSS data have used the multilevel item response theory approach to examine the relationships between a small number of latent proficiency variables, for example, three to seven such variables in the case of PISA, and quite a large number of other variables collected via respondent contextual questionnaires. To ensure adequate content coverage of the latent proficiency variables, PISA, NAEP, and TIMSS all use multiple linked test booklets, which means that although each respondent responds to just 60 (NAEP) to 120 (PISA) minutes of assessment material, the total sum of assessment material used far exceeds this amount.
As noted, each of these studies routinely uses linked (or rotated) assessment booklets, a process often referred to as a multiple-matrix sampling design (Shoemaker 1973). However, in order to broaden the assessment while limiting individual response burden, the studies rely on a single set of contextual variables being administered to all respondents. No attempt, as far as we are aware, has been made thus far to apply such a rotated design to the context questionnaires, and thereby extend the number of contextual variables beyond that which can be obtained from a single common questionnaire administered to all respondents. Gonzalez and Eltinge (2007, 2007b), however, have discussed the possibility of using rotated questionnaires in the US Consumer Expenditure Quarterly Interview.
In this paper, we explore the possibility of administering rotated context questionnaires to respondents in order to expand the coverage of contextual variables in sample surveys that employ multilevel item response theory scaling models. In addition, we examine how a changed methodology might affect the continuity of results with respect not only to the latent proficiency variables themselves but also to their correlations with the context constructs. The specific context for our work is the PISA survey.
Although the idea of having rotated forms of the respondent context questionnaires in order to extend their content coverage is appealing, the situation for these questionnaires differs slightly from that for the test booklets. To illustrate this difference, we provide an overview of the PISA analysis approach and then follow it with an explanation of the difference between using data from the test booklets and using data from the respondent context questionnaire in the multilevel scaling model used in PISA.
The PISA analysis approach
PISA is a cyclical cross-sectional study, with data collections occurring every three years. Four PISA assessments have now been completed (OECD 2001, 2004, 2007, 2010) and a fifth is being implemented. Here we discuss the third cycle of PISA, the data collection that occurred in 2006 (referred to as PISA 2006). Our focus on the third cycle reflects our decision to use data from PISA 2006 to explore the potential use of rotated questionnaires.
PISA 2006 tested three subject domains, with science as the major domain and reading and mathematics as the minor domains. PISA allocates more assessment time to a major domain than it does to the minor domains, and typically reports subscales for major domains but not for minor ones. During PISA 2006, 108 test items, representing approximately 210 minutes of testing time, were used to assess student achievement in science. The reading assessment consisted of 28 items, and the mathematics assessment consisted of 48 items, representing approximately 60 minutes of testing time for reading and 120 minutes for mathematics.
The 184 main survey items were allocated to 13 half-hour (30-minute) mutually exclusive item clusters that included seven science clusters, four mathematics clusters, and two reading clusters. Thirteen test booklets were produced, each composed of four clusters according to a rotated design. This approach resulted in 120-minute test booklets consisting of two 60-minute parts, each made up of two of the 30-minute clusters, with students allowed a short break 60 minutes after the start of the test.
The booklet design was such that each cluster appeared in each of the four possible positions within a booklet exactly once, and each cluster occurred once in conjunction with each of the other clusters. Each test item, therefore, appeared in four of the test booklets. This linked design made it possible, when estimating item difficulties and student proficiencies, to apply standard measurement techniques to the resulting student response data (OECD, 2008). Student performance results were reported in terms of one overall scale in science, five science subscales, one overall scale for mathematics, and one overall scale for reading.
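The two balance properties described above (every cluster once in each position, every pair of clusters together exactly once) can be illustrated with a small sketch. It builds a cyclic design from the perfect difference set {0, 1, 3, 9} modulo 13. This construction is an assumption chosen for illustration; it is not claimed to reproduce the cluster-to-booklet allocation actually used in PISA 2006.

```python
from itertools import combinations

N_CLUSTERS = 13
# Perfect difference set modulo 13 (illustrative assumption, not the
# operational PISA 2006 allocation).
OFFSETS = (0, 1, 3, 9)


def build_booklets(n_clusters=N_CLUSTERS, offsets=OFFSETS):
    """Return one tuple per booklet; entry k is the cluster in position k."""
    return [tuple((b + d) % n_clusters for d in offsets) for b in range(n_clusters)]


def check_design(booklets, n_clusters=N_CLUSTERS):
    """Check position balance and pairwise balance of a booklet design."""
    # Each cluster must appear exactly once in each of the four positions.
    for pos in range(len(booklets[0])):
        if sorted(b[pos] for b in booklets) != list(range(n_clusters)):
            return False
    # Each pair of clusters must share a booklet exactly once.
    pair_counts = {}
    for b in booklets:
        for pair in combinations(sorted(b), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return all(pair_counts.get(p, 0) == 1
               for p in combinations(range(n_clusters), 2))


booklets = build_booklets()
design_ok = check_design(booklets)
```

Because each cluster occupies each of the four positions once, every item in that cluster appears in four booklets, which matches the count given in the text.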
Fitting a multilevel item response model
The PISA research team used the mixed coefficients multinomial logit (MCML) model, as described by Adams, Wilson and Wang (1997a) and Adams and Wu (1997), to scale the 2006 data, and they used the ConQuest software (Wu et al. 1997) to carry out the process. Details of the scaling can be found in Adams (2002). We provide a limited sketch of the process here so as to contextualize the extension to the methodology that we explore in this paper.
The multilevel scaling model used consists of two components: a conditional item response model, f_x(x; ξ | θ), and a population model, f_θ(θ; γ, Σ, W). The conditional item response model describes the relationship between the observed item response vector x and the latent variables, θ. The ξ parameters characterize the items. The population model, which describes the distribution of the latent variables and the relationship between the contextual variables and the latent variables, is a multivariate multiple regression model, where γ are the regression coefficients that are estimated, Σ is the conditional covariance matrix, and W are the contextual variables.
The conditional item response model and the population model are combined to obtain the unconditional, or marginal, item response model:
\[
f_x(x; \xi, \gamma, \Sigma, W) = \int_{\theta} f_x(x; \xi \mid \theta)\, f_\theta(\theta; \gamma, \Sigma, W)\, d\theta \qquad (1)
\]
It is important to recognize that, under this model, the locations of respondents on the latent variables are not estimated. The parameters of the model are γ, Σ, and ξ, where γ and Σ are the population parameters and ξ are the item parameters.
Directly estimating γ and Σ from the item response vectors bypasses the problem of having fallible estimates of latent proficiencies, that is, the problems caused by measurement error that were discussed in the introduction. This approach also yields unbiased estimates of population characteristics, assuming, of course, that the data satisfy the assumptions of the scaling and regression models.
The item response model used in (1) does not require the same complete list of item responses for all respondents. So, provided that the item response data are missing at random (Rubin 1976), which is the case with the rotated test booklet designs used in PISA, this model is well suited to incomplete designs. However, this is not the case for the population model, which in its PISA implementation requires complete data.
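To make the marginal model concrete, the following sketch evaluates it numerically for a single respondent. It rests on simplifying assumptions that are not taken from the paper: one latent dimension, a Rasch item response model, a normal population model with mean W·γ, and a plain grid sum in place of the estimation algorithm used in ConQuest. All parameter values are hypothetical. Items that were not in the respondent's booklet are marked `None` and simply drop out of the product, which is why the item response component tolerates incomplete designs.

```python
import math


def rasch_likelihood(responses, difficulties, theta):
    """Conditional likelihood f_x(x; xi | theta) under a Rasch model.

    Responses coded None (item not administered) are skipped.
    """
    like = 1.0
    for x, xi in zip(responses, difficulties):
        if x is None:
            continue
        p = 1.0 / (1.0 + math.exp(-(theta - xi)))
        like *= p if x == 1 else 1.0 - p
    return like


def normal_pdf(theta, mean, sd):
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def marginal_likelihood(responses, difficulties, w, gamma, sd,
                        n_points=201, width=6.0):
    """Approximate the integral over theta with a sum on an evenly spaced grid."""
    mean = sum(wk * gk for wk, gk in zip(w, gamma))  # regression mean W * gamma
    lo = mean - width * sd
    step = 2.0 * width * sd / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        theta = lo + i * step
        total += (rasch_likelihood(responses, difficulties, theta)
                  * normal_pdf(theta, mean, sd) * step)
    return total


difficulties = [-1.0, 0.0, 1.0, 2.0]  # hypothetical item parameters (xi)
responses = [1, 1, None, 0]           # None: item not in this student's booklet
w = [1.0, 0.5]                        # hypothetical intercept and one covariate
gamma = [0.2, 0.4]                    # hypothetical regression coefficients
likelihood = marginal_likelihood(responses, difficulties, w, gamma, sd=1.0)
```

Note that the population component has no analogous way of skipping a missing entry of `w`: the regression mean cannot be formed without every conditioning variable, which is the asymmetry the paragraph above points to.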
Plausible values
Currently, only a limited range of researchers are able to implement methodologies that permit the estimation of γ and Σ for the set of contextual variables that are of interest to them.^{b} Therefore, to support further analysis, PISA uses the imputation methodology usually referred to as plausible values (Mislevy 1991) during construction of its public access databases.
Plausible values are intermediate values that are used in the algorithm that is implemented in ConQuest to estimate the parameters of (1) (Volodin & Adams 1997). PISA plausible values are sets of imputed proficiencies that are provided, per respondent, for all latent variables included in the scaling. They are thus random draws from the estimated posterior proficiency distribution for each student. Adams (2002) details how the random draws are made. The theory supporting the use of the plausible value approach can be found in Rubin (1987) and Mislevy (1991); Beaton and Gonzalez (1995) provide an overview of how plausible values should be used.
A key feature of plausible values is that they allow the results obtained from fitting (1), in particular the regression coefficients γ and the conditional covariance matrix Σ, to be recovered without the need to access the specialist software required to fit model (1). They can also be used to estimate the parameters of any submodel of the regression model used in (1).
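The idea of a plausible value as a random draw from a respondent's posterior proficiency distribution can be sketched as follows. This is a minimal illustration under assumed simplifications (one latent dimension, a Rasch model, a normal prior whose mean stands in for the regression prediction, and sampling from a discrete grid). It is not the drawing procedure documented in Adams (2002), and every numerical value is hypothetical.

```python
import math
import random


def posterior_grid(responses, difficulties, prior_mean, prior_sd,
                   n_points=121, width=5.0):
    """Return grid points and normalized posterior weights for one respondent."""
    step = 2.0 * width * prior_sd / (n_points - 1)
    grid, weights = [], []
    for i in range(n_points):
        theta = prior_mean - width * prior_sd + i * step
        like = 1.0
        for x, xi in zip(responses, difficulties):
            if x is None:  # item not administered
                continue
            p = 1.0 / (1.0 + math.exp(-(theta - xi)))
            like *= p if x == 1 else 1.0 - p
        prior = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        grid.append(theta)
        weights.append(like * prior)  # posterior is proportional to likelihood x prior
    total = sum(weights)
    return grid, [wt / total for wt in weights]


def draw_plausible_values(grid, probs, n_draws=5, rng=None):
    """Draw n_draws values from the discretized posterior."""
    rng = rng or random.Random(0)
    return rng.choices(grid, weights=probs, k=n_draws)


def posterior_mean(grid, probs):
    return sum(g * p for g, p in zip(grid, probs))


difficulties = [-1.0, 0.0, 1.0]                          # hypothetical item parameters
grid_hi, probs_hi = posterior_grid([1, 1, 1], difficulties, 0.0, 1.0)
grid_lo, probs_lo = posterior_grid([0, 0, 0], difficulties, 0.0, 1.0)
pvs = draw_plausible_values(grid_hi, probs_hi)           # five draws, as in PISA
```

Because the prior mean carries the information in the conditioning variables, two respondents with identical item responses but different contextual data would receive draws from different posteriors.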
In PISA, the regression parameters in (1) are estimated on a country-by-country basis. Similarly, plausible values are drawn on a country-by-country basis. But before this model can be estimated, it is also necessary to select the contextual variables, W, that will be used in each country. In PISA and NAEP, these variables are referred to as conditioning variables. The steps used to prepare the conditioning variables in PISA are based upon those used in NAEP (Beaton 1987) as well as in TIMSS (Macaskill, Adams, & Wu, 1998), and are given below:

Step 1: Three variables (booklet ID, school ID, and gender) are prepared so they can be directly used as conditioning variables. Variables for booklet ID are represented by deviation contrast codes (Pedhazur 1997). Each booklet other than a reference booklet is represented by one variable. Variables for school ID are coded using simple contrast codes, with the largest school as the reference school.

Step 2: Each categorical variable in the student questionnaire is dummy coded. Details of this dummy coding can be found in the PISA 2006 technical report (OECD 2008). For variables treated as continuous (including questionnaire indices constructed using item response theory), missing values are replaced with the country mean, and a dummy variable indicating a missing response is created.

Step 3: For each country, a principal components analysis of the dummy-coded categorical and continuous variables is performed, and component scores are produced for each student. The number of components retained must be sufficient to account for 95% of the variance in the original variables.

Step 4: The item response model is fitted to each national dataset. The national population parameters are estimated using item parameters anchored at their international location, and conditioning variables are derived from the national principal components analysis and from Step 1.

Step 5: Five vectors of plausible values are drawn using the method described above. The vectors provide a plausible value for each of the PISA 2006 reporting scales.
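Parts of Steps 1 to 3 can be sketched in a few lines. The functions below show deviation (effect) coding with a reference category, 0/1 dummy coding, and the rule for choosing how many principal components to retain. The function names, the treatment of a missing response as its own dummy category, and the eigenvalues in the example are illustrative assumptions; the operational coding is documented in the PISA 2006 technical report and is not reproduced here.

```python
def deviation_codes(values, categories, reference):
    """Deviation (effect) coding: one column per non-reference category.

    A non-reference category scores 1 on its own column and 0 elsewhere;
    the reference category scores -1 on every column.
    """
    others = [c for c in categories if c != reference]
    rows = []
    for v in values:
        if v == reference:
            rows.append([-1] * len(others))
        else:
            rows.append([1 if v == c else 0 for c in others])
    return rows


def dummy_codes(values, categories):
    """0/1 indicator per category; None can be listed as its own category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def n_components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose variance share meets the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)


booklet_codes = deviation_codes(["B1", "B2", "B3"], ["B1", "B2", "B3"], reference="B3")
item_codes = dummy_codes(["a", None], ["a", "b", None])
n_keep = n_components_for_variance([6.0, 3.0, 0.9, 0.1])  # hypothetical eigenvalues
```

The eigenvalues themselves would come from a principal components analysis of the coded variables, which is omitted here to keep the sketch free of external libraries.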
The pool of candidate variables for W consists of the variables in the student contextual questionnaire. Until now, these have been limited to the number of variables that can be obtained through the administration of a single 30-minute contextual questionnaire to all respondents. This situation raises the question of whether the multilevel item response theory methodology, which easily handles rotated assessment booklets, can be implemented with rotated contextual questionnaires. The question is asked because, in the case of rotated contextual variables, the set of candidate variables for W will differ for different respondents.
The motivation for using rotated questionnaires in order to extend the coverage of the student questionnaire stems from several somewhat related reasons. First, and just as is the case with the multiple matrix sampling for the test, it is desirable to limit the amount of time students are required to concentrate on completing a questionnaire while simultaneously providing opportunity to increase the content coverage of the questionnaires. Second, once various cognitive domains have been assessed, the number of variables or constructs that are thought to be related to performance in the different domains ends up being larger than if only one domain had been assessed. Third, given that in most countries the variance in performance between students exceeds the variance in performance between schools, it is necessary to seek out in the student questionnaire a greater number of variables for inclusion that can be used subsequently to describe performance differences between students.
However, before going down the path of rotating context questionnaires in PISA, we need to address three questions:

1. Is it possible to develop a methodology that uses rotated contextual questionnaires?

2. Will a change in the methodology to rotated questionnaires have an impact on the continuity of PISA results?

3. Will such a change affect the estimated relationships between the context variables and performance?
In the following section, we explore five alternative approaches to allocating contextual variables to rotated booklets, and we examine the effects of these approaches on the estimated distributions of the latent proficiency. More specifically, we examine how the means and distributions of the latent proficiency variables are affected when two forms of the PISA student context questionnaire (StQ) are used. The two forms are rotated so that all students respond to questions in a common part of the questionnaire, while half of the students respond to questions in one of the two rotated parts and the other half respond to questions in the other rotated part. Our analyses also address whether results differ depending on how student context constructs are assigned to the rotated forms.
We acknowledge the possibility of using other rotated questionnaire designs. One such design, for example, could involve three rotated forms, with each construct included in two of the three forms. Such an overlap would enable a wider range of subsequent analyses because it would allow calculation of the correlations between the constructs in the different forms. This approach, however, would reduce the additional space gained as a consequence of rotation. Furthermore, because the main aim of this paper is to examine the possible implications of a rotated questionnaire design for the proficiency estimates rather than search for the optimal questionnaire rotation design, we elected to use the twoform design.
Methods
To explore alternative approaches to using rotated context questionnaires, we rescaled the PISA 2006 data (OECD, 2007b) for nine countries after restructuring this information to simulate the outcomes of the use of rotated context questionnaires. During our research, we considered two rotation designs, each of which consisted of two questionnaire forms. In both designs, the two questionnaire forms shared a common set of variables. Each form also contained a variable set unique to it. To achieve this design, we divided the available pool of questions into three mutually exclusive subsets of variables. We assigned the first subset, the common set, to both questionnaire forms, and then assigned the second subset to the first rotated questionnaire form and the third subset to the second rotated questionnaire form.
The difference between the two rotation designs lies in the approach that we used to generate the variable sets in the rotated parts. While the common set was fixed to be the same in both designs, the method of constructing the rotated part differed. In Design 1, we constructed the variable sets so that each had a similar correlation with science performance (Variable Set 1.1 and Variable Set 1.2). In Design 2, we constructed the variable sets so that one had a lower correlation with science performance (Variable Set 2.1) and the other had a higher correlation with performance (Variable Set 2.2). This second design enabled us to ascertain whether the correlations between the questions and performance would be likely to matter if questions were assigned, in actuality, to the rotated parts of a questionnaire. In other words, the two designs allowed us to explore what impact the constructs in the rotated parts of the questionnaire had on various aspects of the proficiency estimates (i.e., means, standard deviations, percentiles, correlations), with that impact dependent on whether the constructs had a similar or a different relationship with performance.
Table 1 provides a summary of the two designs, and therefore illustrates how we allocated the variable sets to the questionnaire forms. Note, in particular, that the variable sets in the rotated parts of the questionnaire included constructs formed from individual variables and that the items forming a construct were not split across the questionnaire forms.
Table 2 sets out the variables contained in the common part of the questionnaire. These variables consisted of the major reporting variables—age, gender, grade, parental occupation and education, immigration status, age at which the student (if an immigrant) arrived in the country, and language spoken at home. The variable called effort in the table relates to a question that asked students to indicate the level of effort they put into the PISA achievement test compared with other tests they had taken. The three remaining constructs in the table are based on responses to questions regarding cultural and other possessions as well as educational resources available at home.
The allocation of constructs to the sets in the rotated parts of the questionnaire involved the following steps. We began by calculating, at the student level for each country, correlations between each construct and each of the proficiencies in the content domains (i.e., mathematics, reading, and science). We then used these results to compute the average country-level correlations between each variable set and the performance for all countries and for OECD countries only. Finally, we allocated the constructs to the two sets using the results of the second step so that the average correlations of the two sets with achievement at the student level were similar for rotated Forms 1.1 and 1.2 and differed for Forms 2.1 and 2.2.
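One simple way to obtain the two kinds of allocation is sketched below. The paper does not spell out the allocation rule, so the "snake" assignment used for the similar-correlation design and the median split used for the contrasting design are assumptions, and the construct names and correlations are made-up illustrative values rather than PISA results.

```python
def allocate_similar(corr):
    """Design 1 style: two sets with similar average correlation.

    Constructs are ranked by correlation and dealt out in a snake pattern
    (ranks 1, 4, 5, 8, ... to the first set; ranks 2, 3, 6, 7, ... to the second).
    """
    ordered = sorted(corr, key=corr.get, reverse=True)
    set_1 = [name for i, name in enumerate(ordered) if i % 4 in (0, 3)]
    set_2 = [name for i, name in enumerate(ordered) if i % 4 in (1, 2)]
    return set_1, set_2


def allocate_contrasting(corr):
    """Design 2 style: one low-correlation set and one high-correlation set."""
    ordered = sorted(corr, key=corr.get)
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]


def mean_correlation(names, corr):
    return sum(corr[name] for name in names) / len(names)


# Hypothetical average correlations between constructs and science proficiency.
corr = {"c1": 0.40, "c2": 0.35, "c3": 0.30, "c4": 0.25,
        "c5": 0.20, "c6": 0.15, "c7": 0.10, "c8": 0.05}

similar_1, similar_2 = allocate_similar(corr)
low_set, high_set = allocate_contrasting(corr)
```

With these illustrative values the two similar sets have the same mean correlation, while the contrasting sets differ clearly. Constructs with negative correlations would need a decision on whether to rank by signed or absolute value.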
Table 3 details which variables were allocated to the variable sets in the rotated parts of the questionnaire, and Table 4 provides the outcomes of that allocation in terms of correlations with science (as the major domain) proficiency in PISA 2006. We allocated the different constructs to the two forms of the questionnaire in such a way that responses from only half the students to each of the four sets of variables were retained, resulting in missing information on this set of variables from the other half of the student sample (see Table 1 above).
When scaling the data from each design, we implemented three approaches:

Common part conditioning: With this reduced conditional model, the information from only the variables in the common part of the questionnaire was used for comparative purposes.

Joint conditioning: Here, the two questionnaire forms were used jointly for each design. In practice, this meant having one conditioning model for each country and then setting the data for one set of variables to missing for one half of the students and setting the data for the other set of variables to missing for the other half of the students. Information from the common part for all students was included in the model.

Separate conditioning: This approach involved using the two questionnaire forms separately for each design, and that, in turn, meant running separate conditioning models, with one using only data from the first rotated form and the other using only data from the second rotated form. The information from the common part of the questionnaire for all students was also included in the model.
When taking the joint conditioning approach, we replaced missing information with the mean for that construct and also included a dummy variable indicating missing. We then replaced the missing information for the categorical variables with the mode for that variable and again included a dummy variable indicating missing. Thus, our analyses involved inclusion of two variables for each background item, one indicating the actual response of a student or the mean/mode if the response was missing, and the other variable indicating whether the response was not missing (=0) or missing (=1).
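The restructuring used for joint conditioning can be sketched as follows: complete records are assigned at random to one of two forms, the variables that a student's form would not have contained are blanked, and each blanked variable is then replaced by the mean (continuous) or mode (categorical) together with a 0/1 missing indicator. The variable names, the six example records, and the random assignment mechanism are hypothetical and serve only to illustrate the data structure.

```python
import random
from statistics import mean, mode


def rotate(records, set_1, set_2, rng):
    """Assign half the students to each form and blank the variables not on their form."""
    forms = [1, 2] * (len(records) // 2) + [1] * (len(records) % 2)
    rng.shuffle(forms)
    rotated = []
    for record, form in zip(records, forms):
        row = dict(record, form=form)
        for name in (set_2 if form == 1 else set_1):
            row[name] = None  # structurally missing because of the rotation
        rotated.append(row)
    return rotated


def fill_with_indicator(rows, name, categorical=False):
    """Replace missing values by the mean or mode and add a missing indicator."""
    observed = [row[name] for row in rows if row[name] is not None]
    fill = mode(observed) if categorical else mean(observed)
    for row in rows:
        row[name + "_missing"] = 1 if row[name] is None else 0  # 0 = observed, 1 = missing
        if row[name] is None:
            row[name] = fill
    return rows


# Hypothetical complete records: one common variable and two rotated variables.
records = [
    {"ses": 0.1, "interest": 1.2, "club": "yes"},
    {"ses": -0.4, "interest": 0.3, "club": "no"},
    {"ses": 0.9, "interest": -0.5, "club": "no"},
    {"ses": 0.0, "interest": 0.8, "club": "yes"},
    {"ses": -1.1, "interest": -1.0, "club": "no"},
    {"ses": 0.5, "interest": 0.2, "club": "no"},
]

rows = rotate(records, set_1=["interest"], set_2=["club"], rng=random.Random(1))
rows = fill_with_indicator(rows, "interest")
rows = fill_with_indicator(rows, "club", categorical=True)
```

For separate conditioning, the same blanked data would instead be split by `form`, and each subset would be carried forward with only its own rotated variables plus the common part.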
We acknowledge that in contexts where there is an interest in the estimates of the regression coefficients, this approach can produce biased results (Jones 1996; Rutkowski 2011). However, this line of argument may be partially irrelevant, or less important, in the current context, where concern lies with the outcomes of analysis based upon alternatively derived plausible values, rather than upon the direct estimates of the regression coefficients. In essence, the focus here is on the generation of the plausible values themselves, and not on the estimated regression coefficients obtained from an analysis of the plausible values. So while it may indeed be unwise to use dummy coding to deal with missing data when analyzing the final dataset, it does not follow that dummy coding should not be used when generating the plausible values. Any impact of this way of treating the structurally missing data caused by the rotation will be evident in the obtained results.
An alternative treatment for missing data, which we could have implemented as a fourth scaling approach, would have been to use imputations as a means of replacing missing information with “pseudo-information.” However, imputations for missing data are model-dependent draws from the posterior distribution of random variables, conditional on the observed values of other available variables, and requiring use of estimated relationships between the variable that is missing and the remainder of the variables. In order to account for the uncertainty associated with these imputations, we would need to have multiple sets of data, a requirement that would increase the operational burden by a multiplier equal to the number of imputations (often 5). We therefore considered this approach to be a non-viable one.
Each of the above listed approaches to scaling followed the procedures that were implemented in the official OECD analyses of PISA 2006 data (OECD, 2007a). Descriptions of these approaches can be found in the PISA 2006 technical report (OECD, 2008). Our application of the three scaling approaches we used in combination with the two rotation designs (see Table 1) led to five sets of results for each of the three cognitive domains, namely mathematics, reading, and science.
Data
We purposely selected the countries that we included in our analyses because we wanted them to be fairly heterogeneous in terms of level of science performance, culture, language of instruction, and the mix of OECD and non-OECD countries. We considered that this approach would make exploration of the implications of the rotated questionnaire design in very different contexts easier and more valid. The nine countries that we eventually selected are listed in Table 5.
Results
The combination of the two rotation designs and the three scaling approaches led to the following five sets of results:

Set 1: This set of results, pertaining to the common part conditioning, is labelled “common” in the results tables in this section of the paper.

Sets 2 and 3: These two sets of results, for joint conditioning, are labelled “samecorrjoint” and “hilocorrjoint” in the results tables. The former denotes Design 1, in which the variable sets had similar correlations with performance, and the latter denotes Design 2, in which one variable set had high and one variable set had low correlations with performance.

Sets 4 and 5: These two sets of results relate to the separate conditioning. They are respectively labelled “samecorrsep” for Design 1, in which the variable sets had similar correlations with performance, and “hilocorrsep” for Design 2, in which one variable set had high and one variable set had low correlations with performance.
The results that we obtained from fitting the original PISA 2006 multilevel item response model that used all variables in the student questionnaire are labeled “original” in the results tables.
The comparisons that we report below between the results produced from the original PISA 2006 analyses and those obtained from the five rotation models are first those for the proficiency means and standard deviations, second those for the percentiles of the proficiency distributions, and third those for the correlations between proficiency and the context constructs. We considered the differences to be substantive if they exceeded the standard error of the corresponding estimate.
Means and standard deviations
The comparison of means and standard deviations between the plausible values generated from the five rotation models and the original plausible values revealed no differences of substantive importance with respect to performance in mathematics, reading, or science. The differences between the estimated means using each of the alternative rotation designs and those originally obtained are shown in Table 6 for mathematics, Table 7 for reading, and Table 8 for science. The differences for standard deviations are shown in Table 9 for mathematics, Table 10 for reading, and Table 11 for science.
We can see from Table 6 that the original PISA means for mathematics performance in the selected countries varied from 370 for Colombia to 547 for Hong Kong SAR, and the standard errors for the means were about 3.0 to 4.0 PISA points. As such, and within this context, we can consider the values reported in Table 6 to be very close to zero and therefore of no substantive importance.
In reading (Table 7), the original PISA means for the selected countries varied from 385 for Colombia to 536 for Hong Kong SAR, and the standard errors for the means in reading ranged from 2.4 (Hong Kong SAR) to 5.1 (Colombia) PISA points. Therefore, as was the case for mathematics, the differences in estimated means between the original results and the results of the alternative conditioning approaches reported in Table 7 can be considered trivial.
In science (Table 8), the original PISA means for the selected countries varied from 388 for Colombia to 542 for Hong Kong SAR, and the standard errors for the means in science ranged from 2.3 PISA points in Poland to 4.2 PISA points in the United States. Because none of the values reported in Table 8 came even close to the lower limit of the standard error of the original mean estimate, we can again consider the differences to be negligible.
In summary, the size of the reported differences between the means generated from the five rotation models and the original means indicates that essentially the same results emerged for each of the three domains.
Table 9 shows the differences between the estimated standard deviations in mathematics that resulted from each of the alternative rotation designs and those originally obtained in the PISA database. Here we can see that the original PISA standard deviations in mathematics for the selected countries varied from 84 in Jordan to 99 in Germany, and the standard errors for the standard deviations ranged from 1.2 PISA points in Poland to 2.6 PISA points in Germany. Two of the values reported for mathematics in Table 9 exceeded the upper limit of this range. Both pertained to Colombia and both related to Rotation Design 2, where the correlation between the constructs and performance was higher in one of the rotated forms than in the other form.
Table 10 shows the differences between the estimated standard deviations in reading that resulted from each of the alternative rotation designs and those originally obtained. The original PISA standard deviations in reading for the selected countries varied from 82 in Hong Kong SAR to 112 in Germany, and the standard errors for the standard deviations ranged from 1.5 PISA points in Poland to 2.8 PISA points in France. Nineteen of the values reported for reading in Table 10 exceeded the upper limit of this range.
It is noteworthy that not one of these nineteen differences in Table 10 is associated with the common part conditioning approach (common), which used only the variables in the common part of the questionnaire. In contrast, all differences exceeded the upper limit for the scaling approach in which the two questionnaire forms were used separately and where one variable set had high and one variable set had low correlations with performance (hilocorrsep).
In Table 11 (science), the original PISA standard deviations for the selected countries vary from 85 in Colombia to 107 in the United States, and the standard errors for the standard deviations range from 1.1 PISA points in Poland to 2.1 PISA points in France. Seven of the values reported for science in Table 11 exceeded the upper limit of this range. All of these differences were associated with Rotation Design 2, which means that the correlations between the variable set and performance in one of the rotated forms were consistently higher than the correlations in the other rotated form. In contrast, with respect to Rotation Design 1, where the constructs in each form had similar correlations with performance, no difference exceeded the upper limit of the standard error associated with the original estimate of the standard deviation.
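The criterion applied to Tables 9 to 11 (a difference counts as notable when it exceeds the upper limit of the range of standard errors) can be sketched as follows. This is a minimal illustration only: the country labels and difference values are invented placeholders, not values from the tables, and the function name `exceeding` is hypothetical. The only figure taken from the text is the upper limit of 2.1 PISA points for science.

```python
# Upper limit of the standard errors of the science standard deviations,
# as stated in the text (2.1 PISA points, France).
SE_UPPER_SCIENCE = 2.1


def exceeding(differences, se_upper):
    """Return the differences whose absolute size exceeds the upper SE limit."""
    return {key: diff for key, diff in differences.items() if abs(diff) > se_upper}


# Invented differences (rotated minus original standard deviation), keyed by
# (country, rotation model). These are placeholders, not results from Table 11.
differences = {
    ("Country A", "common"): 0.4,
    ("Country A", "hilocorrjoint"): -2.5,
    ("Country A", "hilocorrsep"): 3.0,
    ("Country B", "samecorrjoint"): -0.8,
}

flagged = exceeding(differences, SE_UPPER_SCIENCE)
for (country, model), diff in sorted(flagged.items()):
    print(f"{country} / {model}: {diff:+.1f} PISA points")
```

Using the absolute value treats over- and underestimation of the standard deviation alike, which matches the way the text counts exceedances without regard to sign.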
Percentiles
Although we compared, for all countries of interest, the percentiles of the distributions of the plausible values based on the five rotation models with those of the original plausible values, we decided, for the sake of brevity, to report only the results for Colombia and Poland in this paper. Our reason for this choice is that the sets of results for these two countries showed the most variance. Table 12 presents the findings for Colombia, and Table 13 the findings for Poland.
Scrutiny of these tables shows that, in general, the differences between the plausible values drawn using each of the five rotation models and the original plausible values are larger in the tails of the distributions (namely the 5th and 10th percentiles at the bottom end and the 90th and 95th percentiles at the top end) than they are in the middle of the distributions. The largest absolute difference is recorded for the estimates of the 5th percentile in reading for Colombia. However, due to the relatively large standard error associated with these estimates (i.e., 5 to 11 PISA points in reading for Colombia), we can consider the differences between this and the original estimate to be immaterial.
While none of the differences between the original estimates and the estimates based on the rotated questionnaire models is of substantive importance, it is still interesting to examine the largest difference for each domain. In mathematics, the largest difference of five PISA points is recorded several times in Tables 12 and 13. This five-point difference can be noted in Poland for the 10th percentile between the separate conditioning using Rotation Design 2 (hilocorrsep) and the original estimate. In Colombia, the difference is apparent in both the 5th and 95th percentile estimates for both the joint conditioning and the separate conditioning using Rotation Design 2 and in the 90th percentile estimate for the separate conditioning using Rotation Design 2 (hilocorrsep).
In reading, the largest difference, 17 PISA points, emerged for the 5th percentile estimate in Colombia. This difference was the one between separate conditioning with Rotation Design 2 (hilocorrsep) and the original estimate. In science, the largest difference of 11 PISA points was again found for the 5th percentile in Colombia, but this time the difference was between joint conditioning using Rotation Design 2 (hilocorrjoint) and the original estimate.
Thus, despite none of the differences in percentiles being of substantive importance, we can detect a pattern whereby differences were somewhat larger when Rotation Design 2 was involved. As a reminder, this design involved allocating constructs that were more highly correlated with performance to one of the rotated forms of the questionnaire and assigning the constructs with lower correlations with performance to the other rotated form.
Correlations with context constructs
We calculated, for six of the nine countries under review, correlations between all 28 context constructs and the plausible values drawn using each of the five rotations. These new correlations were thus based on the reduced sets of data, that is, the variables in each of the two forms using responses from only half of the students, with the other half of the student sample set to missing. We then compared these 2,520 new correlations (6 countries × 28 constructs × 5 rotation designs × 3 domains) with the correlations with the original plausible values.
Data on only 24 of these constructs were available for France, Hong Kong SAR, and the United States. In addition, due to an error in the printing of the reading booklets, no reading proficiency estimates were available for the United States. This meant that, for these three countries, 960 correlation coefficients were calculated (2 countries × 24 constructs × 5 rotation designs × 3 domains + 1 country × 24 constructs × 5 rotation designs × 2 domains), each of which was compared with the corresponding correlation coefficient between the context construct and the original plausible values, resulting in a grand total of 3,480 comparisons.
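The tally of comparisons can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the two preceding paragraphs; the function name `n_comparisons` is a hypothetical helper introduced for illustration.

```python
N_DESIGNS = 5  # five rotation models, as stated in the text


def n_comparisons(countries, constructs, domains, designs=N_DESIGNS):
    """Number of correlations: countries x constructs x designs x domains."""
    return countries * constructs * designs * domains


# Six countries with all 28 constructs and all three domains.
full = n_comparisons(countries=6, constructs=28, domains=3)

# France and Hong Kong SAR: 24 constructs, three domains.
# United States: 24 constructs, two domains (no reading estimates).
reduced = n_comparisons(countries=2, constructs=24, domains=3) + n_comparisons(
    countries=1, constructs=24, domains=2
)

total = full + reduced
print(full, reduced, total)  # 2520 960 3480
```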
Our summarizing of the resulting information involved two steps. Our intention with the first step was to find out if we could observe a general trend in terms of changes in the sizes of the coefficients between the five rotation estimates and the original estimates that used complete data on all variables. Our aim during the second step was to conduct a review at the construct level in order to identify possible patterns indicating where changes might have occurred.
During the first step, we calculated the mean of the correlations across the constructs for each domain and each country. Next, we computed the ratio of the mean correlations for the five rotation designs to the mean correlation without rotation. We then averaged the ratios over countries to obtain a grand mean ratio. Table 14 presents the results.
The table has three main sections: mathematics, reading, and science. The first column of each section gives the original correlation across all context constructs with proficiency in the respective domain, and the next five columns give the correlations across all context constructs for each of the five rotation designs:

The common part (common);

The joint scaling of the variable sets that had the same correlation with achievement (samecorrjoint);

The separate scaling of the variable sets that had the same correlation with achievement (samecorrsep);

The joint scaling of the variable sets in which one had a high and the other a low correlation with performance (hilojoint); and

The separate scaling of the variable sets in which one had a high and the other a low correlation with performance (hilosep).
If, for example, we look at Colombia in Table 14, it is apparent that the mean correlation across all of the context constructs with mathematics is the same for the original as well as for all five rotation designs, namely 0.06. The same applies in science, where the mean correlation across all context constructs and performance is 0.07, regardless of the rotation design. Slight differences only are apparent in reading, where the original average correlation between all background constructs and performance is 0.05, but is lower (0.04) for the design with the separate conditioning of variable sets with the same correlation (samecorrsep) and higher (0.06) for both of the designs that involved the use of joint conditioning. However, the size of these differences is not substantive.
The second row for each country in Table 14 sets out the differences in the form of ratios between the original correlation and the correlation estimate based on the five rotated designs. In many instances, the ratios are 0.99, 1.00, or 1.01, indicating very small differences between the estimates.
In combination, these results reveal no pattern of upward or downward change between the correlation estimates from the rotation designs when compared with the original correlation estimates across the very different countries in the analyses. Thus, for example, the differences were no more pronounced in a country with relatively higher correlations between the context constructs and performance, such as France, than they were in a country with lower correlations, such as Colombia.
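The first summarization step (mean correlation per country and domain, ratio of each rotation model's mean to the original mean, then a grand mean of the ratios over countries) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the correlations are invented, the two countries are placeholders, only two of the five rotation models are shown, and the function names are hypothetical. The text does not say whether signed or absolute correlations were averaged, so the sketch simply averages whatever values it is given.

```python
from statistics import mean


def ratios_to_original(original, rotated):
    """Ratio of each rotation model's mean correlation to the original mean."""
    base = mean(original)
    return {model: mean(values) / base for model, values in rotated.items()}


def grand_mean_ratios(per_country):
    """Average the per-country ratios separately for each rotation model."""
    models = next(iter(per_country.values())).keys()
    return {m: mean(ratios[m] for ratios in per_country.values()) for m in models}


# Invented correlations for three constructs in one domain (placeholders).
country_a = ratios_to_original(
    original=[0.10, 0.20, 0.30],
    rotated={"common": [0.10, 0.20, 0.30], "hilocorrsep": [0.09, 0.19, 0.26]},
)
country_b = ratios_to_original(
    original=[0.05, 0.10, 0.15],
    rotated={"common": [0.05, 0.10, 0.15], "hilocorrsep": [0.05, 0.10, 0.12]},
)

grand = grand_mean_ratios({"Country A": country_a, "Country B": country_b})
print({model: round(value, 2) for model, value in grand.items()})
```

A ratio close to 1.00 indicates that the rotation model reproduces the original mean correlation; values below 1.00 indicate attenuation.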
During the second results-summarization step (taken with the aim of reviewing the results at the construct level), we recorded only the 82 instances of the 3,480 correlations where the absolute differences between the correlation coefficients exceeded 0.03. This decision was based on the fact that such a difference would exceed the standard errors of the corresponding estimates, which, in PISA, are usually less than 0.02. Details concerning the differences that emerged appear in Table 15 for mathematics, Table 16 for reading, and Table 17 for science.
As can be seen, the number of correlation coefficients with context constructs exhibiting differences between the rotation results and the original results varies across the three subject domains. The smallest number of differences is recorded for mathematics, a minor domain in PISA 2006. Here, only 19 differences are larger than 0.03. For science, 28 differences exceed that size. The domain recording the most differences is reading. The only sizeable difference for a country is that for Hong Kong SAR. Across the countries, differences are apparent for between 4 (Russian Federation) and 13 constructs (Colombia), with a total of 63 differences exceeding 0.03 in reading.
In order to investigate whether any of the constructs or rotation results was more prone than others to being involved in the differences, we summed their occurrences across countries and domains. Awareness of environmental issues (ENVAWARE) was the construct for which most of the differences (15) were recorded. Results for the rotation plausible values based on the common part conditioning showed the smallest number of differences (i.e., five) compared with the original plausible values. In contrast, the largest number of differences involved the rotation results for the 41 occurrences recorded for separate conditioning using Rotation Design 2 and the 35 occurrences recorded for joint conditioning using Rotation Design 2. Hence, the rotation results based on the design in which one form contained constructs that were more highly correlated with performance and in which the other form contained constructs that correlated less with achievement seemed more likely than the other three rotation results to differ from the original results.
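The second summarization step (flagging absolute differences larger than 0.03 and then summing occurrences by construct and by rotation model) can be sketched as follows. The record layout, the country labels, and all correlation values are invented for illustration; only the 0.03 threshold, the construct names ENVAWARE and GENSCIE, and the rotation model labels come from the text.

```python
from collections import Counter

THRESHOLD = 0.03  # absolute difference in correlation, as stated in the text


def flag_differences(records, threshold=THRESHOLD):
    """Keep the records whose rotated and original correlations differ by more
    than the threshold in absolute value."""
    return [
        rec for rec in records
        if abs(rec["r_rotated"] - rec["r_original"]) > threshold
    ]


# Invented records: one per country x domain x construct x rotation model.
records = [
    {"country": "A", "domain": "reading", "construct": "ENVAWARE",
     "model": "hilocorrsep", "r_original": 0.30, "r_rotated": 0.25},
    {"country": "A", "domain": "reading", "construct": "ENVAWARE",
     "model": "common", "r_original": 0.30, "r_rotated": 0.29},
    {"country": "B", "domain": "science", "construct": "ENVAWARE",
     "model": "hilocorrjoint", "r_original": 0.20, "r_rotated": 0.15},
    {"country": "B", "domain": "mathematics", "construct": "GENSCIE",
     "model": "hilocorrsep", "r_original": 0.10, "r_rotated": 0.15},
]

flagged = flag_differences(records)
by_construct = Counter(rec["construct"] for rec in flagged)
by_model = Counter(rec["model"] for rec in flagged)
print(by_construct.most_common())
print(by_model.most_common())
```

Tallying by construct and by rotation model separately mirrors the two questions asked in the text: which constructs and which rotation results are most often involved in the differences.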
With three exceptions, the correlation coefficients were higher between the constructs and the original results than between the constructs and the rotation results. The exceptions included the construct measuring the general value of science (GENSCIE) and its correlation with mathematics performance in Hong Kong SAR and for the correlations between that same construct and the two different reading proficiency estimates in Germany.
A final finding was that the estimated correlations between the context constructs and performance tended to be smaller for the plausible values that were generated from the five rotation models compared with those generated for the original plausible values.
Discussion and conclusions
As Rutkowski (2011) notes, despite the fact that latent regression is well established both theoretically and practically as an analytic approach in sample surveys, there is a dearth of literature concerning various implications of and threats to the application of this methodology. This paper has added to that literature by exploring some of the implications of using rotated contextual questionnaires for respondents so as to expand the coverage of contextual variables while still placing a reasonable limitation on respondent time.
Our modeling, using PISA 2006 data, of the potential impact of the use of rotated questionnaires revealed very similar results regardless of whether we scaled the data using rotated context questionnaires or nonrotated questionnaires. Indeed, differences in terms of estimated means, standard deviations, and percentiles tended to be slight and were therefore nearly all of no substantive importance. Likewise, our comparison of mean correlations across all context constructs with performance between rotation and original results and the corresponding ratios of differences revealed no general upward or downward trend in estimates.
Our analyses furthermore revealed very few substantive differences between the plausible values generated from the rotation models and the original plausible values. This outcome leads to the following conclusions with respect to (a) the possibility of developing a methodology using rotated context questionnaires, (b) the possible impact on the continuity of results, and (c) correlations between context variables and performance.
First, the research shows that it is possible to develop a methodology that uses rotated contextual questionnaires in conjunction with multilevel item response models. The three approaches to scaling that we explored in this paper involved common part conditioning, joint conditioning, and separate conditioning. The common part conditioning used information from only the variables in the common part of the questionnaire, the joint conditioning employed information from the rotated parts jointly, and the separate conditioning used information from the rotated parts separately. We paired these latter two approaches with the two questionnaire rotation designs. In Design 1, the constructs in each rotated form showed similar correlations with performance. In Design 2, the constructs were assigned to forms in such a way that the constructs in one form had relatively higher correlations with performance whereas the constructs in the other form had relatively lower correlations with performance.
Second, the comparison of the results from these five rotation models and the original results showed little, if any, impact on the continuity of PISA results in terms of means, standard deviations, and percentiles. This meant that we found no substantive differences between estimates of the mean based on the original plausible values and estimates of the mean based on the plausible values obtained from the five models using a rotated student context questionnaire design in mathematics, reading, or science.
We found a number of relatively robust differences between the standard deviations based on original plausible values and those based on plausible values generated from the questionnaire rotation models. The large majority of these instances emerged in the minor domain of reading and with respect to Rotation Design 2, in which we assigned constructs to forms in such a way that the constructs in one form had relatively higher correlations with performance than the constructs in the other form. Our comparison of the percentiles of the distributions of the plausible values based on the five rotation models and the original plausible values revealed no substantive differences in any of the domains.
Third, our comparison of the estimated correlation coefficients between the context variables and the five rotation plausible values on the one hand and the original plausible values on the other hand revealed some nontrivial differences, ranging from 19 differences in mathematics to 63 differences in reading. Most of these differences were associated with the separate conditioning approach used in conjunction with Design 2 (hilocorrsep). This evidence, combined with the results for the standard deviations, suggests a preference for the Design 1 approach, where constructs are assigned to rotated forms in such a way that their correlations with performance are similar across forms.
There might have been an expectation that excluding a variable from the conditioning model would bias the subsequent estimates of the correlation between that variable and outcomes toward zero, given that the bias is a function of the marginal explanatory power of that variable (see, for example, Mislevy 1991). However, we did not exclude variables from the conditioning during our current analyses and we did not, of course, neglect to carry out conditioning. Indeed, all of the rotation designs that we examined involved some form of conditioning.
Of note with regard to the four designs in which we used responses from only half of the respondents is the fact that we conducted the conditioning using the data from this half and then set the other half to missing, thereby ignoring this information during the analyses. In other words, these four designs excluded from the analyses students who had not responded to the questions, which meant that information relating to them was absent from the conditioning. In this sense, the rotation designs paralleled the original analyses, which included all students who had responded to these questions and included all of this information in the conditioning. In this respect, the only difference between these analyses and ours is that our estimates were based on a smaller number of cases, namely half, which meant that any reduction in the size of estimates would only be a consequence of the smaller number of cases available in the analyses.
Another conclusion that can be drawn from our results is that additional information obtained from the context questionnaire adds very little to the estimation of the latent variable, and consequently has a negligible influence on the plausible values. This, of course, is not surprising because nearly all of the information concerning the latent variables for each respondent came from the two hours of cognitive testing in the content domains during which the PISA latent scales were measured with high reliability.
Indeed, the robustness of the results from our scaling approach that used the common part of the context questionnaires indicates that, for the purpose of obtaining plausible values, the questions in those questionnaires could be reduced to a core set, such as gender, parental education and occupation, migration, home language, and home possessions. However, in PISA, the contextual variables are not included in the scaling as a means of improving the reliability of the plausible values. Rather, they are included so that the contextual factors possibly associated with performance can be subsequently analyzed. Importantly, what we are showing here is the potential that a rotated design has to broaden the range of contextual variables included, and therefore increase the relevance of the assessment for policymakers, educators, and researchers, while simultaneously allowing the respondent time to be kept to approximately 30 minutes, a length consistent with most current large-scale assessments.
In terms of which particular rotation design might be preferable, our findings indicate that the population parameter estimates based on Rotation Design 2 (i.e., one form containing constructs that are more highly correlated with achievement and the other form containing constructs that correlate less with achievement) are more prone to differ from the estimates based on the original plausible values than are the estimates based on plausible values generated using the other rotation models considered in this paper. Thus, it would seem desirable to assign constructs to forms in such a way that the constructs in each rotated form correlate similarly with performance. This could be achieved by basing the assignment of constructs to forms on the results obtained from field trials.
Finally, while further work using other datasets and other types of analyses seems desirable to provide further evidence, the outcomes of our research support using rotated contextual questionnaires for respondents in order to extend the methodology currently used in large-scale sample surveys. We consider that such an extension presents good news for researchers and respondents alike, because it would permit a broadening of coverage, a reduction in response time, or both.
Endnotes
^{a}Examples include the Organisation for Economic Co-operation and Development’s (OECD) Programme for International Student Assessment (PISA), the International Association for the Evaluation of Educational Achievement’s (IEA) Trends in International Mathematics and Science Study (TIMSS), and the US National Assessment of Educational Progress (NAEP).
^{b}Such tools are, however, available in the public domain and fully described in the literature (Fox & Glas 2002; Sinharay & von Davier 2005; Volodin & Adams 1997; Wu, Adams, & Wilson 1997).
References
Adams RJ: Scaling PISA cognitive data. In PISA 2000 technical report. Edited by: Adams RJ, Wu ML. Paris, France: OECD Publications; 2002.
Adams RJ, Wilson MR, Wang WC: The multidimensional random coefficients multinomial logit model. Appl Psychol Meas 1997a, 21: 1–23. 10.1177/0146621697211001
Adams RJ, Wilson MR, Wu ML: Multilevel item response modelling: An approach to errors in variables regression. J Educ Behav Stat 1997b, 22: 47–76.
Adams RJ, Wu ML: The mixed-coefficient multinomial logit model: A generalized form of the Rasch model. In Multivariate and mixture distribution Rasch models: Extensions and applications. Edited by: von Davier M, Carstensen CH. New York, NY: Springer; 2007:57–76.
Adams RJ, Wu ML, Carstensen CH: Application of multivariate Rasch models in international large scale educational assessment. In Multivariate and mixture distribution Rasch models: Extensions and applications. Edited by: von Davier M, Carstensen CH. New York, NY: Springer; 2007:271–280.
Adams RJ, Wu ML, Macaskill G: Scaling methodology and procedures for the mathematics and science scales. Implementation and analysis. In TIMSS technical report, Vol. II. Edited by: Martin MO, Kelly DL. Chestnut Hill, MA: Boston College; 1997c:111–145.
Anderson TW: An introduction to multivariate statistical analysis. New York, NY: John Wiley & Sons; 1984.
Beaton AE: Implementing the new design: The NAEP 1983–84 technical report (Report No. 15TR20). Princeton, NJ: Educational Testing Service; 1987.
Beaton AE, Gonzalez EJ: The NAEP primer. Center for the Study of Testing, Evaluation, and Educational Policy. Chestnut Hill, MA: Boston College; 1995.
Cochran WG: Errors of measurement in statistics. Technometrics 1968, 10: 637–666. 10.2307/1267450
Fox JP, Glas CAW: Modeling measurement error in a structural multilevel model. In Latent variable and latent structure models. Edited by: Marcoulides GA, Moustaki I. London, UK: Lawrence Erlbaum Associates; 2002:245–269.
Fuller WA: Measurement error models. New York, NY: John Wiley & Sons; 1987.
Gleser LJ: Estimation in a multivariate errors-in-variables regression model: Large sample results. Ann Stat 1981, 9: 24–44. 10.1214/aos/1176345330
Gonzalez JM, Eltinge JL: Multiple matrix sampling: A review. In Proceedings of the Section on Survey Research Methods, American Statistical Association. Alexandria, VA: American Statistical Association; 2007a:3069–3075.
Gonzalez JM, Eltinge JL: Properties of alternative sample design and estimation methods for the consumer expenditure surveys. Arlington, VA: Paper presented at the 2007 Research Conference of the Federal Committee on Statistical Methodology; 2007b.
Gonzalez EJ, Galia J, Li I: Scaling methods and procedures for the TIMSS 2003 mathematics and science scales. In TIMSS 2003 technical report. Edited by: Martin MO, Mullis IVS, Chrostowski SJ. Chestnut Hill, MA: Boston College; 2004.
Jones M: Indicator and stratification methods for missing explanatory variables in multiple linear regression. J Am Stat Assoc 1996,91(433):222–230. 10.1080/01621459.1996.10476680
Jöreskog KG, Sörbom D: LISREL VI: Analysis of linear structural relationships by maximum likelihood, instrumental variables and least squares methods. Mooresville, IN: Scientific Software; 1984.
Macaskill G, Adams RJ, Wu ML: Scaling methodology and procedures for the mathematics and science literacy, advanced mathematics and physics scales. In Third International Mathematics and Science Study, technical report: Vol. 3. Implementation and analysis. Edited by: Martin M, Kelly DL. Chestnut Hill, MA: Boston College; 1998.
Mislevy RJ: Estimation of latent group effects. J Am Stat Assoc 1985, 80: 993–997. 10.1080/01621459.1985.10478215
Mislevy RJ: Scaling procedures. In Focusing the new design: The NAEP 1988 technical report (No. 19TR20, pp. 229–250). Edited by: Johnson EG, Zwick R. Princeton, NJ: Educational Testing Service; 1990.
Mislevy RJ: Randomization-based inference about latent variables from complex samples. Psychometrika 1991,56(2):177–196. 10.1007/BF02294457
Muthén BO: Beyond SEM: General latent variable modelling. Behaviormetrika 2002,29(1):81–117. 10.2333/bhmk.29.81
Organisation for Economic Co-operation and Development (OECD): Knowledge and skills for life: First results from PISA 2000. Paris, France: OECD Publications; 2001.
Organisation for Economic Co-operation and Development (OECD): Learning for tomorrow’s world: First results from PISA 2003. Paris, France: OECD Publications; 2004.
Organisation for Economic Co-operation and Development (OECD): PISA 2006: Science competencies for tomorrow’s world. Paris, France: OECD Publications; 2007a.
Organisation for Economic Co-operation and Development (OECD): Database PISA 2006. 2007b. Available online at http://pisa2006.acer.edu.au/downloads.php
Organisation for Economic Co-operation and Development (OECD): PISA 2006: Technical report. Paris, France: OECD Publications; 2008.
Organisation for Economic Co-operation and Development (OECD): PISA 2009 results: What students know and can do (Vol. 1). Paris, France: OECD Publications; 2010.
Pedhazur EJ: Multiple regression in behavioral research. 3rd edition. Orlando, FL: Harcourt Brace; 1997.
Rubin DB: Inference and missing data. Biometrika 1976,63(3):581–592. 10.1093/biomet/63.3.581
Rubin DB: Multiple imputation for nonresponse in surveys. New York, NY: John Wiley & Sons; 1987.
Rutkowski L: The impact of missing background data on subpopulation estimation. J Educ Meas 2011,48(3):293–312. 10.1111/j.1745-3984.2011.00144.x
Shoemaker DM: Principles and procedures of multiple matrix sampling. Cambridge, MA: Ballinger Publishing Company; 1973.
Sinharay S, von Davier M: Extension of the NAEP BGROUP program to higher dimensions (RR05–27). Princeton, NJ: Educational Testing Service; 2005.
Volodin N, Adams RJ: The estimation of polytomous item response models with many dimensions. Paper presented at the Annual Meeting of the Psychometric Society. Gatlinburg, TN; 1997.
Wu ML, Adams RJ, Wilson MR: ConQuest: Multiaspect test software [Computer program]. Camberwell, VIC, Australia: Australian Council for Educational Research; 1997.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
RJA, AB and PL designed the analyses, carried them out and prepared the manuscript. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Keywords
 Multilevel item response models
 Questionnaire design