A common goal of sample surveys is to measure a latent variable (a proficiency, an aptitude, an attitude, or the like) and then relate that latent variable to other characteristics of the respondents. For example, in an educational context, the relationship that is examined might be the correlation between the latent variable and another characteristic, such as years of schooling, or it might be between-group differences in mean scores on a latent variable. The ultimate aim of the survey is to examine the distribution of the latent variable in the target population and to make inferences concerning the relationships between latent variables and other variables in that population. In psychometrics, the science of constructing measures of latent variables, it is generally accepted that measures of latent variables are fallible and include random error components that must be taken into account when such inferences are made. Cochran (1968), for example, argues that when measurement error in latent variables is ignored, most statistical tests are vitiated.
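To illustrate the point, the brief simulation below, a sketch using purely hypothetical numbers rather than survey data, shows the classical attenuation effect: the correlation between a fallible measure of a latent variable and another characteristic is systematically smaller than the correlation involving the latent variable itself, by a factor equal to the square root of the measure's reliability.

```python
# Minimal simulation (hypothetical numbers, not survey data) of attenuation:
# replacing a latent variable with a fallible, error-contaminated measure
# shrinks its observed correlation with another characteristic.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

theta = rng.normal(size=n)                          # latent variable (e.g., proficiency)
schooling = 0.6 * theta + 0.8 * rng.normal(size=n)  # a related characteristic
measure = theta + rng.normal(size=n)                # fallible measure, reliability = 0.5

r_latent = np.corrcoef(theta, schooling)[0, 1]
r_fallible = np.corrcoef(measure, schooling)[0, 1]   # approx. r_latent * sqrt(0.5)
print(f"correlation with latent variable:  {r_latent:.3f}")
print(f"correlation with fallible measure: {r_fallible:.3f}")
```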
The study of statistical models with errors in variables is a well-developed area of statistical and psychometric inquiry. Its extensive body of literature dates back to at least Adcock (1878, cited in Gleser, 1981) and extends through Gleser (1981), Anderson (1984), Mislevy (1985), Fuller (1987), and Adams, Wilson, and Wu (1997). Econometricians were the first to study errors-in-variables models extensively, and their use in econometrics became widespread (Anderson, 1984). In psychological and educational research, the presence of substantial measurement error resulted in the development of linear structural relation (or LISREL) models (see, for example, Jöreskog & Sörbom, 1984; Muthén, 2002) and latent regression (also known as multilevel item response theory) models (Adams, Wu, & Carstensen, 2007; Fox & Glas, 2002).
In the context of large-scale sample survey studies, multilevel item response theory models have been the method of choice for investigators undertaking appropriate data analysis in the presence of measurement error. There appear to be three primary reasons for this choice. First, the models are scalable; that is, they have been demonstrated to work well in contexts with many thousands of sampled respondents, many latent variables, and hundreds of manifest variables. Second, they can be integrated with other key components of sample survey methodology, in particular the weighting and sampling variance estimation that is required in structured multistage samples. And, third, they can be broken into discrete steps, so that the study developers can construct a database and secondary analysts can then use standard and readily accessible analytic tools to analyze the data in ways that properly deal with the impact of measurement error (Adams, 2002; Adams, Wu, & Macaskill, 2007; Gonzalez, Galia, & Li, 2004; Mislevy, 1990).
Researchers exploring PISA, NAEP, and TIMSS data have used the multilevel item response theory approach to examine the relationships between a small number of latent proficiency variables, for example, three to seven such variables in the case of PISA, and quite a large number of other variables collected via respondent contextual questionnaires. To ensure adequate content coverage of the latent proficiency variables, PISA, NAEP, and TIMSS all use multiple linked test booklets, which means that although each respondent responds to just 60 (NAEP) to 120 (PISA) minutes of assessment material, the total amount of assessment material used far exceeds what any single respondent sees.
As noted, each of these studies routinely uses linked (or rotated) assessment booklets, a process often referred to as a multiple-matrix sampling design (Shoemaker 1973). However, in order to broaden the assessment while limiting individual response burden, the studies rely on a single set of contextual variables being administered to all respondents. No attempt, as far as we are aware, has been made thus far to apply such a rotated design to the context questionnaires and thereby extend the number of contextual variables beyond that which can be obtained from a single common questionnaire administered to all respondents. Gonzalez and Eltinge (2007a, 2007b), however, have discussed the possibility of using rotated questionnaires in the US Consumer Expenditure Quarterly Interview.
In this paper, we explore the possibility of administering rotated context questionnaires to respondents in order to expand the coverage of contextual variables in sample surveys that employ multilevel item response theory scaling models. In addition, we examine how a changed methodology might affect the continuity of results with respect not only to the latent proficiency variables themselves but also to their correlations with the context constructs. The specific context for our work is the PISA survey.
Although the idea of having rotated forms of the respondent context questionnaires in order to extend their content coverage is appealing, the situation for these questionnaires differs slightly from that for the test booklets. To illustrate this difference, we provide an overview of the PISA analysis approach and then follow it with an explanation of the difference between using data from the test booklets and using data from the respondent context questionnaire in the multilevel scaling model used in PISA.
The PISA analysis approach
PISA is a cyclical cross-sectional study, with data collections occurring every three years. Four PISA assessments have now been completed (OECD 2001, 2004, 2007, 2010) and a fifth is being implemented. Here we discuss the third cycle of PISA, the data collection that occurred in 2006 (referred to as PISA 2006). Our focus on the third cycle reflects our decision to use data from PISA 2006 to explore the potential use of rotated questionnaires.
PISA 2006 tested three subject domains, with science as the major domain and reading and mathematics as the minor domains. PISA allocates more assessment time to a major domain than it does to the minor domains, and typically reports subscales for major domains but not for minor ones. During PISA 2006, 108 test items, representing approximately 210 minutes of testing time, were used to assess student achievement in science. The reading assessment consisted of 28 items, and the mathematics assessment consisted of 48 items, representing approximately 60 minutes of testing time for reading and 120 minutes for mathematics.
The 184 main survey items were allocated to 13 mutually exclusive 30-minute item clusters: seven science clusters, four mathematics clusters, and two reading clusters. Thirteen test booklets were produced, each composed of four clusters according to a rotated design. Each booklet therefore contained 120 minutes of assessment material, arranged in two 60-minute parts of two clusters each, with students allowed a short break 60 minutes after the start of the test.
The booklet design was such that each cluster appeared in each of the four possible positions within a booklet exactly once, and each cluster occurred once in conjunction with each of the other clusters. Each test item, therefore, appeared in four of the test booklets. This linked design made it possible, when estimating item difficulties and student proficiencies, to apply standard measurement techniques to the resulting student response data (OECD, 2008). Student performance results were reported in terms of one overall scale in science, five science subscales, one overall scale for mathematics, and one overall scale for reading.
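To make these balance properties concrete, the sketch below builds one hypothetical design of this type, allocating 13 clusters to 13 four-cluster booklets cyclically from the perfect difference set {0, 1, 3, 9} modulo 13, and verifies both properties. It illustrates the class of design only; it is not the actual PISA 2006 cluster-to-booklet allocation.

```python
# Hypothetical cyclic allocation of 13 clusters to 13 four-cluster booklets,
# built from the perfect difference set {0, 1, 3, 9} mod 13. This illustrates
# the balance properties described in the text, not the actual PISA 2006 design.
from collections import Counter
from itertools import combinations

N_CLUSTERS = 13
OFFSETS = (0, 1, 3, 9)  # perfect difference set modulo 13

# Booklet b contains clusters b, b+1, b+3, b+9 (mod 13), in that order.
booklets = [[(b + d) % N_CLUSTERS for d in OFFSETS] for b in range(N_CLUSTERS)]

# Property 1: each cluster occupies each of the four positions exactly once.
for position in range(len(OFFSETS)):
    assert sorted(bk[position] for bk in booklets) == list(range(N_CLUSTERS))

# Property 2: each pair of clusters appears together in exactly one booklet.
pair_counts = Counter(frozenset(pair) for bk in booklets for pair in combinations(bk, 2))
assert len(pair_counts) == N_CLUSTERS * (N_CLUSTERS - 1) // 2
assert set(pair_counts.values()) == {1}

for i, bk in enumerate(booklets, start=1):
    print(f"Booklet {i:2d}: clusters {[c + 1 for c in bk]}")
```

Because each cluster appears once in each of the four positions, every cluster, and hence every item it contains, appears in exactly four booklets, as noted above.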
Fitting a multilevel item response model
The PISA research team used the mixed coefficients multinomial logit (MCML) model, as described by Adams, Wilson and Wang (1997a) and Adams and Wu (1997), to scale the 2006 data, and they used the ConQuest software (Wu et al. 1997) to carry out the process. Details of the scaling can be found in Adams (2002). We provide a limited sketch of the process here so as to contextualize the extension to the methodology that we explore in this paper.
The multilevel scaling model used consists of two components: a conditional item response model, $f_x(\mathbf{x}; \xi \mid \theta)$, and a population model, $f_\theta(\theta; \gamma, \Sigma, \mathbf{W})$. The conditional item response model describes the relationship between the observed item response vector x and the latent variables, θ. The ξ parameters characterize the items. The population model, which describes the distribution of the latent variables and the relationship between the contextual variables and the latent variables, is a multivariate multiple regression model, where γ are the regression coefficients that are estimated, Σ is the conditional covariance matrix, and W are the contextual variables.
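In its usual multivariate normal (latent regression) form, this population model can be written for respondent $n$ as
$$\theta_n = \gamma \mathbf{W}_n + \varepsilon_n, \qquad \varepsilon_n \sim N(\mathbf{0}, \Sigma),$$
so that $f_\theta(\theta; \gamma, \Sigma, \mathbf{W})$ is a multivariate normal density with mean $\gamma \mathbf{W}_n$ and conditional covariance $\Sigma$.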
The conditional item response model and the population model are combined to obtain the unconditional, or marginal, item response model:
$$f_x(\mathbf{x}; \xi, \gamma, \Sigma) = \int_{\theta} f_x(\mathbf{x}; \xi \mid \theta)\, f_\theta(\theta; \gamma, \Sigma, \mathbf{W})\, d\theta. \qquad (1)$$
It is important to recognize that, under this model, the locations of respondents on the latent variables are not estimated. The parameters of the model are γ, Σ, and ξ, where γ and Σ are the population parameters and ξ are the item parameters.
Directly estimating γ and Σ from the item response vectors bypasses the problem of having fallible estimates of latent proficiencies, that is, the problems caused by measurement error discussed in the introduction. This approach also yields unbiased estimates of population characteristics, assuming, of course, that the data satisfy the assumptions of the scaling and regression models.
The item response model used in (1) does not require the same complete list of item responses for all respondents. So, provided that the item response data are missing at random (Rubin 1976), which is the case with the rotated test booklet designs used in PISA, this model is well suited to incomplete designs. However, this is not the case for the population model, which in its PISA implementation requires complete data.
Plausible values
Currently, only a limited number of researchers are able to implement methodologies that permit the estimation of γ and Σ for the set of contextual variables that are of interest to them. Therefore, to support further analysis, PISA uses the imputation methodology usually referred to as plausible values (Mislevy 1991) during construction of its public access databases.
Plausible values are intermediate values that are used in the algorithm that is implemented in ConQuest to estimate the parameters of (1) (Volodin & Adams 1997). PISA plausible values are sets of imputed proficiencies that are provided, per respondent, for all latent variables included in the scaling. They are thus random draws from the estimated posterior proficiency distribution for each student. Adams (2002) details how the random draws are made. The theory supporting the use of the plausible value approach can be found in Rubin (1987) and Mislevy (1991); Beaton and Gonzalez (1995) provide an overview of how plausible values should be used.
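To make the idea concrete, the sketch below draws plausible values from a grid approximation to one respondent's posterior proficiency distribution. It is a deliberately simplified toy: one latent dimension, a handful of dichotomous Rasch-type items, and hypothetical responses, difficulties, and population-model parameters, whereas PISA's operational procedure is multidimensional, implemented in ConQuest, and conditions on the variables described below.

```python
# Toy illustration of a plausible-value draw: a random draw from a grid
# approximation to the posterior formed by the item response likelihood
# (Rasch-type dichotomous items) and a normal population model.
# All responses, difficulties, and population parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(7)

def draw_plausible_values(responses, difficulties, mu, sigma, n_draws=5):
    grid = np.linspace(mu - 5 * sigma, mu + 5 * sigma, 401)
    # Probability of a correct response at each grid point, for each item.
    p_correct = 1.0 / (1.0 + np.exp(-(grid[:, None] - difficulties[None, :])))
    likelihood = np.prod(np.where(responses[None, :] == 1, p_correct, 1.0 - p_correct), axis=1)
    prior = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)   # population model for this respondent
    posterior = likelihood * prior
    posterior /= posterior.sum()
    return rng.choice(grid, size=n_draws, p=posterior)

responses = np.array([1, 0, 1, 1, 0, 1])                 # scored item responses (toy)
difficulties = np.array([-1.0, 0.2, -0.5, 0.8, 1.5, 0.0])
print(draw_plausible_values(responses, difficulties, mu=0.3, sigma=1.0))
```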
A key feature of plausible values is that they allow the results obtained from fitting (1), in particular the regression coefficients γ and Σ, to be recovered without the need to access the specialist software required to fit model (1). They can also be used to estimate the parameters of any submodel of the regression model used in (1).
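In practice, a secondary analyst computes the statistic of interest once with each of the five plausible values and combines the results using Rubin's (1987) rules: the final estimate is the average of the five estimates, and its error variance is the average sampling variance plus (1 + 1/5) times the variance among the five estimates. A minimal sketch, with hypothetical numbers, follows.

```python
# Sketch of Rubin's (1987) combining rules for M plausible-value analyses.
# The estimates and sampling variances below are hypothetical; in practice each
# would come from, e.g., a weighted analysis run on one plausible value.
import numpy as np

def combine_plausible_values(estimates, sampling_variances):
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(sampling_variances, dtype=float)
    m = q.size
    q_bar = q.mean()                          # final point estimate
    u_bar = u.mean()                          # average within-imputation (sampling) variance
    b = q.var(ddof=1)                         # between-imputation (measurement) variance
    total_variance = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, float(np.sqrt(total_variance))

estimate, std_error = combine_plausible_values(
    [502.1, 499.4, 503.0, 500.8, 501.5],      # same statistic, one value per plausible value
    [4.4, 4.1, 4.6, 4.2, 4.3],                # corresponding sampling variances
)
print(f"combined estimate = {estimate:.1f}, standard error = {std_error:.2f}")
```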
In PISA, the regression parameters in (1) are estimated on a country-by-country basis, and plausible values are likewise drawn country by country. Before this model can be estimated, however, it is also necessary to select the contextual variables, W, that will be used in each country. In PISA and NAEP, these variables are referred to as conditioning variables. The steps used to prepare the conditioning variables in PISA are based upon those used in NAEP (Beaton 1987) as well as in TIMSS (Macaskill, Adams, & Wu, 1998), and are given below:
- Step 1: Three variables (booklet ID, school ID, and gender) are prepared so they can be directly used as conditioning variables. Variables for booklet ID are represented by deviation contrast codes (Pedhazur 1997). Each booklet other than a reference booklet is represented by one variable. Variables for school ID are coded using simple contrast codes, with the largest school as the reference school.
- Step 2: Each categorical variable in the student questionnaire is dummy coded. Details of this dummy coding can be found in the PISA 2006 technical report (OECD 2008). For variables treated as continuous (including questionnaire indices constructed using item response theory), missing values are replaced with the country mean, and a dummy variable indicating a missing response is created.
- Step 3: For each country, a principal components analysis of the dummy-coded categorical and continuous variables is performed, and component scores are produced for each student (the number of components retained is sufficient to account for 95% of the variance in the original variables). A schematic sketch of Steps 2 and 3 is given after this list.
- Step 4: The item response model is fitted to each national dataset. The national population parameters are estimated using item parameters anchored at their international locations, with conditioning variables derived from the national principal components analysis and from Step 1.
- Step 5: Five vectors of plausible values are drawn using the method described above. The vectors provide a plausible value for each of the PISA 2006 reporting scales.
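As the schematic sketch of Steps 2 and 3 referenced above, the code below dummy-codes categorical variables, mean-imputes continuous variables while flagging missingness, and retains the principal components that explain 95% of the variance. It assumes pandas and scikit-learn are available, and the variable names and toy data are hypothetical rather than the actual PISA questionnaire variables.

```python
# Schematic sketch of Steps 2-3 (dummy coding, mean imputation with missing
# indicators, PCA retaining 95% of the variance). Column names and data are
# hypothetical; this is not the operational PISA conditioning procedure.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def prepare_conditioning_scores(df, categorical_cols, continuous_cols):
    pieces = []
    # Step 2a: dummy-code categorical questionnaire variables.
    pieces.append(pd.get_dummies(df[categorical_cols].astype("category"),
                                 dummy_na=True, dtype=float))
    # Step 2b: for continuous variables, flag missingness and impute the (country) mean.
    for col in continuous_cols:
        x = df[col]
        pieces.append(pd.DataFrame({
            f"{col}_missing": x.isna().astype(float),
            col: x.fillna(x.mean()),
        }))
    design = pd.concat(pieces, axis=1)
    # Step 3: principal components retaining enough components for 95% of the variance.
    standardized = (design - design.mean()) / design.std(ddof=0).replace(0, 1.0)
    pca = PCA(n_components=0.95, svd_solver="full")
    return pca.fit_transform(standardized)

# Hypothetical usage with a toy data frame.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "gender": rng.choice(["female", "male"], size=200),
    "escs": np.where(rng.random(200) < 0.1, np.nan, rng.normal(size=200)),
})
print(prepare_conditioning_scores(toy, ["gender"], ["escs"]).shape)
```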
The pool of candidate variables for W consists of the variables in the student contextual questionnaire. Until now, these have been limited to the number of variables that can be obtained through the administration of a single 30-minute contextual questionnaire to all respondents. This situation raises the question of whether the multilevel item response theory methodology, which easily handles rotated assessment booklets, can be implemented with rotated contextual questionnaires. The question is asked because, in the case of rotated contextual variables, the set of candidate variables for W will differ for different respondents.
The motivation for using rotated questionnaires to extend the coverage of the student questionnaire stems from several related considerations. First, just as with the multiple-matrix sampling of the test, it is desirable to limit the time each student spends completing a questionnaire while increasing the overall content coverage of the questionnaires. Second, when several cognitive domains are assessed, the number of variables or constructs thought to be related to performance across those domains is larger than it would be if only one domain were assessed. Third, given that in most countries the variance in performance between students exceeds the variance in performance between schools, it is necessary to include in the student questionnaire a greater number of variables that can subsequently be used to describe performance differences between students.
However, before going down the path of rotating context questionnaires in PISA, we need to address three questions:
1. Is it possible to develop a methodology that uses rotated contextual questionnaires?

2. Will a change in the methodology to rotated questionnaires have an impact on the continuity of PISA results?

3. Will such a change affect the estimated relationships between the context variables and performance?
In the following section, we explore five alternative approaches to allocating contextual variables to rotated booklets. We also examine the effects of these approaches on the estimated distributions of the latent proficiencies. More specifically, we direct our analyses toward an examination of how the means and distributions of the latent proficiency variables are affected when two forms of the PISA student context questionnaire (StQ) are used. Under this rotation, all students respond to questions in a common part of the questionnaire; half of the students then respond to one of the two rotated parts of the questionnaire, and the other half respond to the other rotated part. Our analyses also address whether results differ depending on how student context constructs are assigned to the rotated forms.
We acknowledge the possibility of using other rotated questionnaire designs. One such design, for example, could involve three rotated forms, with each construct included in two of the three forms. Such an overlap would enable a wider range of subsequent analyses because it would allow calculation of the correlations between the constructs in the different forms. This approach, however, would reduce the additional space gained as a consequence of rotation. Furthermore, because the main aim of this paper is to examine the possible implications of a rotated questionnaire design for the proficiency estimates rather than search for the optimal questionnaire rotation design, we elected to use the two-form design.