International large-scale assessments (ILSA) as comparative education studies have gained prominence in recent decades in global, national, and even local education debates (UNESCO, 2019). Recent studies such as the Trends in International Mathematics and Science Study (TIMSS, e.g., Martin et al., 2020), the Progress in International Reading Literacy Study (PIRLS, e.g., Martin et al., 2017), and the Programme for International Student Assessment (PISA) trace their origins to the 1950s and 1960s, when education became an active field of inquiry for all the social sciences and comparative education began to make increasing use of the more mature and developed social science methods (Anderson, 1961; Henry, 1973). Since these early origins and the associated discussions in comparative educational science about different methodological approaches to the field (see Henry, 1973), the methodology typical of each comparative educational study has evolved over the last few decades. For two decades, the OECD’s PISA studies (e.g., OECD et al., 2009; OECD, 2012, 2014, 2017, 2021a) and the reporting of their results in a recurring three-year cycle have been a notable event in media coverage of widely discussed issues in secondary education (e.g., Grek, 2009). Beyond this media presence of international educational tests such as PISA, comparing one’s own country’s performance with that of higher-scoring countries often tempts political decision-makers to draw educational policy conclusions from this supposedly rock-solid empirical basis and to take ad hoc remedial actions that frequently turn out to be misguided in the long run (e.g., Singer & Braun, 2018).

Although, from a methodological perspective, PISA, TIMSS, and PIRLS are not designed as panel or longitudinal studies, questions of development trends are becoming increasingly prominent in the reporting and reception of ILSA results. Such trends might be used to legitimize national educational reforms (e.g., Fischman, 2019; Johansson, 2016; Grek, 2009). For PISA, such trend observations are justified, despite the cross-sectional design, on the one hand by the representative sampling of fifteen-year-old students at the respective survey period and on the other hand by the relative continuity in the type of data collected. The recurrent PISA results in the three core domains of reading, mathematics, and science are therefore regarded as comparative trend indicators of the performance of the educational systems in the respective participating countries. Apart from the fact that the resulting competitive horse-race communication can be criticized as such (see Ertl, 2020), the question of how to methodically underpin the trend statements is therefore becoming increasingly important (see Singer & Braun, 2018).

The present article aims to investigate to what extent different analytical decisions regarding item calibration, proficiency scaling, and linking of the single ILSA rounds may lead to different statements concerning development trends within and between the participating countries. Specifically, using PISA data collected in the 2003 to 2012 rounds, we examine how different analytic choices in international comparative assessment might contribute to contrasting conclusions about country mean differences in mathematics literacy when examined cross-sectionally and by trend.

In detail, these analytical choices are as follows. First, the selection of country sub-samples for item calibration is varied, considering three different options as factor levels. Second, the selection of the (link) item sample refers to two different sets of items used within PISA from 2003 up to 2012. Third, the estimation method of item calibration is varied by applying two different types of estimation methods. Fourth, we consider two types of linking methods as a basis for the cross-sectional country comparisons and trend analyses. We consider these different analytical choices as potential sources of methodological variance in scaling and data analysis, which may lead to statements deviating from the official reporting concerning the cross-sectional and trend estimates in PISA.

For this purpose, we organized the present article as follows. First, an overview of the official methods for scaling the cognitive data in the PISA large-scale assessment (LSA) is given. The focus here is on the model and estimation method used for item calibration and scaling as well as the principle for linking different PISA rounds as it has been applied in PISA from 2003 up to 2012. We supplement this with some selected examples of empirical findings and theoretical considerations from the literature that critically address this ’official’ methodology used in reporting PISA outcomes so far. In turn, we investigate a strategy for reanalyzing the PISA database covering the cross-sectional assessment data from the beginning in the year 2003 up to the last paper-based PISA assessment in 2012. In the methods section of this paper, we also describe the extensive data preparation procedures in the form of a brief summary. This process of adapting and harmonizing the single cross-sectional data sets, which precedes the actual analysis, is necessary because the coding of student responses, the different naming of specific items with the same content, and the general handling of the data have been subject to numerous changes over the four PISA rounds. However, this adaptation and harmonization is an essential prerequisite for the reanalysis of trend and cross-sectional analyses and may pose a potential burden to other researchers that should not be underestimated when dealing with historical data.

In addition, the four analytical decisions considered in the analyses are presented. Based on the findings from the literature, these analytic decisions refer to the selection of the (link) items, the selection of the calibration sample(s), the estimation method utilized for item calibration, as well as the way of linking different PISA rounds. Finally, the results are discussed against the backdrop of the increasing influence of PISA results on policy decisions and longitudinal trend statements on the development of educational systems.

### Principles in OECD calibration, scaling, linking and trend reporting for PISA 2003 – 2012

Since its first implementation in 2000, the analysis of the data collected in PISA has been based on scaling models from item response theory (IRT). For the PISA rounds 2000 up to 2012, the IRT base model used for item calibration and scaling is the partial credit model (PCM; Masters, 1982), which is an extension of the Rasch model (Rasch, 1960) for polytomous item responses. The probability of a response to item *i* in category *k* (\(k=0, \ldots , {K_i}\), that is, \(K_i + 1\) ordered categories) is given by

$$\begin{aligned} \mathrm {Prob} ( X_i = k | \theta ) = \frac{ \exp ( k \theta + d_{ik} ) }{\sum _{h=0}^{K_i} \exp ( h \theta + d_{ih} ) } \; , \end{aligned}$$

(1)

where \({\theta }\) is the unidimensional ability variable and \({d_{ik}}\) is the difficulty of the *k*’th ’step’ of item *i* (see Masters, 1982, p. 172), with \({d_{i0}}=0\). The denominator sums the exponential terms over all categories \({h}=0, \ldots , {K_i}\), so that the category probabilities sum to one. The specific model used for the multidimensional IRT scaling of the PISA domains was the mixed coefficients multinomial logit model (MCMLM; Adams et al., 1997), which can be seen as a generalization of the unidimensional PCM to model student ability in *D* correlated dimensions \({\theta _1}, \ldots , {\theta _D}\). In the MCMLM (see Adams et al., 1997), the item response of item *i* in category *k* is modeled as

$$\begin{aligned} \mathrm {Prob} (X_{i} = k , \varvec{A} , \varvec{B} | \varvec{\theta }) = \frac{\exp ( \varvec{b} _{ik} \varvec{\theta }+ \varvec{a} _{ik} \varvec{\xi })}{\displaystyle \sum _{h=0}^{K_{i}} \exp ( \varvec{b}_{ih} \varvec{\theta }+ \varvec{a} _{ih} \varvec{\xi })} \; , \end{aligned}$$

(2)

where \(\varvec{\xi }\) is the vector of estimated item parameters, which after reparametrization are the basis for the (mean) item difficulty \({\delta _i}\) (see Eq. 3 below), and known design matrices \(\varvec{A}\) and \(\varvec{B}\) containing all vectors \(\varvec{a} _{ik}\) and \(\varvec{b} _{ik}\) (\(i=1, \ldots , I\), \(k=0, \ldots , K_i\)), respectively. For the complete definition of the population model in PISA, the distribution of the vector of latent variables \(\varvec{\theta }\) is modeled by a multivariate normal density \(f_{ \varvec{\theta }} (\varvec{\theta }; \varvec{\mu }, \varvec{\Sigma })\) with mean vector \(\varvec{\mu }\) and variance matrix \(\varvec{\Sigma }\).
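To make the unidimensional case concrete, the following minimal sketch evaluates the category probabilities of Eq. (1) for a single item. It follows the sign convention of Eq. (1) as printed (exponent \(k\theta + d_{ik}\) with \(d_{i0}=0\)). The function name `pcm_category_probs` and the numerical values are illustrative assumptions and are not taken from the PISA software.

```python
import numpy as np


def pcm_category_probs(theta, d):
    """Category probabilities of Eq. (1) for one item (illustrative sketch).

    theta : ability value on the logit scale
    d     : parameters d_1, ..., d_K of the item; d_0 = 0 is prepended here
    """
    d_full = np.concatenate(([0.0], np.asarray(d, dtype=float)))
    k = np.arange(d_full.size)        # category scores 0, ..., K
    logits = k * theta + d_full       # exponent k*theta + d_k as printed in Eq. (1)
    logits -= logits.max()            # numerical stabilisation; ratios are unchanged
    weights = np.exp(logits)
    return weights / weights.sum()    # normalise so the probabilities sum to one


# Hypothetical three-category item evaluated at an arbitrary ability value
probs = pcm_category_probs(theta=0.5, d=[0.3, -0.4])
print(probs, probs.sum())
```

With a single parameter `d=[0.0]` the function reduces to the dichotomous Rasch case, in which both categories are equally likely at `theta=0`.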

In PISA, this model was used for the official reporting of all rounds from 2000 to 2012 in two steps, preceded by a national calibration carried out separately for each country. The national calibration served to monitor the quality of the data and to provide a basis for deciding how to treat each item in each country. In some cases, this could lead to the removal of an item from the PISA reporting if it had poor psychometric properties in more than ten countries (a “dodgy” item, OECD, 2014, p. 148). In the first step, the international item calibration, often referred to as international scaling in OECD technical reports, the item parameters were determined across countries, with 500 randomly selected students from each OECD country sample serving as the international calibration sample. In the second step, the student abilities were estimated by including an additional conditioning component in the scaling model. For this, \(\varvec{\mu }\) from the population model is replaced by a regression component \(\varvec{y} _n \varvec{\beta }\), where \(\varvec{y} _n\) is a vector for student *n* containing additional student information from the background questionnaire variables, and \(\varvec{\beta }\) is the corresponding matrix of regression coefficients. Note that in this latent regression model, the student abilities are not estimated directly; instead, a posterior distribution for the latent variable is specified from which plausible values (PV) are drawn. This principle of latent regression IRT modeling, which uses auxiliary (student) information to estimate population characteristics, is described by Mislevy et al. (1992) and is based on the principles of handling missing data by multiple imputation (Rubin, 1987), adapted for proficiency estimation based on data from booklet-designed proficiency tests (see also Mislevy, 1991).

Based on such a cross-sectional calibration and scaling approach, the successive chain linking for trend analysis of PISA results across different rounds requires the existence of common items from earlier assessment cycles (see, e.g., Mazzeo and Von Davier, 2013). Typically, the following six steps were performed for linking proficiency measures between different PISA rounds until 2012 (OECD, 2014). In the first step, the item difficulties were calibrated using the calibration sample from the current PISA round, as already mentioned above. In the second step, the obtained item difficulties were shifted by a constant such that the mean of the item parameters of the common items equals that from the previous round or the round to be linked. In the third step, the data set for all OECD countries (in PISA 2012, for example) was scaled twice: once with all items of the respective competence domain and once with the link items only. In the fourth step, for the sample of OECD countries, the difference between the two scalings was removed by applying an additional linear transformation that accounts for differences in means and variances between the two scalings (Gebhardt and Adams, 2007). In the fifth step, the person parameters (abilities) for the current PISA round were estimated, anchored to the initial item parameters (first calibration step). In the sixth and final step, the person parameters were transformed using the transformation constants calculated in steps two and four. As a result of such a linking approach (e.g., Dorans et al., 2007), proficiency estimates from different rounds can be directly compared on the same metric (Mazzeo and Von Davier, 2013).
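The second of the six steps can be illustrated with a small sketch. It only shows the mean shift of the item difficulties on the common items; the later steps (double scaling, linear transformation, plausible values) are not reproduced. The function name `mean_mean_shift`, the item labels, and all difficulty values are hypothetical and chosen purely for illustration.

```python
def mean_mean_shift(delta_current, delta_previous, link_items):
    """Constant that aligns the mean difficulty of the link items across rounds.

    delta_current, delta_previous : dicts mapping item label -> difficulty
    link_items                    : labels of the common (link) items
    """
    if not link_items:
        raise ValueError("at least one link item is required")
    mean_cur = sum(delta_current[i] for i in link_items) / len(link_items)
    mean_prev = sum(delta_previous[i] for i in link_items) / len(link_items)
    return mean_prev - mean_cur


# Hypothetical difficulties (logits) from two separate round-wise calibrations
delta_2003 = {"M1": -0.5, "M2": 0.2, "M3": 1.0}
delta_2006 = {"M1": -0.3, "M2": 0.5, "M3": 1.1, "M4": 0.0}
link_items = ["M1", "M2", "M3"]

shift = mean_mean_shift(delta_2006, delta_2003, link_items)
# Every item of the current round, including the new item M4, is shifted by the same constant
delta_2006_linked = {item: value + shift for item, value in delta_2006.items()}
print(shift, delta_2006_linked)
```

Under a Rasch-type parametrization in which response probabilities depend only on the difference between ability and difficulty, adding the same constant to the abilities keeps the model-implied probabilities unchanged, which is why a single shift is sufficient at this step.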

The official PISA methodology addresses the uncertainty associated with the round-wise calibration when comparing different PISA rounds by taking into account so-called link errors. The basic idea behind the calculation of the link errors in PISA (up to the 2012 round) is to consider the differential item functioning (DIF) of the common items across the PISA rounds to be compared (OECD, 2014), as it results from the international item calibration of each single PISA round. Thus, in order to calculate the link error for the PISA round 2006 compared to the previous round 2003, the differences \({\widehat{\delta }_{i,2006}} - {\widehat{\delta }_{i,2003}}\) of the respective IRT estimated item difficulties \({\widehat{\delta }_i}\) of a set of \(I_0\) common link items are computed first. Under the assumption that the link items used represent a random sample of all possible link items, the link error \({{LE}_{2003, 2006}}\) for trend estimates for country means was then estimated as follows:

$$\begin{aligned} {LE}_{2003, 2006} = \sqrt{\frac{1}{I_0} \sum _{i=1}^{I_0} \left( \widehat{\delta }_{i,2006} - \widehat{\delta }_{i,2003} \right) ^2 } \; . \end{aligned}$$

(3)

This basic principle of linking was retained for all further rounds up to 2012. In addition, however, the clustering of items within units (item stems) was taken into account, as was an item weighting factor reflecting that items with polytomous response formats have a greater influence on the competence scale score than dichotomous ones (see OECD, 2014, pp. 160). The standard error \({SE_{2003,2006}}\) for the difference between two country means from the PISA rounds 2003 and 2006 is determined by the two round-specific standard errors and the link error:

$$\begin{aligned} SE_{2003,2006} = \sqrt{ SE_{2003}^2 + SE_{2006}^2 + {LE}_{2003, 2006}^2 } \; , \end{aligned}$$

(4)

where \({SE_{2003}}\) and \({SE_{2006}}\) denote standard errors for the country means in PISA 2003 and 2006, respectively. Further detailed information and a formal description of the official procedure for determining link errors in the PISA rounds up to 2012 can be found in the technical report on PISA 2012 (see OECD, 2014, pp. 159–163) as well as in Annex A5 of the PISA results volume I (OECD, 2014).
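As a worked illustration of Eqs. (3) and (4), the sketch below computes a link error from paired item difficulties and combines it with two round-specific standard errors. It implements the basic formulas as printed above and ignores the later refinements for item clustering and response format. All function names and numbers are made up for illustration, and the link error is assumed to be expressed on the same metric as the country means.

```python
import math


def link_error(delta_new, delta_old):
    """Eq. (3): root mean squared difference of the link item difficulties."""
    if len(delta_new) != len(delta_old) or len(delta_new) == 0:
        raise ValueError("need the same non-zero number of link items in both rounds")
    squared = [(new - old) ** 2 for new, old in zip(delta_new, delta_old)]
    return math.sqrt(sum(squared) / len(squared))


def composite_se(se_old, se_new, le):
    """Eq. (4): standard error of a difference between two country means."""
    return math.sqrt(se_old ** 2 + se_new ** 2 + le ** 2)


def trend_z(mean_old, mean_new, se_old, se_new, le):
    """Trend estimate divided by its composite standard error."""
    return (mean_new - mean_old) / composite_se(se_old, se_new, le)


# Hypothetical country: 500 points in the earlier round, 508 in the later one
z_without_link_error = trend_z(500.0, 508.0, 2.5, 3.0, le=0.0)
z_with_link_error = trend_z(500.0, 508.0, 2.5, 3.0, le=2.0)
print(z_without_link_error, z_with_link_error)
```

With these made-up values, the trend exceeds the conventional threshold of 1.96 only when the link error is ignored, which illustrates why the size of this additional component matters for the interpretation of trends.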

The link errors determined in this way are then, for example, used to supplement the standard errors of the country means to be compared in analyses of mean differences between countries. It can therefore be said that, at its core, PISA uses a special case of the variance component model (see Robitzsch & Lüdtke, 2019) to determine composite standard errors. In the official analysis and reporting of PISA, this model took into account only the variance component of the international item parameters across the PISA rounds to be compared (i.e., DIF) and, in addition, the clustering and the response format of the items (OECD, 2014).

These analytical decisions and official procedures for calibrating, scaling, linking, and reporting PISA results, briefly outlined here, have inspired critical theoretical discussions and methodological research on the PISA methodology. In the following section, we briefly review some key aspects of this criticism.

### On analytical decisions in large-scale assessments

The analytical principles outlined in the previous section and the resulting official methodological procedures for calibrating, scaling, linking, and evaluating PISA results have attracted various criticisms over time. These refer to different aspects of the applied methodology, each supported by recent empirical findings or simulation outcomes (e.g., Rutkowski et al., 2016; Rutkowski, 2014; Rutkowski, 2011; von Davier et al., 2019; Robitzsch & Lüdtke, 2019; Rutkowski et al., 2019). For example, studies such as Rutkowski (2014) suggest that using a background model for latent regression, besides its theoretically derived advantages (see Mislevy et al., 1992), can also be an additional source of error variance of uncertain extent. Specifically, Rutkowski (2014) shows that the misclassification of subjects based on deficient background information results in mean differences between groups being significantly underestimated or overestimated, which can also be interpreted as an under- or overestimation of variance in relation to the entire population. Thus, although using a background model in the scaling of ability estimates is currently a standard procedure in many large-scale assessments, this approach can also be criticized. This criticism is usually based on the suspected and sometimes empirically demonstrated poor quality of the questionnaire data used in such latent regression models (see, e.g., Hopfenbeck & Maul, 2011). Typically, this poor quality results from the high proportion of missing values (e.g., Rutkowski, 2011; Grund et al., 2021).
In contrast, almost paradoxically, the introduction of the latent regression model is motivated precisely by the aim of increasing the estimation accuracy of the response model parameters in the presence of missing values caused by rotated booklet designs, as well as missing student responses in the cognitive assessment materials (see, e.g., Mislevy et al., 1992; Rubin, 1987; Mislevy, 1991). However, in the current practice of scaling PISA data using latent regression models, the necessary prerequisite of complete background questionnaire data is met by the rather weak missing indicator method (MIM; Cohen & Cohen, 2003), which has been shown to be inadequate and prone to bias if the background variables are not missing completely at random (e.g., Schafer & Graham, 2002; Grund et al., 2021). The method of parameter estimation typically associated with latent regression models is marginal maximum likelihood (MML) estimation. The efficiency of MML estimation based on this full information approach rests on the theoretical result that, asymptotically (as the sample size approaches infinity), no other estimator provides parameter estimates with smaller variances (e.g., Forero & Maydeu-Olivares, 2009). Under the assumptions of multivariate normality (but see Xu & von Davier, 2008, for modeling deviations from normality) and a correctly specified model, the latent variable model parameters are consistently estimated by simultaneous equation methods, for instance, full information maximum likelihood (FIML) (Lance et al., 1988). However, Lance et al. (1988) pointed out that full information estimation methods may also have drawbacks.
For example, a key requirement for the superior efficiency of ML methods based on full information is that the model is correctly specified. Specifically concerning the likelihood function, Johnston and Dinardo (1997) noted that “If we maximize the wrong function, we get biased estimates” (p. 427). Moreover, in the not unlikely case of a (partially) misspecified model, especially in the social and behavioral sciences, the effects of misspecification can spread across the estimates of the model parameters (Kumar and Dillon, 1987). The almost epistemological question about the ’truth’ of models in general, and especially of models in social and behavioral science, is treated very thoroughly by Stachowiak (1973) in his general model theory (see also Heine, 2020). According to this theory, models as such, and psychometric models in particular, are essentially characterized by their imaging feature, their shortening feature, and their pragmatic feature (Stachowiak, 1973, pp. 131–133). Thus, in the social and behavioral sciences, according to the imaging and shortening features, a true model is unlikely to exist regardless of its complexity, and any model will therefore virtually always be misspecified for empirical data, albeit to varying degrees. Somewhat more pointedly, the statistician George Box (1979) expressed this fact by stating “All models are wrong, but some are useful” (Box, 1979, p. 202). In particular, the usefulness of models, which refers to the pragmatic feature defined by Stachowiak (1973), must be the focus when using psychometric models for scaling LSA data because the declared goal here is to establish an objective scoring rule for the item responses in order to allow a fair comparison (see also Robitzsch & Lüdtke, 2021, for a design-based perspective).
The aspects briefly outlined here concerning the appropriate degree of detail and the associated extent of tolerable misspecification of psychometric models for the scaling of LSA data are closely related to the question of suitable estimation procedures for their model parameters (see, e.g., MacCallum et al., 2007). As an alternative to MML (i.e., FIML) for the estimation of latent trait models with ordinal indicators (the item responses), Forero and Maydeu-Olivares (2009) suggest the use of limited information (LI) methodology for estimation (see also Bolt, 2005). Such LI methodology is associated with the tradition of factor analysis (e.g., Forero & Maydeu-Olivares, 2009; McDonald, 1999), and parameter estimation, instead of using complete information, relies only on univariate and bivariate information in the data (Maydeu-Olivares, 2001; Edwards and Orlando Edelen, 2009). Furthermore, in the LI methodology, within the framework of factor analysis, in addition to the possibility of ML estimation of the parameters, there is the alternative of ordinary least squares (OLS) estimation, which has favorable statistical properties regarding robustness to model misspecification. Even if the sampling error is neglected (by assuming an infinite sample size), the model error (as outlined above), which represents a lack of fit of the (thus misspecified) model in the population, is still very likely to be present. MacCallum et al. (2007) emphasize that ML estimation, in contrast to OLS estimation, is based on the assumption that the model is exactly correct in the population and that all error is normal theory random sampling error. Put simply, the ML estimation method ignores the possible existence of a model error or the associated misspecification of the model in relation to the empirical data.
In contrast, with the OLS estimate, no distributional assumptions are made, and no assumption is made about sampling error versus model error (MacCallum et al., 2007), which in turn makes OLS likely to be more robust against a possibly misspecified scaling model. In a comparative analysis addressing the question of estimation accuracy, Forero and Maydeu-Olivares (2009) show that LI and ML methods yield comparable IRT model parameter estimates. Specifically, the LI method (using OLS) provided slightly more accurate parameter estimates, and the ML method provided slightly more accurate standard errors (Forero and Maydeu-Olivares, 2009). An item parameter estimation method for Rasch-type item response models that can be attributed to the LI methodology is the PAIR algorithm (cf. Robitzsch, 2021a; Heine, 2020). This calibration approach was introduced by Choppin (1968; see also McArthur & Wright, 1985) as a sample-free calibration method for item banks in large-scale assessments, within the context of early approaches to comparative education. Choppin’s row averaging approach (RA) is based on pairwise information. It has the advantage of enabling a non-iterative identification of item parameters for the Rasch model and the PCM (Choppin, 1968; Heine and Tarnai, 2015). Moreover, the pairwise RA method, as with other LI methods like pairwise conditional maximum likelihood (PCML) or the minimum chi-square method (MINCHI), reduces the computational demand for item parameter identification based on large LSA data sets (Robitzsch, 2021a). Compared to PCML, the RA approach within the LI methodology provides OLS estimators for the item parameters (cf. Mosteller, 1951b; Mosteller, 1951a; Mosteller, 1951c; Heine, 2020). As a result of a systematic comparison of several LI estimation approaches against other methods for the Rasch model, Robitzsch (2021a) concludes that RA and similar LI methods can be beneficial in applied research. This benefit rests on the finding from the systematic comparison that RA and similar LI methods can result in less biased item parameter estimates than ML-based methods, given possible model misspecification and local dependencies in the empirical data (Robitzsch, 2021a; see also Forero and Maydeu-Olivares, 2009).
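The non-iterative character of row averaging can be seen in a minimal sketch for the dichotomous Rasch case. It uses one common formulation of the pairwise idea: if \(n_{ij}\) counts the persons who solved item *i* but not item *j*, then \(\ln (n_{ji}/n_{ij})\) estimates the difficulty difference between items *i* and *j*, and averaging these log ratios over *j* yields sum-zero item difficulties. The function names are hypothetical, the sketch assumes that every item pair has non-zero counts in both directions, and it is not meant to reproduce the PAIR algorithm or any specific software implementation.

```python
import numpy as np


def pairwise_counts(responses):
    """n[i, j] = number of persons with item i correct (1) and item j wrong (0).

    responses : persons x items array of 0/1 scores; np.nan marks items not
                administered (e.g., booklet design) and is ignored in the counts
    """
    x = np.asarray(responses, dtype=float)
    correct = (x == 1).astype(float)
    wrong = (x == 0).astype(float)
    return correct.T @ wrong


def row_averaging(counts):
    """Sum-zero item difficulties from pairwise counts (illustrative sketch)."""
    n = np.asarray(counts, dtype=float)
    n_items = n.shape[0]
    off_diagonal = ~np.eye(n_items, dtype=bool)
    if np.any(n[off_diagonal] <= 0):
        raise ValueError("every item pair needs non-zero counts in both directions")
    n_safe = n + np.eye(n_items)             # avoids 0/0 on the diagonal
    log_ratio = np.log(n_safe.T / n_safe)    # entry (i, j) = ln(n_ji / n_ij)
    return log_ratio.mean(axis=1)            # row average, no iteration required


# Tiny hypothetical data set: three persons, two items
scores = [[1, 0], [1, 0], [0, 1]]
print(row_averaging(pairwise_counts(scores)))
```

In this toy example the first item is solved more often than the second, so it receives the lower difficulty estimate. Because only pairwise counts enter the computation, the cost of the estimation step depends on the number of items rather than on the number of students.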

Another area in the discussion about the evaluation methodology of cross-sectional LSA data relates to the longitudinal linking of different rounds of the assessment (e.g., Robitzsch & Lüdtke, 2019; Oliveri & von Davier, 2011; Fischer et al., 2019; Gebhardt & Adams, 2007). Specifically, the principle of a successive chain linking approach, as used in the PISA rounds up to 2012, was criticized, for example, by Oliveri and von Davier (2011, 2014). They, in turn, argued for a concurrent calibration approach that includes all data from the previous PISA rounds. Such an approach was applied, for example, by von Davier et al. (2019) to historical PISA data and was first introduced in the official PISA evaluation from the 2015 round onwards (OECD, 2017, 2021a). Von Davier et al. (2019) conclude from their study that changing the linking method had an impact on the country mean results but not on the ranking of the cross-sectional country means. Their analyses showed that the Spearman rank correlations for the mathematics competency area were \({r_{s}}=0.994\) for the respective cross-sectional country means across all analyzed PISA rounds, a finding that von Davier et al. (2019) view as an indication of a valid or method-invariant country comparison in PISA. However, such an invariant cross-sectional rank order may not be sufficient for evaluating trend estimates. Trend estimates for a country are typically interpreted only if they are statistically significant. If the choice of an analysis method shifts a country’s mean by 1 or 2 points on the PISA scale, this might be consequential for the interpretation of trend estimates. In particular, it could turn a statistically non-significant trend estimate into a significant one, which, in turn, attracts policy attention.

Although Fischer et al. (2019) found few differences among linking methods and anchoring designs for the longitudinal linking of competence tests in large-scale assessments scaled with the Rasch model, Robitzsch and Lüdtke (2019) demonstrated that the interpretation of national trend estimates could change when different approaches for linking and different procedures to calculate standard errors are applied. In addition, Gebhardt and Adams (2007) emphasize the influence of item calibration based on different samples, that is, calibrating the items separately for each country as compared to the linear transformation approach, which uses a common set of item parameter estimates for all countries. Specifically, Gebhardt and Adams (2007) showed that the use of conditional rather than marginal means as a linking approach results in some differing conclusions regarding trends at both the country and within-country level.

Connected to the question of an appropriate linking approach for longitudinal comparisons is the question of the sampling or selection of subjects and items on which calibration and scaling are based. The question of the appropriate calibration sample and its effects on the competence measurement is proving to be increasingly relevant, especially against the background of the increasing expansion of LSAs to other populations or states and economies (e.g., Rutkowski et al., 2019; Rutkowski & Rutkowski, 2021; Rutkowski et al., 2018). The relevance of selecting an appropriate calibration sample results from the fact that, for example, the PISA measuring instruments were originally developed for OECD member countries and are now increasingly used for surveys in emerging and developing countries (Rutkowski et al., 2019; Rutkowski and Rutkowski, 2021). Typically, around half of the existing PISA items turn out to be too difficult for these new PISA participants (Rutkowski et al., 2018), which means that the measurement of competence in low-performing educational systems can be subject to distortions (Rutkowski and Rutkowski, 2021). From a technical perspective, such distortion results from floor effects in the item responses (e.g., Rutkowski et al., 2019), which ultimately reflect sub-optimal test targeting for certain populations, resulting from item calibration based on a more competent population than the target population.

The choice of items is an important factor in cross-sectional and longitudinal country comparisons insofar as it has a significant impact on the standard errors of the estimates of competence (Glas and Jehangir, 2013; Robitzsch, 2021c; Robitzsch and Lüdtke, 2019). Generalizability theory (cf. Brennan, 2001) defines several facets for which generalization of results appears necessary. If tests are to generalize not only to the specific set of items used in the test but to a potential universe of items in a performance domain, the source of variation in item selection (i.e., item sampling) must be taken into account. In educational research and large-scale tests, the idea of viewing the single items in a test as a realized subset of an ultimately infinite universe of items was introduced early on (e.g., Husek & Sirotnik, 1967; Lord & Novick, 1968). Regarding longitudinal analyses, Hutchison (2008) as well as Michaelides (2010) point out that the choice of link items between multiple studies should preserve the interpretation of an item sampling. If the number of link items is too small or the specific choice of link items is not representative of the entire item set, biased estimates of performance trends may result (Mazzeo and von Davier, 2008; van den Heuvel-Panhuizen et al., 2009).

Based on this exemplary and by no means exhaustive presentation of selected findings from the methodological literature on LSAs, it can be stated in summary that different methodological approaches might lead to slightly different population estimates. This phenomenon can be described as method variance.

As already described above, official PISA reporting, particularly for reporting trends in country means, is based on a variance component model to construct composite standard errors to reflect the overall uncertainty in measurement. It must be noted that this approach of composite standard errors constructed for specific comparisons of statistics, such as country means, does not necessarily follow the classical definition of the standard error as a single unique measure of dispersion \({\hat{\vartheta }}\) for a single estimation function \({\hat{\theta }}\) for a parameter \({\theta }\) estimated for the population. Rather, different sources of variance (\({\hat{\vartheta }_{1}}, {\hat{\vartheta }_{2}}, ...\)) are assumed, which are summed to derive the final (constructed) standard error in order to quantify the overall uncertainty of measurement in the PISA LSAs. Against the background of such a model, however, the question immediately arises as to which are the relevant variance components in a typical LSA setting.

Specifically, Robitzsch et al. (2011) argue that in a concept of generalizability, (at least) three facets in testing play an important role: the sampling or selection of subjects, the sampling or selection of items, and the choice of statistical models. The empirical findings by Robitzsch et al. (2011) indicate that the sources of variation in item sampling and model choice, which are usually neglected in publications as compared to the sampling of respondents, are not negligible. More recently, concerning item selection for linking different PISA rounds, Robitzsch and Lüdtke (2019) conclude from simulation results as well as from a reanalysis of trend estimates for reading from PISA 2006 to PISA 2009 that the PISA method underestimates the standard error of the original trend estimate. Specifically, the number of countries with significant trend changes decreased from 13 to 5 when using a newly proposed method for determining standard errors compared to the official PISA standard error (Robitzsch and Lüdtke, 2019).

Despite the extensive evidence on single aspects of the methodology of the official PISA data analysis, excerpts of which are reported here, there is, to our knowledge, no comparative study that shows the relative importance of these single analytic choices with respect to the error component against each other. In this study, we will add another source of variance to the standard errors constructed within the framework of a variance component model for the measurement error.

From a methodological perspective, such a comparative analysis is interesting and important, as it can contribute to placing the relevance of single analytical decisions concerning future PISA data evaluations on an empirical basis. From a practical perspective, the findings from the reanalysis of the PISA data, taking into account the key factors of methodological variance identified here, can help to assess more realistically the significance of small country mean differences, both in a cross-sectional and in a longitudinal comparison.

To quantify this additional variance component resulting from the analytical decisions on methods in evaluating PISA results, we will follow two strategies. The first strategy borrows from the principle used in the official PISA reporting of calculating and using composite standard errors when looking at trends, taking into account a linking error as an additional error component (cf. Eq. 4). Second, for the definition of an extended confidence interval for the country means from the PISA results, we will adopt a strategy proposed by Leamer and Leonard (1983) for evaluating the maximum upper and lower bounds of estimates from different regression models (see also McAleer et al., 1985; Leamer, 1985). In the subsequent method section, we will present these two approaches in more detail.