Does early tracking affect learning inequalities? Revisiting difference-in-differences modeling strategies with international assessments
Large-scale Assessments in Education, volume 8, Article number: 14 (2020)
Abstract
The development of international surveys on children’s learning like PISA, PIRLS and TIMSS—delivering comparable achievement measures across educational systems—has revealed large cross-country variability in average performance and in the degree of inequality across social groups. A key question is whether and how institutional differences affect the level and distribution of educational outcomes. In this contribution, we discuss the difference-in-differences strategies employed in the existing literature to evaluate the effect of early tracking on learning inequalities, exploiting international assessments administered at different ages/grades. In their seminal paper, Hanushek and Woessmann (Econ J 116:C63–C76, 2006) analyze with two-step estimation the effect of early tracking on overall inequalities, measured by test score variability indexes. Later work by other scholars in the economics and sociology of education focuses instead on inequalities among children of different family background, using individual-level models on pooled data from different countries and assessments. In this contribution, we show that individual pooled difference-in-differences models are quite restrictive and that in essence they estimate the effect of tracking by double differencing the estimated cross-sectional family background regression coefficients between tracking regimes and learning assessments. Starting from a simple learning growth model, we show that if test scores at different surveys are not measured on the same scale, as occurs for international learning assessments, pooled individual models may deliver severely biased results. The scaling problem, instead, does not affect the two-step approach. For this reason, we suggest using two-step estimation also to analyze family-background achievement inequalities.
Against this background, using PIRLS 2006 and PISA 2012 we conduct two-step analyses, finding new evidence that early tracking fosters both overall inequalities and family background differentials in reading literacy.
Introduction
In spite of the fundamental principle that all children should have the same learning opportunities, large differentials are observed among socioeconomic and demographic groups in the share of students attending academic upper secondary programs and obtaining tertiary education (Jackson 2013). Alongside inequalities in educational attainment, national and international standardized learning assessments have highlighted substantial differentials across social groups also in children’s levels of competences and curricular knowledge at earlier stages of schooling. The persistence of educational inequalities is an issue of major concern among social scientists, both as a problem of social justice per se and for its societal and economic consequences. In fact, the literature emphasizes education as one of the major factors affecting the degree of income inequality (De Gregorio and Lee 2002) and social cohesion (Green et al. 2006), and there is ample evidence that the cognitive skills of the population and their distribution strongly affect economic growth (Hanushek and Woessmann 2015).
The development of international surveys on children’s learning like PISA, PIRLS and TIMSS—delivering comparable achievement measures across educational systems—has revealed large cross-country variability in average performance and in the degree of inequality across social groups. A key question is whether and how institutional differences affect the level and distribution of educational outcomes. By exploiting the institutional variability existing at the cross-national level, international assessments make it possible to investigate empirically the role played by the characteristics of school systems (for extensive reviews, see Hanushek and Woessmann 2011; Woessmann 2016).
The age of tracking is indubitably the institutional feature that has raised the greatest debate. Tracking occurs when children choose between (or are placed into) different school types to follow educational programs with different prestige levels and learning targets. The age of formal tracking varies greatly across countries: from age 10 in many German states to age 16 in the UK and in Nordic European countries. The American and Canadian schooling systems, instead, are comprehensive up to the end of secondary school, at age 18. Arguments in favor of early tracking relate to the potential advantages of instruction with homogeneous groups of children. Opponents of early tracking argue that it fosters educational inequalities. First, children of higher socioeconomic background, receiving more familial support, tend to be more motivated and to perform better even at a young age. Thus, early tracking exposes young children to learning environments that are homogeneous in terms of both ability and socioeconomic composition. If peer effects operate, this segregation could be detrimental to children from disadvantaged backgrounds. Second, children of disadvantaged backgrounds are less likely to choose the academic track (and thus to be exposed to more ambitious learning content) even at similar levels of prior performance (Jackson 2013). A strong influence of families on their offspring’s educational choices—likely to enhance social origin inequalities because costs and benefits may be evaluated differently across backgrounds and because of information asymmetries—is more likely to occur when tracking takes place at an early age and with weaker ability restrictions (Checchi and Flabbi 2013; Contini and Scagni 2011).
Because of its relevance, many scholars have analyzed the effect of tracking on achievement. Some studies exploit educational reforms put into effect in particular regions or countries (Meghir and Palme 2005 on Sweden; Malamud and Pop-Eleches 2011 on Romania; Piopiunik 2014 on Bavaria; Kerr et al. 2013 on Finland). However, specific institutional reforms are implemented only in a few countries and typically only once, so the impact of institutions cannot always be investigated in this way. Moreover, one must rely on before-and-after comparisons that may confound the effects of policies with other country and cohort effects (Brunello and Checchi 2007); even when they have high internal validity, the findings may not be easily generalized to different contexts.
Other studies exploit cross-country institutional variability and use the international learning assessments to estimate educational production functions, i.e. individual-level models of achievement, on data pooled from all countries. A number of contributions focus on the effect of tracking on family background inequalities at a given age or stage of schooling (e.g. Brunello and Checchi 2007; Schuetz et al. 2008; Horn 2009; Woessmann 2010; Bol et al. 2014; Chmielewski and Reardon 2016). However, evaluating the impact of institutions by exploiting cross-country variability is problematic with cross-sectional data, because of the difficulty of controlling for unobserved system-level factors potentially affecting inequalities at all schooling stages.
For this reason, in their seminal work Hanushek and Woessmann (2006) propose to use two cross-sectional surveys held at different ages or grades and employ a difference-in-differences strategy. This method, commonly used in econometric analyses, solves the problem of unobserved country-level heterogeneity by analyzing the outcome variable at two or more time points and examining the extent to which differences over time vary between treated and control units (hence, difference-in-differences). The underlying assumption is that if no causal effect were at work the two groups would experience the same change over time: evidence of a divergence in this change signals a treatment effect. In particular, Hanushek and Woessmann (2006) apply difference-in-differences to test score variability indexes, finding that variability increases in early tracking countries relative to late tracking countries. More recently, other scholars have adapted their approach to analyze how early tracking affects learning inequalities across social groups by applying difference-in-differences to family-background differentials (e.g. Waldinger 2007; Jakubowski 2010; Ammermueller 2013; Ruhose and Schwerdt 2016).
Hanushek and Woessmann (2006) use two-step estimation: in the first step, they estimate the variability indexes for each country and survey; in the second step, they analyze variability at t = 2 as a function of variability at t = 1 and the early tracking indicator. As described below in more detail, the parameter of interest is the coefficient of early tracking in this second step. The later studies, instead, pool together the microdata from all countries and both assessments, and estimate individual-level achievement models with individual and system-level explanatory variables (including the tracking regime), the time of the survey, various two-way interaction terms and a three-way interaction term between family background, time of the survey and the early tracking indicator. The coefficient of the latter term is meant to capture the effect of tracking on family background inequalities (i.e. the difference between the variation in the family background coefficients over time in early and late tracking regimes).
The comparative behavior of estimates from individual pooled-data models and two-step strategies in standard cross-sectional studies has been the object of recent methodological work (Heisig et al. 2017; Bryan and Jenkins 2016). In this paper, we analyze these strategies when applied to difference-in-differences modeling. Our aim is to compare two-step and pooled individual models in terms of their capacity to deliver meaningful findings on the effect of institutional features on family-background achievement inequalities. More specifically, we address an issue that to our knowledge is completely missing from the sociology and economics of education literatures, related to the fact that test scores released by different international assessments are not vertically equated, i.e. achievement is not measured on the same scale as children grow up. We demonstrate that when the dependent variable follows different metrics over time, difference-in-differences estimation on pooled individual models relies on unnecessary and often untenable constraints, and thus may yield meaningless findings. By contrast, we show that this issue does not affect the two-step estimation strategy.
Against this background, employing data on reading literacy from PIRLS 2006 and PISA 2012, we carry out an empirical analysis of the effect of tracking on learning inequalities in reading literacy, using two-step analysis. First, we replicate the analysis proposed by Hanushek and Woessmann (2006) on the test score standard deviation with more recent data; second, we analyze how tracking affects inequalities among children of different socioeconomic origins. Altogether, we provide new evidence that early tracking contributes to increasing overall variability and, in particular, the gap between children of different social backgrounds.
The remainder of the paper is organized as follows. In the next section, we describe the difference-in-differences strategies employed in the existing literature to evaluate institutional effects on achievement inequalities. We start by describing the two-step approach employed by Hanushek and Woessmann (2006), who analyze the effect of early tracking on country-level variability measures, and then move to the individual pooled models used to study the effects on family background learning inequalities. We show that individual pooled models are quite restrictive and that in essence they estimate the effect of tracking by double differencing the (cross-sectional) family background regression coefficients between tracking regimes and learning assessments. In the following section, we address the scaling issue: starting from a simple learning growth model, we outline the mechanisms at play and show that if test scores at different surveys are not measured on the same scale—as occurs for international learning assessments—differencing cross-sectional regression coefficients conveys little information on how inequalities develop as children grow older. We then analyze how the scaling issue affects the results of individual pooled difference-in-differences models and demonstrate that the estimates of institutional effects delivered by pooled individual models may be severely biased. Next, extending the simple approach of Hanushek and Woessmann (2006) to the analysis of the effect of early tracking on family background inequalities, we propose a more flexible two-step estimation strategy, first describing individual achievement differentials within countries and then relating family-background regression coefficients to institutional variables. Finally, we describe our empirical analysis and discuss the results. Conclusions follow.
Literature review
International learning surveys were designed to evaluate education systems by testing the skills and knowledge of students of different ages in different domains. The Programme for International Student Assessment (PISA) evaluates reading literacy, mathematics and science among children of age 15 (OECD 2014). The Progress in International Reading Literacy Study (PIRLS) focuses on pupils in grade 4 (Mullis et al. 2012a) and the Trends in International Mathematics and Science Study (TIMSS) on pupils in grades 4 and 8 (Mullis et al. 2012b). By providing comparable measures of competencies across countries, these international learning surveys are increasingly employed to analyze how educational systems affect achievement (Hanushek and Woessmann 2011; Woessmann 2016). In this section, we review the empirical strategies most frequently adopted in the literature to evaluate the effects of system-level features on achievement inequalities and compare difference-in-differences strategies in terms of their underlying assumptions and restrictions.
A number of contributions analyze test scores delivered by a single assessment administered at a given age or stage of schooling. While some studies focus on the effects of educational institutions (e.g. tracking, central examinations, school autonomy) on mean performance (Woessmann 2005, 2010; Fuchs and Woessmann 2007), others analyze the effects on inequality of opportunity, operationalized as family-background performance differentials (Brunello and Checchi 2007; Schuetz et al. 2008; Horn 2009; Woessmann 2010; Bol et al. 2014; Chmielewski and Reardon 2016). Focusing on the effect of early tracking, Schuetz et al. (2008) and Horn (2009) report a substantive negative effect of tracking on social background inequalities in children’s performance, whereas Brunello and Checchi (2007) find the opposite effect on adults’ cognitive skills. Bol et al. (2014) investigate how central examinations affect the association between tracking and family background inequalities. Chmielewski and Reardon (2016) provide evidence that tracking also enhances income-related achievement inequalities. A two-step approach is employed in some cases (Schuetz et al. 2008; Woessmann 2010; Chmielewski and Reardon 2016): in the first step, the parameter of interest is estimated separately for each country with individual-level achievement models; in the second, the relation between this parameter and system-level features is analyzed with a simple country-level model. Other scholars, instead, pool together the international data and estimate individual achievement models with institutional features as country-level explanatory variables (Woessmann 2005; Fuchs and Woessmann 2007; Schuetz et al. 2008; Bol et al. 2014). Models focusing on inequalities also include an interaction term between family background and institutional features: the parameter of interest is the coefficient of this interaction, capturing how family background differentials vary with educational institutions.
Hence, although apparently substantially different, what two-step and pooled individual models do in essence is compare family-background regression coefficients across educational systems.
However, models based on a single learning assessment are open to criticism because they do not allow controlling for other cross-country institutional, cultural and societal differences affecting inequalities even before tracking takes place. To overcome this problem, Hanushek and Woessmann (2006) propose difference-in-differences modeling exploiting surveys held at different stages of the schooling career, in order to study how inequality evolves in early tracking countries relative to late tracking countries. This strategy allows controlling for unobserved system-level factors affecting learning inequalities already existing before the first survey. More specifically, Hanushek and Woessmann (2006) use PIRLS (4th grade) + PISA (age 15) to investigate the effects of tracking on reading literacy and TIMSS (4th grade) + TIMSS (8th grade) to investigate the effects on math. The rationale is that while in 4th grade children are still in comprehensive school everywhere, by 8th grade (or at age 15) they have already been tracked in some countries but not in others. The focus is on the effect of early tracking on the overall variability of test scores across individuals (measured by the standard deviation and selected interpercentile ranges). Using two-step estimation, they find that in tracked systems variability increases over time relative to untracked ones, and conclude that early tracking increases learning inequalities.
Drawing on this idea, a number of scholars (Waldinger 2007; Jakubowski 2010; Ammermueller 2013) employ difference-in-differences strategies to analyze the effect of early tracking or other educational institutions on achievement differentials across social origins. Interestingly, these papers reach conflicting conclusions. Similarly, Ruhose and Schwerdt (2016) use difference-in-differences to study the effect of early tracking on achievement inequalities related to migration background. Differently from Hanushek and Woessmann (2006), these scholars do not rely on two-step estimation; instead, they employ an extended version of the individual-level model, estimated on data pooled from all countries and both assessments. The dependent variable is the test score; explanatory variables include family background, institutional characteristics (most often, an indicator of early tracking), the timing of the assessment and all two- and three-way interaction terms between these variables. The coefficient of the three-way interaction is intended to capture the extent to which family background inequalities vary over time in educational systems with certain characteristics (e.g. early tracking) relative to educational systems with other characteristics (e.g. late tracking). We will show that due to the different scaling of test scores in the different assessments, this strategy may deliver strongly biased results.
Before moving to the examination of the difference-in-differences models in the existing literature, it is useful to review how inequality is conceived and operationalized in this literature.

Overall achievement inequality focuses on differences among individuals, regardless of their characteristics. It can be measured by any variability index, for example the test scores’ standard deviation or differences between selected percentiles of the achievement distribution (Hanushek and Woessmann 2006).

Inequality of opportunity between family backgrounds focuses on average differences between children of different family backgrounds—usually conceived as social background or, less frequently, as ethnic or migratory background. It can be measured by the family background regression coefficient in a regression model with other exogenous individual characteristics as controls.
How do these two measures relate? Let \(\gamma\) be the family background coefficient at a given survey. In a stylized model with only one explanatory variable, under the usual OLS assumptions: \({\sigma }_{y}^{2}={\gamma }^{2}{\sigma }_{x}^{2}+{\sigma }_{\varepsilon }^{2}\). Hence, overall inequality \(\sigma_{y}^{2}\) depends on the family-background-specific effect (\(\gamma\)), on the variability of family background in the population (\({\sigma }_{x}^{2}\)), and on the influence of other factors independent of family background (\(\sigma_{\varepsilon }^{2}\)). This simple expression shows that overall achievement inequality and family background inequalities are distinct phenomena: they are related, but their relation need not be strong.
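The decomposition above can be checked numerically. Below is a minimal simulation sketch in Python; the values of \(\gamma\), \(\sigma_x\) and \(\sigma_\varepsilon\) are illustrative assumptions, not estimates from any survey:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative values (not estimates from any assessment):
gamma = 0.4       # family background coefficient
sigma_x = 1.5     # SD of family background in the population
sigma_eps = 0.8   # SD of factors independent of family background

n = 1_000_000
x = rng.normal(0.0, sigma_x, n)
eps = rng.normal(0.0, sigma_eps, n)
y = gamma * x + eps

# Overall inequality decomposes as gamma^2 * sigma_x^2 + sigma_eps^2.
implied_var = gamma**2 * sigma_x**2 + sigma_eps**2
print(round(float(y.var()), 2), round(implied_var, 2))  # both close to 1.0
```

With these values the implied overall variance is 0.4² · 1.5² + 0.8² = 1.0, which the simulated scores reproduce closely.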
Overall inequalities: Hanushek and Woessmann’s seminal paper
In their seminal paper, Hanushek and Woessmann (2006) analyze the effect of early tracking on overall achievement inequalities, as measured by variability indexes like the scores’ standard deviation. More specifically, they use two-step estimation: (i) in step 1, they estimate the SD in each country and at each assessment; (ii) in step 2, they examine the relation between the SD at t = 2 and the institutional variable I indexing early tracking, given the SD at t = 1. In particular, they estimate the simple linear model:

$$SD_{2c} = a + b\,SD_{1c} + d\,I_{c} + u_{c} \quad (1)$$
where subscript c denotes the country and 1 and 2 index the time of the survey. \(I\) is the binary variable indexing early tracking and \(u\) captures countrylevel unobserved characteristics affecting how inequalities develop between late primary school (t = 1) and secondary school (t = 2).
The effect of tracking is represented by \(d\), the average difference in the level of inequality at t = 2 between tracked and untracked systems, given the level of inequalities already existing at t = 1. The advantage relative to models based on single surveys is that, by conditioning on the SD at t = 1, unobserved factors influencing inequalities developed up to t = 1 are controlled for. However, (1) does not control for unobserved system-level factors affecting the development of inequalities between the two surveys. The identifying assumption is that \(u\) is orthogonal to the tracking regime; in other words, inequality changes between t = 1 and t = 2 should depend only on tracking or on other system-level features uncorrelated with the tracking regime.
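The two-step logic can be sketched with simulated country-level data. Everything below is illustrative: the SDs, the tracking assignment and the effect d = 8 are assumptions, not figures from PIRLS or PISA:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data for 40 hypothetical countries (not real PIRLS/PISA figures):
n_countries = 40
sd1 = rng.normal(70.0, 4.0, size=n_countries)      # score SD at t = 1
I = (rng.random(n_countries) < 0.5).astype(float)  # early-tracking dummy

# Simulate the step-2 relation with an assumed tracking effect d = 8.
a, b, d = 10.0, 0.9, 8.0
sd2 = a + b * sd1 + d * I + rng.normal(0.0, 1.0, size=n_countries)

# Step 2: regress the SD at t = 2 on the SD at t = 1 and the tracking dummy.
X = np.column_stack([np.ones(n_countries), sd1, I])
coef, *_ = np.linalg.lstsq(X, sd2, rcond=None)
print(coef[2])   # estimated tracking effect, close to the assumed d = 8
```

Given the SD at t = 1, the third coefficient recovers d: early tracking countries end up with systematically larger dispersion at t = 2.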
Family background inequalities: pooled individual models
In the existing literature, the analyses of institutional effects on family background achievement inequalities follow a different modeling strategy. Individual data on different countries and assessments are pooled together, and test scores are assumed to vary with individual variables including family background, the assessment, and institutional characteristics. The strength of the family background coefficient is allowed to vary according to these institutional features.^{Footnote 1}
The simplest model adopted in the literature (Waldinger 2007; Jakubowski 2010; Ruhose and Schwerdt 2016) is:
Model M1

$$Y_{itc} = \alpha_{0c} + \alpha_{1}T + \alpha_{2}I_{c}T + \left(\xi_{1} + \xi_{2}T + \lambda_{1}I_{c} + \lambda_{2}I_{c}T\right)F_{ic} + X^{\prime}_{itc}\beta + \varepsilon_{itc}$$
where \(Y\) is the measure of achievement, F is family background, I is the country-level binary variable indexing the early tracking regime, X is a vector of individual controls, and T is a binary variable indexing the secondary school survey. Subscripts i, c and t refer to the individual, country and survey; thus, \(Y_{i1c}\) is the test score in primary school and \(Y_{i2c}\) is the test score in secondary school. Several individual (or school-level) controls and a country-level error component may also be included, but are not shown here for simplicity. The intercept \(\alpha_{0c}\) is a country-specific fixed effect (estimated with country dummy variables), thus it need not be independent of the other explanatory variables. The parameter of main interest is \(\lambda_{2}\), the coefficient of the three-way interaction term. Denote the family background coefficients at t = 1 and t = 2 as \(\gamma_{1}\) and \(\gamma_{2}\). The following relations hold: \(\gamma_{1} = \xi_{1} + \lambda_{1} I\) and \(\gamma_{2} = (\xi_{1} + \xi_{2}) + (\lambda_{1} + \lambda_{2})I\). The identifying assumption is that the achievement gap among family backgrounds at both surveys may vary across countries only depending on the tracking regime. Unobserved country-level characteristics may influence mean achievement, but may not affect family-background differentials.
Additional restrictions, which also apply to the following model M2, are that the individual error term has the same variance across countries and that the coefficients of all other control variables are fixed across surveys and countries. This may be a substantial limitation: as shown by Guiso et al. (2008) and Penner (2008), for example, gender inequalities differ greatly across countries.^{Footnote 2}
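The three-way interaction logic of model M1 can be illustrated on simulated pooled data. All parameter values below are assumptions chosen for the sketch, and country fixed effects are collapsed to a single intercept for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed illustrative parameters; l2 (lambda_2) is the DiD effect of interest.
xi1, xi2, l1, l2 = 5.0, 3.0, 1.0, 4.0
n = 40_000

F = rng.binomial(1, 0.5, n).astype(float)   # family background (1 = high)
T = rng.binomial(1, 0.5, n).astype(float)   # survey (1 = secondary school)
I = rng.binomial(1, 0.5, n).astype(float)   # early-tracking regime

# Family background coefficient by cell, as in M1.
gamma = xi1 + xi2 * T + l1 * I + l2 * T * I
y = 500.0 + 10.0 * T + 5.0 * T * I + gamma * F + rng.normal(0.0, 20.0, n)

# Pooled regression with all two-way terms and the three-way interaction.
X = np.column_stack([np.ones(n), T, I, T * I, F, F * T, F * I, F * T * I])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef[-1])   # estimate of lambda_2, close to the assumed value 4
```

The coefficient on the F·T·I term recovers \(\lambda_2\), the extra growth of the family background gap under early tracking.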
A more flexible model, which includes country/time fixed effects (Ammermueller 2013), is:

Model M2

$$Y_{itc} = \alpha_{0tc} + \left(\xi_{1c} + \left(\xi_{2} + \lambda_{2}I_{c}\right)T\right)F_{ic} + X^{\prime}_{itc}\beta + \varepsilon_{itc}$$
Here the intercept \(\alpha_{0tc}\) may vary freely across countries and over time, and is estimated as a fixed effect by including country-time dummy variables. Family background coefficients in primary school are also unconstrained and estimated as fixed effects (hence \(\gamma_{1c} = \xi_{1c}\)); their variation between t = 1 and t = 2 depends only on the tracking regime. Coefficients at t = 2 are \(\gamma_{2c} = \xi_{1c} + (\xi_{2} + \lambda_{2} I)\). The underlying assumptions are weaker in M2 than in M1, because unobserved country characteristics are allowed to affect family background inequalities at t = 1; the change in family background inequalities between t = 1 and t = 2, however, may vary across countries only with the tracking regime I.
Since in these models the relation between \(\gamma_{2}\) and \(\gamma_{1}\) can be expressed as:

$$\gamma_{2c} = \gamma_{1c} + \xi_{2} + \lambda_{2}I_{c}$$

(in M1 this further simplifies, since \(\gamma_{1c} = \xi_{1} + \lambda_{1}I_{c}\)), the parameter of main interest \(\lambda_{2}\) corresponds to the standard difference-in-differences definition:

$$\lambda_{2} = \left(\gamma_{2} - \gamma_{1} \mid I = 1\right) - \left(\gamma_{2} - \gamma_{1} \mid I = 0\right)$$

representing the double difference in the family background regression coefficients between the two surveys (i.e. between secondary and primary schooling) and between tracked (I = 1) and untracked (I = 0) educational systems; it can also be interpreted as \(E(\gamma_{2} \mid \gamma_{1}, I = 1) - E(\gamma_{2} \mid \gamma_{1}, I = 0)\).
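As a worked example of this double difference, suppose the four estimated family background coefficients were as follows (hypothetical numbers, not estimates from any actual assessment):

```python
# Hypothetical estimated family background coefficients (illustrative only):
g1_tracked, g2_tracked = 20.0, 35.0       # early tracking systems (I = 1)
g1_untracked, g2_untracked = 21.0, 27.0   # late tracking systems (I = 0)

# lambda_2 as the double difference of the coefficients:
lambda2 = (g2_tracked - g1_tracked) - (g2_untracked - g1_untracked)
print(lambda2)   # 9.0
```

Here the background gap grows by 15 points under early tracking but only by 6 under late tracking, so \(\lambda_2\) = 9.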
Validity of pooled individual difference-in-differences models
In this section, we discuss the validity of the results delivered by pooled individual difference-in-differences models when evaluating the effect of institutional characteristics on learning inequalities. First, we review the scaling issue arising when comparing the achievement of children in different assessments; second, we focus on its consequences for the difference-in-differences models employed in the existing literature.
The core question when evaluating the effect of early tracking on family background inequalities with difference-in-differences strategies is: do family background differentials in achievement increase more (or decrease less) in tracked systems relative to untracked systems? Hence, we face the problem of assessing how inequalities develop as children grow older in different educational systems. We start by noting that we will not address issues related to the tests’ constructs. Scholars usually use TIMSS math test scores in 4th and 8th grade, designed by IEA^{Footnote 3} to measure curricular competencies, or PIRLS and PISA reading test scores, which, despite being administered by different agencies (IEA and OECD), are considered to follow similar constructs (Zuckerman et al. 2013).
Instead, we focus on the fact that test scores in international assessments are not vertically equated, i.e. achievement is not measured on the same scale at different grades. As discussed by Bond and Lang (2013), scaling issues in test scores make it difficult to analyze the development of average test score differentials over time. Our rationale is the following: if expressed in different metrics, cross-sectional regression coefficients are not comparable across surveys, and their difference \(\left(\gamma_{2} - \gamma_{1}\right)\) is meaningless. We show this rather trivial point below, based on a stylized structural achievement growth model. We then reinterpret under these lenses the results delivered by the difference-in-differences strategies based on individual pooled models. For some reason, the scaling issue has been ignored in this literature: we presume the implicit assumption is that double differencing makes the scaling issue disappear. We will show that this is generally not the case.
Comparing achievement inequalities as children grow: the scaling issue
To analyze the evolution of inequalities at different stages of schooling, we have to compare test score inequality measures across assessments administered to children of different ages. A relevant distinction in this case is between vertically equated and non-equated tests. In equated tests, some items appear in both assessments, allowing their “anchoring” (Bond and Lang 2013). This makes it possible to express test scores in a common metric and to evaluate achievement growth. However, international assessments held at different grades/ages are not equated. As a result, as we discuss below, comparing achievement inequalities over time is generally not meaningful with the original test scores delivered by the international agencies (standardized across countries), and conveys only limited information on the evolution of inequalities when using within-country standardized test scores (produced by standardizing original scores relative to each country’s mean and SD).
Consider a stylized model of learning development according to which abilities of a cohort of children cumulate over time, so that achievement at time t equals achievement at time t − 1 plus a growth component (Contini and Grand 2017). This can be viewed as an ideal model of cognitive ability, assuming that ability can be measured on a meaningful interval scale and that it evolves linearly. Initial ability and growth may be affected by ascribed individual characteristics such as family background (e.g. socioeconomic status, minority, ethnic or immigrant origin) or gender.
Suppose we have two cross-sectional surveys assessing students’ learning in a given country at different stages of the educational career, t = 1 and t = 2. To keep the formalization as simple as possible, we posit no measurement error, so that test scores are perfect measures of cognitive ability.
Same scale
Assume that test scores are measured on the same scale in the two assessments. Let \(y_{i2}\) be the score of individual i at t = 2 and \(y_{i1}\) her score at t = 1. To simplify the exposition, we refer to a single explanatory variable F (but clearly other individual controls should be included) and assume that:

$$y_{i1} = \rho F_{i} + \varepsilon_{i1} \quad (6)$$
In our current example, F is an indicator of family background, with F = 1 for high and F = 0 for low family background. Achievement at t = 2 is given by achievement at t = 1 plus achievement growth \(\delta\):

$$y_{i2} = y_{i1} + \delta_{i} \quad (7)$$
Growth may be assumed to depend linearly on the explanatory variables and may also depend on previous achievement:

$$\delta_{i} = \beta F_{i} + \theta y_{i1} + \varepsilon_{i2} \quad (8)$$
\(\beta\) measures whether children of high background improve or worsen their performance between t = 1 and t = 2 relative to equally performing children of low background at t = 1: we call these “new inequalities” developed between the two assessments. Instead, \(\theta\) captures carry-over effects of pre-existing inequalities. The total effect of family background operating between t = 1 and t = 2 is \(\left( {\beta + \rho \theta } \right)\), the sum of the direct effect (given previous achievement) and the indirect effect operating via previous achievement.
With longitudinal data it is possible to evaluate achievement growth for each child, estimate model (8) and identify the structural parameters \(\beta\) and \(\theta\), disentangling the two different mechanisms at play in the development of inequalities over time. With cross-sectional data, the longitudinal model cannot be estimated and the structural parameters \(\beta\) and \(\theta\) are not identified.^{Footnote 4} Nonetheless, the total effect \(\left( {\beta + \rho \theta } \right)\) is still identified. Consider the cross-sectional model at t = 2:
$$y_{i2} = \alpha_{2} + \gamma_{2} F_{i} + e_{i2}$$
The cross-sectional coefficient at t = 2 is:
$$\gamma_{2} = \beta + \left( {1 + \theta } \right)\rho$$
and the difference between the coefficients at t = 2 and t = 1 is \(\left( {\beta + \left( {1 + \theta } \right)\rho } \right) - \rho = \beta + \rho \theta\).
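The algebra can be checked numerically by simulating the stylized model; all parameter values below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Simulate the stylized growth model and verify that the difference between
# the two cross-sectional family-background coefficients recovers beta + rho*theta.
rng = np.random.default_rng(0)
n = 200_000
rho, beta, theta = 20.0, 5.0, 0.2                 # hypothetical parameters

F = rng.integers(0, 2, n).astype(float)           # family background (0/1)
y1 = 100 + rho * F + rng.normal(0, 30, n)         # achievement at t = 1
delta = beta * F + theta * y1 + rng.normal(0, 10, n)  # growth between waves
y2 = y1 + delta                                   # achievement at t = 2

# Cross-sectional F-coefficients at each wave (group-mean contrast,
# equivalent to OLS with a single binary regressor)
g1 = y1[F == 1].mean() - y1[F == 0].mean()
g2 = y2[F == 1].mean() - y2[F == 0].mean()

print(round(g2 - g1, 1))   # close to beta + rho*theta = 9.0
```

With same-scale scores, the difference between the two cross-sectional F-coefficients recovers \(\beta + \rho \theta\) (here 5 + 20 × 0.2 = 9), as derived above.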
Different scales
International learning assessments, like many national surveys, are cross-sectional, and achievement is measured on different scales as children grow older. Moreover, test scores are not vertically equated. We thus have to distinguish between the observed scores \(y^{\prime}_{1}\) and the (unknown) scores \(y_{1}\) representing achievement at t = 1 on the scale employed at t = 2. In this case, the difference between the cross-sectional regression coefficients at t = 2 and t = 1 does not identify \(\beta + \rho \theta\). Assume for simplicity a linear relation between the two scales (where \(\varphi\) and \(\omega\) are unknown and unidentifiable):
$$y_{i1} = \varphi + \omega y^{\prime}_{i1}$$
from (6) we obtain the model relating observed scores \(y^{\prime}_{i1}\) to family background \(F\):
$$y^{\prime}_{i1} = \frac{\alpha_{1} - \varphi }{\omega } + \frac{\rho }{\omega }F_{i} + \frac{e_{i1} }{\omega }$$
so the estimable F-regression coefficient at t = 1 is \(\frac{\rho }{\omega }\) and represents the total family background differential developed up to t = 1 in the metric of the first assessment. The coefficient at t = 2 is given by (9). Patently, if \(\omega \ne 1\), the difference between the estimable cross-sectional regression coefficients at the two assessments:
$$\gamma_{2} - \gamma^{\prime}_{1} = \beta + \left( {1 + \theta } \right)\rho - \frac{\rho }{\omega }$$
differs from \(\beta + \rho \theta\) and delivers meaningless results.
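The distortion can be illustrated by rescaling the t = 1 scores in a simulation of the stylized model (ω = 2 and φ = 50 are arbitrary illustrative values, as are the other parameters):

```python
import numpy as np

# Same stylized growth model, but the t=1 scores are observed on a different
# (linearly related) scale: y1_obs = (y1 - phi)/omega.
rng = np.random.default_rng(1)
n = 200_000
rho, beta, theta = 20.0, 5.0, 0.2
omega, phi = 2.0, 50.0

F = rng.integers(0, 2, n).astype(float)
y1 = 100 + rho * F + rng.normal(0, 30, n)
y2 = y1 + beta * F + theta * y1 + rng.normal(0, 10, n)
y1_obs = (y1 - phi) / omega                  # scores in the t=1 metric

g1_obs = y1_obs[F == 1].mean() - y1_obs[F == 0].mean()   # ~ rho/omega = 10
g2 = y2[F == 1].mean() - y2[F == 0].mean()               # ~ 29

# Naive difference mixes two metrics: close to 19 instead of beta + rho*theta = 9
print(round(g2 - g1_obs, 1))
```

The naive difference of cross-sectional coefficients is now roughly 19 rather than the true total effect of 9, because the two coefficients are expressed in different metrics.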
Standardized test scores
The most common strategy adopted in the existing literature to overcome the difficulties in comparing test scores measured on different scales is to standardize scores and compare the average z-scores of individuals of different backgrounds as children age (e.g. Fryer and Levitt 2004; Goodman et al. 2009; Reardon 2011; Jerrim and Choi 2013). If we want to analyze the development of family background inequalities in a given country, the standardization is carried out relative to the country mean and standard deviation. In a regression framework, this amounts to comparing regression coefficients of models run on within-country standardized scores. Being invariant to the score metric, these quantities are comparable:
$$\tilde{\gamma }_{1} = \frac{\gamma^{\prime}_{1} }{\sigma^{\prime}_{1} } = \frac{\rho }{\sigma_{1} } \qquad\qquad \tilde{\gamma }_{2} = \frac{\gamma_{2} }{\sigma_{2} }$$
The difference between (13) and (14) indicates how many standard deviations apart two individuals of different family backgrounds are at t = 2 as compared to t = 1. However, within-country variability is generally not the same across assessments, so this difference also depends on the standard deviations. As a result, the sources of the observed change are unclear. Children’s achievement is not influenced by family background alone: if in a country the test scores’ variability increases because of growing differentials related to other characteristics (e.g. increasing gender inequalities), we could observe decreasing family background inequalities even if \(\theta > 0\) and \(\beta > 0\).^{Footnote 5} Hence, even if the comparison of regression coefficients on standardized scores is not meaningless, their difference does not identify \(\beta + \rho \theta\) and is not fully informative on how inequalities related to family background evolve over time.
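A sketch of this pitfall under the stylized model (all parameters are hypothetical; the gender differential at t = 2 is deliberately large to inflate the t = 2 standard deviation):

```python
import numpy as np

# Raw family-background gaps grow between waves (beta > 0, theta > 0), yet
# the *standardized* gap shrinks because a growing gender differential
# inflates the t=2 standard deviation.
rng = np.random.default_rng(2)
n = 500_000
rho, beta, theta = 20.0, 5.0, 0.2

F = rng.integers(0, 2, n).astype(float)   # family background
G = rng.integers(0, 2, n).astype(float)   # gender
y1 = 100 + rho * F + rng.normal(0, 80, n)
y2 = y1 + beta * F + theta * y1 + 160.0 * G + rng.normal(0, 40, n)

def std_gap(y, F):
    return (y[F == 1].mean() - y[F == 0].mean()) / y.std()

raw1 = y1[F == 1].mean() - y1[F == 0].mean()   # ~20
raw2 = y2[F == 1].mean() - y2[F == 0].mean()   # ~29: raw gap widens
z1, z2 = std_gap(y1, F), std_gap(y2, F)        # yet z2 < z1
print(raw2 > raw1, z2 < z1)
```

The raw family-background gap widens between the two waves, yet the standardized gap declines, purely because another source of variability (here, gender) grows.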
Relating cross-sectional regression coefficients at different surveys
Once again, let us denote the family background coefficients at t = 1 and t = 2 as \(\gamma_{1}\) and \(\gamma_{2}\). Under the stylized achievement growth model (6)–(10), it is trivial to show that with same-scale scores the relation between the regression coefficients at the two cross-sectional assessments is:
$$\gamma_{2} = \gamma_{1} + k$$
because \(\gamma_{1} = \rho\) and \(\gamma_{2} = \beta + \left( {1 + \theta } \right)\rho = \rho + \left( {\beta + \rho \theta } \right)\), so \(k = \beta + \rho \theta\).
Instead, with different-scale test scores:
$$\gamma_{2} = \omega \gamma_{1} + k$$
because \(\gamma_{1} = \frac{\rho }{\omega }\) and \(\gamma_{2} = \rho + \left( {\beta + \rho \theta } \right) = \omega \frac{\rho }{\omega } + \left( {\beta + \rho \theta } \right) = \omega \gamma_{1} + k\).
This result is crucial: the relation (4) implied between the F-coefficients in pooled individual difference-in-differences models M1 and M2 corresponds to the case where \(\omega = 1\), and thus is not valid in the different-scale case occurring for international learning assessments.
The scaling issue in difference-in-differences pooled individual models
We have shown above that the difference between the regression coefficients \(\gamma_{2}\) and \(\gamma_{1}\) is generally meaningless when test scores are not equated. In the previous section we reviewed the difference-in-differences strategies employed in the literature on educational inequalities and highlighted that, in essence, individual pooled models identify the effect of early tracking or other institutional features on family background inequalities by taking the (double) difference of cross-sectional regression coefficients relative to assessments administered at different children’s ages.
The key question now is: Does the double differentiation of regression coefficients solve the scaling problem? Starting from the stylized achievement growth model presented above, we now show that the answer is no.
To fix ideas, for a specific cohort of children think of PIRLS (4th grade) as the assessment at t = 1 and PISA (age 15) as the assessment at t = 2.^{Footnote 6} Data are cross-sectional and test scores are not equated. Following the structural model, achievement depends on family background at t = 1 and t = 2 according to (9) and (11). Thus, the regression coefficients, in the most general setting variable across countries, may be expressed as:
$$\gamma_{1c} = \frac{\rho_{c} }{\omega } \qquad\qquad \gamma_{2c} = \beta_{c} + \left( {1 + \theta_{c} } \right)\rho_{c}$$
where \(\omega\) reflects the different scale used to measure test scores in the two surveys.
Difference-in-differences with model M1
In model M1, regression coefficients are allowed to vary across countries only according to the tracking system. Recall that in this case DID amounts to:
$$DID = \left( {E\left( {\gamma_{2} \mid I = 1} \right) - E\left( {\gamma_{2} \mid I = 0} \right)} \right) - \left( {E\left( {\gamma_{1} \mid I = 1} \right) - E\left( {\gamma_{1} \mid I = 0} \right)} \right)$$
where \(\gamma_{1}\) and \(\gamma_{2}\) are the cross-sectional regression coefficients of family background in the two assessments, \(I = 1\) represents early tracking systems and \(I = 0\) late tracking systems. Substituting the structural parameters (17) into this expression, and recalling that M1 assumes that the coefficients vary across countries only according to the tracking regime, we obtain:
$$DID = \left[ {\left( {\beta_{I = 1} + \left( {1 + \theta_{I = 1} } \right)\rho_{I = 1} } \right) - \left( {\beta_{I = 0} + \left( {1 + \theta_{I = 0} } \right)\rho_{I = 0} } \right)} \right] - \frac{1}{\omega }\left[ {\rho_{I = 1} - \rho_{I = 0} } \right]$$
The first term in square brackets is the difference between the regression coefficients in tracked and untracked systems in secondary school; the second one is the difference between the regression coefficients in tracked and untracked systems in primary school. This expression delivers meaningful results only in very peculiar circumstances: (i) in the fortuitous case that the different scales employed to measure achievement in the two assessments were additively related (\(\omega = 1)\); (ii) in the fortuitous case that the degree of inequality at t = 1 happened to be equal in tracked and untracked systems \((\rho_{I = 1} = \rho_{I = 0} )\); (iii) in the fortuitous case that the degree of inequality at t = 2 happened to be equal in tracked and untracked systems, i.e. if \(\beta_{I = 1} + \left( {1 + \theta_{I = 1} } \right)\rho_{I = 1} = \beta_{I = 0} + \left( {1 + \theta_{I = 0} } \right)\rho_{I = 0} .\) In general, however, the effect of tracking ends up being estimated by the difference between non-comparable quantities: the double differentiation does not solve the scaling problem.^{Footnote 7}
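A small simulation can illustrate the point. In the hypothetical setup below, tracked and untracked systems share identical growth parameters between the two waves but differ in t = 1 inequality (ρ), and the t = 1 scores are observed on a different scale (ω = 2); all values are illustrative:

```python
import numpy as np

# Double differencing does not undo the scaling problem: compare the DID
# computed on observed (rescaled) t=1 scores with its same-scale counterpart.
rng = np.random.default_rng(3)
n = 100_000
beta, theta, omega = 5.0, 0.2, 2.0

def country(rho):
    F = rng.integers(0, 2, n).astype(float)
    y1 = 100 + rho * F + rng.normal(0, 30, n)
    y2 = y1 + beta * F + theta * y1 + rng.normal(0, 10, n)
    gap = lambda y: y[F == 1].mean() - y[F == 0].mean()
    return gap(y1 / omega), gap(y1), gap(y2)   # observed g1, true-scale g1, g2

tracked   = [country(rho=25.0) for _ in range(4)]
untracked = [country(rho=15.0) for _ in range(4)]
m = lambda xs, i: np.mean([x[i] for x in xs])

did_obs  = (m(tracked, 2) - m(untracked, 2)) - (m(tracked, 0) - m(untracked, 0))
did_true = (m(tracked, 2) - m(untracked, 2)) - (m(tracked, 1) - m(untracked, 1))
print(round(did_obs, 1), round(did_true, 1))   # roughly 7 vs 2
```

The observed-scale DID (≈7) differs sharply from the same-scale DID (≈2); note that even the latter is nonzero here only because the larger carry-over \(\rho \theta\) in high-\(\rho\) countries enters the total effect, not because tracking affects \(\beta\) or \(\theta\).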
Difference-in-differences with model M2
In model M2, inequalities at t = 1 are unconstrained, whereas the changes occurring between t = 1 and t = 2 may only depend on the tracking regime. For this reason we let \(\rho\) vary freely across countries (indicated as \(\rho_{c}\)), but constrain \(\beta\) and \(\theta\) to depend on tracking. Substituting the corresponding regression coefficients into the expression for the standard \(DID\) we obtain:
$$DID = \left[ {\left( {\beta_{I = 1} + \left( {1 + \theta_{I = 1} } \right)E\left( {\rho_{c} \mid I = 1} \right)} \right) - \left( {\beta_{I = 0} + \left( {1 + \theta_{I = 0} } \right)E\left( {\rho_{c} \mid I = 0} \right)} \right)} \right] - \frac{1}{\omega }\left[ {E\left( {\rho_{c} \mid I = 1} \right) - E\left( {\rho_{c} \mid I = 0} \right)} \right]$$
where \(E (\rho_{c} )\) is the expected value of \(\rho\) in a given tracking regime. Once again, the estimated \(DID\) depends on the unknown scaling factor \(\omega\) and delivers meaningful results only under the fortuitous circumstances described above for M1.^{Footnote 8}
A final consideration is in order. To illustrate that double differencing does not solve the scaling problem we have relied on the restrictive stylized achievement growth model, but we believe our conclusions are far more general. In essence, we have shown that the flaws of individual pooled models M1–M2 are due to the fact that they implicitly assume the same scaling, and that double differencing does not help: if these considerations apply to a simple model, they are very likely to hold also under more complex ones.
Two-step estimation
We have shown that under the stylized achievement growth model described above, the relation between the F-regression coefficients is of the type \(\gamma_{2} = \omega \gamma_{1} + k\), but M1 and M2 implicitly impose \(\omega = 1\). One might consider a more flexible two-level model—let us call it M3—not imposing this restriction, with an individual-level model for each country and assessment, and a country-level model relating regression coefficients and institutional characteristics. In this sense, this model could be conceived as a “generalized” difference-in-differences model. An additional advantage of this specification is its transparency: the first- and second-step models are simple, their underlying assumptions are clear, and the interpretation of the results is straightforward.
Model M3
The coefficients of the individual-level model of test scores \(Y\) are allowed to vary freely across countries and across assessments held at different stages of schooling:
$$y_{ict} = \alpha_{ct} + \gamma_{ct} F_{ict} + \lambda^{\prime}_{ct} X_{ict} + e_{ict}$$
The regression coefficients of family background at the two assessments may depend on institutional characteristics and are related by a simple country-level linear model:
$$\gamma_{2c} = a + d I_{c} + b \gamma_{1c} + u_{c} \quad \quad (21a)$$
where \(u\) captures country-level unobserved factors affecting inequalities developing between t = 1 and t = 2, assumed to be uncorrelated with the tracking regime, represented as before by a binary indicator I. In order to allow institutional effects to vary with previous inequalities, the model could also include an interaction term:
$$\gamma_{2c} = a + d I_{c} + b \gamma_{1c} + g I_{c} \gamma_{1c} + u_{c} \quad \quad (21b)$$
The effect of tracking is \(d + g\gamma_{1}\) (reducing to \(d\) in the case of no interaction), i.e. the average difference in the family-background coefficients at t = 2 between tracked and untracked systems, given the corresponding coefficient at t = 1. This is consistent with the non-standard \(DID\) definition:
$$E\left( {\gamma_{2} \mid \gamma_{1} ,I = 1} \right) - E\left( {\gamma_{2} \mid \gamma_{1} ,I = 0} \right)$$
previously employed in Hanushek and Woessmann (2006) to analyze the effect of early tracking on overall inequalities, conceived as test scores’ variability. The identifying assumption is that inequality changes between t = 1 and t = 2 depend only on the tracking regime or on other system-level features not correlated with the tracking regime. Clearly, the salience of this approach depends on the existence of sufficient cross-country variability in \(\gamma_{{1{\text{c}}}}\) and a substantial overlap of the distributions of \(\gamma_{{1{\text{c}}}}\) between the subgroups of countries identified by I = 0 and I = 1.^{Footnote 9}
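Writing the interacted country-level model in the form \(\gamma_{2c} = a + dI_{c} + b\gamma_{1c} + gI_{c}\gamma_{1c} + u_{c}\) (a sketch consistent with the stated tracking effect), the non-standard DID follows directly:

```latex
\begin{aligned}
E(\gamma_{2}\mid \gamma_{1}, I=1) - E(\gamma_{2}\mid \gamma_{1}, I=0)
  &= \bigl(a + d + (b+g)\,\gamma_{1}\bigr) - \bigl(a + b\,\gamma_{1}\bigr)\\
  &= d + g\,\gamma_{1}
\end{aligned}
```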
Estimation of model M3 can be carried out in two steps, as in Hanushek and Woessmann (2006).
Step 1 In the first step, the family-background regression coefficients in (20) are estimated with individual-level models separately for each country and assessment, so no a priori restrictions are imposed on these coefficients over time or across countries. Since country samples are large, first-step estimation usually delivers highly reliable estimates. As this specification also allows the coefficients of the control variables to vary across countries, the F-coefficients are more likely to be valid estimates of the true family-background net effect than in pooled models M1–M2.
Step 2 In the second step, the relation between the family-background regression coefficients and institutions is estimated with a simple linear model at the country level, as in (21a) or (21b). Notice that in principle second-step models can take any functional form and include other country-level explanatory variables as controls. Yet, due to the small sample size, simple models with few parameters should be employed in practice. Another condition for delivering reliable estimates of (21a) and (21b) is the existence of sufficient variability in the \(\gamma_{{1{\text{c}}}}\) distributions within institutional regimes.^{Footnote 10} Notice that a major criticism sometimes directed at the two-step strategy is that second-step estimation is usually performed on small samples. However, although less explicit, this problem also holds for individual-data pooled models, as the relevant sample size for the estimation of regression coefficients of country-level explanatory variables is the number of countries (Wooldridge 2010; Bryan and Jenkins 2016).
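The two steps can be sketched on synthetic data (the data-generating process, sample sizes and parameter values below are illustrative assumptions, not the paper’s estimation code, which handles plausible values and replicate weights):

```python
import numpy as np

# Two-step sketch. Step 1: within each country and assessment, OLS of the
# test score on family background F. Step 2: country-level OLS of gamma_2
# on the tracking dummy I and gamma_1.
rng = np.random.default_rng(4)
n_countries, n = 24, 20_000
I = np.repeat([1.0, 0.0], n_countries // 2)      # 12 tracked, 12 untracked
g1 = np.empty(n_countries)                       # first-step F-coefficients, t=1
g2 = np.empty(n_countries)                       # first-step F-coefficients, t=2

for c in range(n_countries):
    rho = rng.uniform(10, 30)                    # country-specific t=1 gap
    gap2 = rho + 20.0 * I[c]                     # tracking widens t=2 gaps by 20
    for gap, store in ((rho, g1), (gap2, g2)):
        F = rng.integers(0, 2, n).astype(float)
        y = 500.0 + gap * F + rng.normal(0, 90, n)
        X = np.column_stack([np.ones(n), F])
        store[c] = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Step 2: gamma_2c = a + d*I_c + b*gamma_1c + u_c
Z = np.column_stack([np.ones(n_countries), I, g1])
a_hat, d_hat, b_hat = np.linalg.lstsq(Z, g2, rcond=None)[0]
print(round(d_hat, 1))   # close to the true tracking effect of 20
```

Because \(\gamma_{2}\) is regressed on \(\gamma_{1}\) rather than differenced against it, rescaling the t = 1 scores by an arbitrary ω would change the slope estimate but leave the tracking coefficient a valid effect estimate.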
Difference-in-differences definitions
Even if model M3 is more general than M1 and M2, the conclusion that when test scores are not on the same scale the standard \(DID = (E(\gamma_{2} \mid I = 1) - E\left( {\gamma_{2} \mid I = 0} \right)) - (E(\gamma_{1} \mid I = 1) - E\left( {\gamma_{1} \mid I = 0} \right))\) delivers meaningless results holds true also for M3.
The advantage of M3 is that here the identification of the effect of early tracking is reached by estimating the non-standard DID, \(E(\gamma_{2} \mid \gamma_{1} ,I = 1) - E\left( {\gamma_{2} \mid \gamma_{1} ,I = 0} \right)\), in second-step models (21a) or (21b), which directly relate the regression coefficients at t = 2 to the tracking regime and the regression coefficients at t = 1. The different scaling is no longer a problem, because in regression models there is no need for the dependent and independent variables to be on the same scale (unless a priori restrictions on the coefficient linking them are imposed, as is implicit in M2).
Identification of the structural parameters?
While the results described so far are very general, under the strict assumptions of the stylized achievement growth model we can take some steps further and derive some conclusions on the mechanisms at play. According to (17), the following holds:
$$\gamma_{2c} = \beta_{c} + \omega \left( {1 + \theta_{c} } \right)\gamma_{1c} \quad \quad (22)$$
Thus \(\beta\) is related to the intercept and \(\theta\) to the slope. Let us allow \(\gamma_{{2{\text{c}}}}\) to vary with the tracking regime, according to (21a) and (21b):
$$\gamma_{2c} = a + d I_{c} + b \gamma_{1c} + u_{c} \qquad \text{or} \qquad \gamma_{2c} = a + d I_{c} + b \gamma_{1c} + g I_{c} \gamma_{1c} + u_{c}$$
Equation (22) is consistent with the first specification if on average \(\beta\) (new family background inequality developed between t = 1 and t = 2) varies across countries with the tracking regime and \(\theta\) (carryover effect of previously established inequalities) does not vary with the tracking regime. It is consistent with the second if both \(\beta\) and \(\theta\) vary with the tracking regime.
Thus, in principle, second-step estimation allows us to draw some conclusions on the mechanisms underlying how family background inequalities change over time. More specifically, a resulting \(d \ne 0\) suggests that \(\beta\) varies between tracked and untracked regimes, whereas \(g \ne 0\) suggests that \(\theta\) varies between tracked and untracked regimes. In fact, even if \(E\left( \theta \right)\) is not identified when \(\omega\) is unknown (i.e. when tests are not equated), \(\omega E\left( \theta \right)\) is identified, and \(\omega \left( {1 + E\left( {\theta_{c} \mid I = 1} \right)} \right) \gtrless \omega \left( {1 + E\left( {\theta_{c} \mid I = 0} \right)} \right)\) implies \(E\left( {\theta_{c} \mid I = 1} \right) \gtrless E\left( {\theta_{c} \mid I = 0} \right)\).
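Under the stylized model, the sign comparison can be sketched as follows (assuming, as in the text, a scaling factor ω common across countries and positive):

```latex
\begin{aligned}
\gamma_{2c} &= \beta_{c} + \omega\,(1+\theta_{c})\,\gamma_{1c}
  && \text{(combining } \gamma_{1c}=\rho_{c}/\omega \text{ and } \gamma_{2c}=\beta_{c}+(1+\theta_{c})\rho_{c}\text{)}\\
b_{I} &= \omega\,\bigl(1+E(\theta_{c}\mid I)\bigr)
  && \text{(second-step slope within regime } I\text{)}\\
b_{1} \gtrless b_{0} &\iff E(\theta_{c}\mid I=1) \gtrless E(\theta_{c}\mid I=0)
  && \text{(since } \omega > 0\text{)}
\end{aligned}
```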
Given the dependence of this result on restrictive assumptions, caution is advised when interpreting two-step results in this manner, as the linear specification may be only a convenient approximation of a potentially more complex relation between previous and later achievement gaps. In addition, the intercept’s estimate is usually unstable with small sample sizes, as occurs in cross-country models relying on a limited number of countries.
Empirical analysis
Based on the methodological considerations developed in the previous sections, we now illustrate the analysis of the effect of early tracking on family background inequalities with two-step estimation, exploiting the international surveys on reading literacy PIRLS 2006 and PISA 2012. PIRLS tests children attending 4th grade (age 9–10), while PISA focuses on 15-year-old children. The time span between these surveys is approximately equal to the distance between age 9–10 and age 15, so PIRLS 2006 and PISA 2012 can be thought of as independent samples of a single birth cohort observed over time.
Following Abadie et al. (2015), who argue that a careful choice of countries is necessary to reduce the risk of unobserved country-level confounding factors, we consider only European and Anglo-Saxon countries, as they share comparable schooling systems, societal organization and cultures, ending up with 24 countries participating in both assessments.
By tracking, we refer to the formal sorting of students into schooling institutions providing different academic content and learning targets, while we do not consider other forms of differentiation such as within-school ability-related streaming. We define countries as “tracked” if this sorting of regular students takes place up to age 15, and as “untracked” otherwise. In our sample, we have 12 tracked and 12 untracked countries (Table 1). However, we also carry out robustness checks with alternative tracking variables: a dummy classifying countries tracking at age 15 as untracked (since tracking has taken place very recently) and the number of years since tracking.
In the empirical analyses, we focus on native children. The reason is twofold. First, we wish to avoid introducing an additional source of cross-country heterogeneity due to the different composition of the immigrant-background population in terms of countries of origin, immigration waves and socioeconomic fabric, and to the linguistic distance between countries of origin and destination. Moreover, as highlighted in Jakubowski (2010), some migrants were not in the country of the test in 4th grade or were not fully exposed to the country’s schooling system, so the results from the two assessments may not be fully comparable. Second, the relationship between social background and immigrant background educational inequalities is weak: countries with low social background inequalities often display large immigrant-background-specific penalties (i.e. controlling for social background; Borgna and Contini 2014). In this light, analyzing only native children has the advantage of avoiding confounding the effects of early tracking on social background inequalities with its specific effects on the immigrant-background population.
In line with the methodological considerations developed in the previous sections, we apply two-step analysis to family background inequalities, but we also analyze overall inequalities (replicating the analyses carried out by Hanushek and Woessmann (2006) on more recent data and a different set of countries). In the first step, for each country and assessment we estimate the test scores’ standard deviations and the family background regression coefficients with model (20). We include two variables measuring family background, related to cultural capital: the log-number of books, regarded in the literature as the best single predictor of student performance (Hanushek and Woessmann 2011), and a binary variable indexing whether at least one parent has tertiary education.^{Footnote 11} We also control for gender and age (see Appendix A for the definition of the individual-level variables). In the second step, we analyze the relationship between estimated inequalities at t = 2 and the tracking regime, given inequalities at t = 1.
First-step results: preliminary findings
First-step regressions are run with R routines designed to handle plausible values and complex sampling (Caro and Biecek 2017), using student replicate weights.^{Footnote 12}
To capture the effect of tracking on family background inequalities, instead of looking separately at the two explanatory variables, we focus on a linear combination of the coefficients of the log-number of books and the parental education dummy, highlighting the effect of tracking on the test-score differential between children with tertiary-educated parents and “many” books (n = 500), and children with non-tertiary-educated parents and “few” books (n = 5). If \(c_{1}\) and \(c_{2}\) are the estimates of the two coefficients, in the tables below we report \(c_{1} \left( {\ln \left( {500} \right) - \ln \left( 5 \right)} \right) + c_{2}\) and call it the F-gap.
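For hypothetical first-step estimates \(c_{1}\) and \(c_{2}\) (the values below are illustrative only, not estimates from the paper), the F-gap is computed as:

```python
import math

# F-gap: test-score differential between children with tertiary-educated
# parents and 500 books vs. non-tertiary-educated parents and 5 books.
c1, c2 = 15.0, 25.0                               # hypothetical coefficients
f_gap = c1 * (math.log(500) - math.log(5)) + c2   # = c1*ln(100) + c2
print(round(f_gap, 1))                            # approximately 94.1
```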
Focusing on overall inequality, we find that on average the SD at t = 1 (PIRLS) is slightly larger in untracked than in tracked countries, whereas the relation reverses at t = 2 (PISA), where tracked countries display larger values (Table 2). A similar pattern holds when looking at family background inequalities: the average achievement gap between high and low strata (the F-gap) is nearly the same at t = 1, while at t = 2 it becomes much larger in tracked countries. Acknowledging that the interval scale of test scores is sometimes questioned (Bond and Lang 2013), we also look at country rankings—from smallest to largest—obtaining similar but even more marked results.
Second-step estimation
To analyze overall inequalities, we replicate Hanushek and Woessmann’s analyses and estimate model (1), as well as an extended version of this model including an interaction term between previous inequalities and the tracking regime. To analyze family background inequalities, we estimate models (21a) and (21b), relating the country-level measures of family-background inequality at t = 2 to the tracking regime, given inequality at t = 1. Results are summarized in Table 3. The coefficients of prior inequalities are always positive, indicating that countries with high inequalities in primary school also tend to have high inequalities in secondary school.
Findings on overall inequalities—columns (1) and (2)—show that, given the SD in primary school, the SD is higher on average in tracking countries. The interaction effect is positive but not statistically significant. On average, the SD at t = 2 is 8 points higher (i.e. 8% of the average national SD) in tracked countries relative to untracked countries with the same SD at t = 1. Our results are consistent with those in Hanushek and Woessmann (2006), although they report substantially larger effects of early tracking (almost a quarter of a SD for reading literacy).
Findings on the effects of tracking on family background inequalities—columns (3) and (4)—indicate that early tracking is associated with larger inequalities. Given the educational inequality already existing in primary school, the F-gap at age 15 is on average 20.4 score units—0.204 standard deviations in the OECD distribution—higher in tracked than in untracked systems. Adding the interaction term shows that the difference between tracked and untracked countries tends to increase at higher levels of inequality at t = 1. Similar results are found when considering countries tracking at age 15 as untracked (see robustness checks in Appendix B, Table 6), whereas no interaction effect is observed when considering the number of years since tracking (Appendix B, Table 7).
In Fig. 1 we show the scatter diagram plotting observed family background inequalities (the F-gap) at t = 2 against the corresponding values at t = 1. The straight lines represent the relation predicted by tracking regime, according to the estimates of model (21b) reported in column (4) of Table 3. First, the graph shows that in primary school family background inequalities vary considerably across countries even within tracking regimes. Second, it shows that at low levels of family background inequality in primary school there is little difference in secondary school inequalities between countries with and without tracking, whereas at high levels of inequality in primary school tracked systems tend to become more unequal than untracked systems.
For illustrative purposes, we now attempt to interpret the results on the effect of tracking on family background inequalities in terms of the structural parameters of the achievement growth model (“Validity of pooled individual difference-in-differences models”). As already remarked, however, due to the restrictive underlying assumptions and the small sample size in the step-2 estimation, this structural interpretation of the results is to be taken with caution.
Since the intercept does not differ significantly from 0, we should conclude that “new inequalities” developed between primary and secondary school given prior achievement, represented by \(\beta\), are similar in tracked and untracked systems (or perhaps even lower in tracked systems, since the point estimate is negative although not statistically significant). Instead, carry-over effects of previous family background inequalities in achievement, represented by \(\theta\), seem to be larger in tracked than in untracked countries, as implied by the substantially higher slope estimated for the former. In other words, according to this interpretation, the reason why family background inequalities tend to widen between primary and secondary school in tracked systems relative to untracked systems is that the gap between well- and poor-performing children in primary school (already socially determined) widens more in the former than in the latter. This seems reasonable: in tracked systems, well-performing children attend the academic track, as opposed to more labor-market-oriented schools, more often than low-performing children, although with different probabilities across family backgrounds. Thus, the gap between well- and poor-performing children may widen more sharply in these countries than in comprehensive school systems.
Difference-in-differences with pooled individual regression models
For illustrative purposes, we also show the results of difference-in-differences estimation on pooled-countries individual models M1 and M2, with the tracking regime as the variable of main interest and gender and age as controls. The model was run on a total of 240,273 individuals taking either the PIRLS or the PISA test in the 24 countries of Table 1. The DID estimate turns out to be 22.50 (significant at the 0.10 level) for M1 and 24.83 (significant at the 0.05 level) for M2. Interestingly, these estimates are not sharply different from the value of 20.40 delivered by the two-step estimation of model (21a) and shown in Table 3, column (3). The reason is that in this particular case inequalities at t = 1 are very similar on average in the two regimes: as shown in Table 2, the mean F-gap is 83.5 points in tracked countries and 83.4 in untracked countries. Thus, in this particular case we fall into one of the fortuitous circumstances thoroughly discussed in “The scaling issue in difference-in-differences pooled individual models”, where the results of M1 and M2—although delivered by unnecessarily restrictive and scarcely transparent models—happen not to be meaningless, as the estimated DID ends up being expressed in the metric of the test scores at t = 2.
Discussion and conclusions
This article aims at contributing to the literature reflecting on the correct use of international learning assessments in econometric modelling (e.g. Jerrim et al. 2017). The specific purpose of this paper is to provide an in-depth analysis of difference-in-differences strategies aimed at evaluating the effect of institutional features on learning inequalities, exploiting international assessments administered to children of different ages. In the existing literature, difference-in-differences has been carried out with two-step estimation by Hanushek and Woessmann (2006), who analyzed the effect of early tracking on overall inequalities (captured by test score variability indexes). Other scholars, instead, analyzed the effect of early tracking and other features of the educational system on family background inequalities (captured by family-background regression coefficients), using individual-level models on data pooled from different countries and different assessments. We demonstrate that the scaling issues entailed by using non-equated test scores at different stages of schooling may severely undermine the validity of the results delivered by difference-in-differences pooled individual-level models. Scaling issues do not apply, instead, to two-step estimation. Hence, provided that difference-in-differences is considered a valid strategy for the problem at stake, we view two-step estimation as a much better alternative to pooled models’ estimation. Our methodological discussion can be extended to different institutional effects^{Footnote 13} and different research areas. For example, the scaling issue may be relevant when analyzing the impact of policies on fundamental individual characteristics changing over the life course other than learning—for example, health, well-being or life satisfaction—for which different measurement tools are needed as people grow from early childhood to adulthood (Lippman et al. 2011).
In the empirical section of the paper, we analyze the effect of early tracking on inequalities in reading literacy. Consistent with the methodological discussion, we apply two-step analysis to both overall achievement inequalities and family background inequalities. We find that, given inequality in primary school, inequalities in secondary school are substantially larger in early-tracking than in late-tracking countries. When focusing on family background inequalities, we find that the difference between tracking regimes increases with inequality in primary school: early tracking seems to be detrimental to equity particularly in countries with strong inequalities already existing in primary school. Results on overall inequalities (measured by test scores’ standard deviations) go in the same direction, but are somewhat weaker. Altogether, our evidence is that early tracking increases achievement inequalities, in particular by widening the difference between children of different social origin. Pushing our conclusions further, there is some evidence that the reason why family background inequalities tend to widen in tracked relative to untracked systems between primary and secondary school is not a larger gap developing within this time span between previously equally performing children of different social origin, but rather different carry-over effects of inequalities already existing in primary school. More research is needed to confirm these findings and provide a fully convincing interpretation for them.
A remark on the limitations of policy evaluations based on cross-country analyses is also in order. In general, results are not easily interpretable in causal terms. The main reason is that countries vary on a multitude of characteristics, so it is difficult to ‘hold other things constant’. This criticism applies in particular to conventional cross-sectional analyses but, despite the milder conditions required, it may also be directed at difference-in-differences models. Another reason is sample size: identification of policy effects is reached by exploiting cross-country variability in institutional variables, and the number of countries is usually small. In spite of these limitations, it is only by gathering evidence from different contexts and analytical strategies that we can make general statements on the effects of the policies or institutions of interest. Since institutions and policies are rarely subject to reforms (and when they are, it is ‘by luck’), we think it would be unwise not to exploit the great opportunity provided by international standardized learning assessments to build knowledge on how schooling policies and institutional arrangements relate to educational outcomes. Yet, modelling strategies have to be transparent, as do the underlying assumptions and the conditions for the validity of the results.
Notes
 1.
Note that this strategy cannot be employed when inequality is conceived as a variability index: family background differentials are expressed as differences between average performances across groups of individuals, whereas variability indexes are not expressed in this form.
 2.
Limitations of pooled-data models when the effects of individual variables vary across countries in standard cross-sectional analyses are discussed in Heisig et al. (2017).
 3.
International Association for the Evaluation of Educational Achievement.
 4.
In particular circumstances, identification of \(\beta\) is possible with pseudo-panel techniques (Contini and Grand 2017).
 5.
 6.
 7.
See also the simulation exercise in the Additional file 1, Appendix.
 8.
See also the simulation exercise in the Additional file 1, Appendix.
 9.
Cross-country variability in \({\gamma }_{1}\) is necessary for model identification. Substantial overlap of the distributions of \({\gamma }_{1}\) between the subgroups of countries is also needed, because we aim to estimate inequalities at t = 2 given inequalities at t = 1, i.e., at the same level of \({\gamma }_{1}\).
 10.
Using estimated values from a previous stage as a dependent variable in a second stage yields downward-biased standard errors for the coefficient estimates, because the second step ignores the estimation error from the first stage. Different software programs provide routines to address this specific issue. In the present context, however, not only the dependent variable is estimated in a previous stage, but also an independent variable (the F-regression coefficient at t = 1). This could bias the estimated effect of the treatment variable I. Since we are not aware of simple solutions to this problem, we set the issue aside. However, given the large sample sizes in the first step (in the countries of interest, N = 3500–8000 in PIRLS 2006 and N = 5000–38,000 in PISA 2012), the measurement error should be small and should not lead to substantial bias in the second-stage estimation.
 11.
We use the number of books in the home and parental education (as reported by children) as measures of SES, as they are the most frequently employed in this strand of literature. This is most likely because: (i) as already stated, the number of books is the single best predictor of achievement; and (ii) both variables are available in both assessments and are affected by lower shares of missing data than the corresponding parental reports, or than information on parental occupation. It must be noted, however, that based on comparisons between children's and parents' reports, and assuming that the latter are correct, these measures do not appear to be highly reliable (Jerrim and Micklewright 2014). In the presence of classical measurement error, the consequence would be an underestimate of the SES effect; but if the errors are similar in size across countries, the country rankings of inequality should not be affected (ibid.). The direction of the bias is difficult to predict in difference-in-differences analyses, because it applies both to an independent variable and to the dependent variable (see also footnote 10, which deals with another source of possible measurement error). Since this issue is beyond the scope of the present paper, we make no further reference to it and assume that our main results are not heavily affected by measurement error in the SES indicators.
 12.
The full set of first step results is available from the authors upon request.
 13.
For example, the strength of the private sector, the degree of school autonomy, and the time devoted to instruction.
References
Abadie, A., Diamond, A., & Hainmueller, J. (2015). Comparative politics and the synthetic control method. American Journal of Political Science, 59(2), 495–510.
Ammermueller, A. (2013). Institutional features of schooling systems and educational inequality: Cross-country evidence from PIRLS and PISA. German Economic Review, 14(2), 190–213.
Bol, T., Witschge, J., Van de Werfhorst, H. G., & Dronkers, J. (2014). Curricular tracking and central examinations: counterbalancing the impact of social background on student achievement in 36 countries. Social Forces, 92(4), 1545–1572.
Bond, T., & Lang, K. (2013). The evolution of the black-white test score gap in Grades K–3: The fragility of results. The Review of Economics and Statistics, 95(5), 1468–1479.
Borgna, C., & Contini, D. (2014). Migrant achievement penalties in Western Europe. Do educational systems matter? European Sociological Review, 30(5), 670–683.
Brunello, G., & Checchi, D. (2007). Does school tracking affect equality of opportunity? New international evidence. Economic Policy, 52, 781–861.
Bryan, M. L., & Jenkins, S. P. (2016). Multilevel modelling of country effects: a cautionary tale. European Sociological Review, 32(1), 3–22.
Caro, D. H., & Biecek, P. (2017). intsvy: An R package for analyzing international large-scale assessment data. Journal of Statistical Software, 81(7), 1–44.
Checchi, D., & Flabbi, L. (2013). Intergenerational mobility and schooling decisions in Germany and Italy: The impact of secondary school tracks. Rivista di Politica Economica, VII–IX(2013), 7–60.
Chmielewski, A. K., & Reardon, S. F. (2016). Patterns of cross-national variation in the association between income and academic achievement. AERA Open, 2(3), 1–27.
Contini, D., & Grand, E. (2017). On estimating achievement dynamic models from repeated cross-sections. Sociological Methods and Research, 46(4), 683–714.
Contini, D., & Scagni, A. (2011). Inequality of opportunity in secondary school enrolment in Italy, Germany and the Netherlands. Quality and Quantity, 45, 441–464.
De Gregorio, J., & Lee, J.-W. (2002). Education and income inequality: New evidence from cross-country data. Review of Income and Wealth, 48(3), 395–416.
Fryer, R. G., & Levitt, S. D. (2004). Understanding the black-white test score gap in the first two years of school. Review of Economics and Statistics, 86(2), 249–281.
Fuchs, T., & Woessmann, L. (2007). What accounts for international differences in student performance? A re-examination using PISA data. Empirical Economics, 32(2), 433–464.
Goodman, A., Sibieta, L., & Washbrook, E. (2009). Inequalities in educational outcomes among children aged 3 to 16. UK: Final report for the National Equality Panel.
Green, A., Preston, J., & Janmaat, J. (2006). Education, equality and social cohesion. A comparative analysis. New York: Palgrave Macmillan.
Guiso, L., Monte, F., Sapienza, P., & Zingales, L. (2008). Culture, gender, and math. Science, 320(5880), 1164–1165.
Hanushek, E. A., & Woessmann, L. (2006). Does educational tracking affect performance and inequality? Differences-in-differences evidence across countries. Economic Journal, 116, C63–C76.
Hanushek, E. A., & Woessmann, L. (2011). The economics of international differences in educational achievement. In E. A. Hanushek, S. Machin, & L. Woessmann (Eds.), Handbook of the economics of education (Vol. 3, pp. 89–200). Amsterdam: North Holland.
Hanushek, E. A., & Woessmann, L. (2015). The knowledge capital of nations: Education and the economics of growth, CESifo Book Series. Cambridge: MIT Press.
Heisig, J. P., Schaeffer, M., & Giesecke, J. (2017). The costs of simplicity: Why multilevel models may benefit from accounting for cross cluster differences in the effects of controls. American Sociological Review, 82(4), 796–827.
Horn, D. (2009). Age of selection counts: a crosscountry analysis of educational institutions. Educational Research and Evaluation, 15(4), 343–366.
Jackson, M. (Ed.). (2013). Determined to succeed? Performance versus choice in educational attainment. Stanford: Stanford University Press.
Jakubowski, M. (2010). Institutional tracking and achievement growth: Exploring difference-in-differences approach to PIRLS, TIMSS, and PISA data. In J. Dronkers (Ed.), Quality and inequality of education: Cross-national perspectives (pp. 41–82). Springer.
Jerrim, J., & Choi, A. (2013). The mathematics skills of school children: How does England compare to the high-performing East Asian jurisdictions? Working Paper of the Barcelona Institute of Economics 2013/12.
Jerrim, J., LopezAgudo, L. A., MarcenaroGutierrez, O. D., & Shure, N. (2017). What happens when econometrics and psychometrics collide? An example using the PISA data. Economics of Education Review, 61, 51–58.
Jerrim, J., & Micklewright, J. (2014). Socioeconomic gradients in children’s cognitive skills: Are crosscountry comparisons robust to who reports family background? European Sociological Review, 30(6), 766–781.
Kerr, S. P., Pekkarinen, T., & Uusitalo, R. (2013). School tracking and development of cognitive skills. Journal of Labor Economics, 31(3), 577–602.
Lippman, H., Anderson Moore, K., & McIntosh, H. (2011). Positive indicators of child wellbeing: A conceptual framework, measures, and methodological issues. Applied Research Quality Life, 6, 425–449.
Malamud, O., & PopEleches, C. (2011). School tracking and access to higher education among disadvantaged groups. Journal of Public Economics, 95(11–12), 1538–1549.
Meghir, C., & Palme, M. (2005). Educational reform, ability, and family background. American Economic Review, 95(1), 414–424.
Mullis, I. V. S., Martin, M. O., Foy, P., & Drucker, K. T. (2012). PIRLS 2011 International Results in reading. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Mullis, I. V. S., Martin, M. O., Foy, P., & Arora, A. (2012). TIMSS 2011 International Results in math. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
OECD. (2014). PISA 2012 results in focus: What 15-year-olds know and what they can do with what they know. https://www.oecd.org/pisa/keyfindings/pisa-2012-results-overview.pdf.
Penner, A. M. (2008). Gender differences in extreme mathematical achievement: An international perspective on biological and social factors. American Journal of Sociology, 114(S1), S138–S170.
Piopiunik, M. (2014). The effects of early tracking on student performance: Evidence from a school reform in Bavaria. Economics of Education Review, 42, 12–33.
Reardon, S. F. (2011). The widening academic achievement gap between the rich and the poor: New evidence and possible explanations. In G. J. Duncan & R. J. Murnane (Eds.), Whither opportunity? Rising inequality, schools, and children’s life chances. Russel Sage Foundation.
Ruhose, J., & Schwerdt, G. (2016). Does early educational tracking increase migrant-native achievement gaps? Differences-in-differences evidence across countries. Economics of Education Review, 52, 134–154.
Schuetz, G., Ursprung, H. W., & Woessmann, L. (2008). Education policy and equality of opportunity. Kyklos, 61(2), 279–308.
Waldinger, F. (2007). Does ability tracking exacerbate the role of family background for students’ test scores? Working Paper of the London School of Economics.
Woessmann, L. (2005). Educational production in Europe. Economic Policy, 20(43), 445–504.
Woessmann, L. (2010). Institutional determinants of school efficiency and equity: German states as a microcosm for OECD countries. Jahrbücher für Nationalökonomie und Statistik, 230(2), 234–270.
Woessmann, L. (2016). The importance of school systems: Evidence from international differences in student achievement. Journal of Economic Perspectives, 30(3), 3–32.
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data (2nd ed.). Cambridge, MA: MIT Press.
Zuckerman, G. A., Kovaleva, G. S., & Kuznetsova, M. I. (2013). Between PIRLS and PISA: The advancement of reading literacy in a 10–15-year-old cohort. Learning and Individual Differences, 26, 64–73.
Supplementary information
Additional file 1:
Additional material.
About this article
Cite this article
Contini, D., Cugnata, F. Does early tracking affect learning inequalities? Revisiting difference-in-differences modeling strategies with international assessments. Large-scale Assess Educ 8, 14 (2020). https://doi.org/10.1186/s40536-020-00094-x
Keywords
 International assessments
 Test scores
 Achievement inequalities
 Cross-country analyses
 Educational systems
 Early tracking
 Difference-in-differences