 Research
 Open Access
 Published:
Schoollevel inequality measurement based categorical data: a novel approach applied to PISA
Largescale Assessments in Education volume 9, Article number: 9 (2021)
Abstract
This paper introduces a new method to measure schoollevel inequality based on Item Response Theory (IRT) models. Categorical data collected by largescale assessments poses diverse methodological challenges hinder measuring inequality due to data truncation and asymmetric intervals between categories. I use family possessions data from PISA 2015 to exemplify the process of computing the measurement and develop a set of countrylevel mixedeffects linear regression models comparing the predictive performance of the novel inequality measure with schoollevel Gini coefficients. I find schoollevel inequality is negatively associated with learning outcomes across many nonEuropean countries.
Introduction
Although the relevance of socioeconomic factors as predictors of children’s cognitive learning attainment is a highly disputed issue in terms of causality (Mayer, 1997), there is extensive and longstanding research recognising their important role in explaining educational disparities in terms of access and outcomes (Coleman, 1966; Del Bello et al., 2015). Furthermore, research from a range of disciplines has highlighted a negative association between socioeconomic disparity and individual outcomes, offering various explanations for a detrimental role of inequality on domains such as health and subjective wellbeing (Deaton, 2003; Schneider, 2016; Wilkinson & Pickett, 2006).
Socioeconomic variables play also an important role in LargeScale Assessments to explain or control for differences among groups in terms of learning outcomes and other variables of interest (Hopfenbeck et al., 2018). However, the possible interplay between schoollevel inequality and educational outcomes has been less addressed. Although previous research has developed alternatives to address the measurement of inequality based dichotomous or ordinal data, there has not been to my knowledge an alternative that computes inequality in the same statistical framework used in LargeScale Assessments by using Item Response Theory (IRT). In this paper, I develop a novel method to measure schoollevel assets inequality utilising IRT models based on the discrimination parameter \(\alpha\). The proposed inequality measure computes the dispersion of the data at a certain aggregated level—such as schools or countries. The measure allows both to rank observations in terms of inequality, and to compare the average of inequality across the schools. I exemplify this case computing inequality based on PISA in 2015 home possessions index (HOMEPOS).
The remainder of the paper is structured as follows. “Socioeconomic measurement in PISA” section discusses the role and limitations of socioeconomic variables in PISA and International LargeScale Assessments (ILSAs), while “The complexity of measuring inequality based on categorical data” section reviews the relevant literature regarding the measurement of inequality using categorical data, discussing the main methods previously developed in recent literature. “Alpha inequality: inequality based on an Item Response Theory paradigm” section briefly introduces IRT and summarises the methodological construction process of the inequality measure, named as Alpha Inequality. “Methods” section introduces the criteria used to analyse Alpha Inequality and the data used in the empirical section. “Results and discussion” section presents the findings of the construction process of Alpha Inequality and a comparative analysis of results with a Gini coefficient in terms of descriptive and inferential parameters, while “Conclusion” section concludes the study.
Socioeconomic measurement in PISA
The relevance of socioeconomic background questions in PISA as well as in ILSAs is twofold. First, socioeconomic variables are constantly used as control regressors as well as in the analysis of equality of opportunities of educational systems. For instance, PISA reports differences among scores within quintiles of wealth and report gaps explained by less privileged socioeconomic backgrounds (OECD, 2016). Second, due to the nature of PISA and other ILSAs, where there is limited time to cover diverse aspects of knowledge, students are exposed only to a portion of cognitive tests. Subsequently, socioeconomic information is used as auxiliary information to impute final learning scores, through a technique called plausible values, which are “drawn from a posteriori (data) distribution by combining the IRT scaling of the test items with a latent regression model using information from the student context questionnaire in a population model” (OECD, 2017, p. 128).
Extensive research has been done analysing background questionnaires in PISA, showing diverse limitations on socioeconomic indicators. For instance, there is evidence of crosscountry comparability deficiencies within and between PISA cycles (Lee & Von Davier, 2020; SandovalHernandez et al., 2019) and poor model fit (Rutkowski & Rutkowski, 2013). One of the main consequences is the distortion of achievement estimates—see, for example, Rutkowski (2011, 2014) and also Rutkowski and Zhou (2015). Additionally, prior research also reports deficiencies regarding the cultural validity of some questions. For instance, there is a particular bias towards describing better contexts of developed countries, such as the number of questions that reflect a certain type of cultural possession (Rutkowski & Rutkowski, 2010, 2013), The greater access to electronic goods or internet in current days does not necessarily differentiate among higher and lower classes as could happen in a recent past (Avvisati, 2020).
Turning specifically to HOMEPOS in PISA 2015, I observe questions’ wording that raises concerns regarding their weight in the index computation. For instance, 6 of the common 22 questions (27%) refers to the possession of different books, while 4 questions (18%) refer to electronic possessions. In that dimension, two questions present similar topics (‘Computers [desktop computer, portable laptop, or notebook]’ and ‘A computer you can use for school work’, which presents a strong polychoric correlation, r(492,640) = 0.739, p < 0.001). Additionally, there is one general question that does not seem to reflect socioeconomic status (‘a quiet place to study’), but an educational or academic environment. Finally, the question asking about the possession of ‘works of art’ at home is open to diverse interpretations, which may confuse respondents. This last question parameter is not included in official reports, although it was not formally excluded from the index (OECD, 2016, 2017).
Another relevant topic relates to the national items—three questions used by each country, which has been praised as a step forward in terms of each country better contextualisation of socioeconomic status (Rutkowski & Rutkowski, 2013). However, diverse points can be raised about those questions: first, they do not necessarily discriminate socioeconomic status but household choices (e.g., expresso machine in France or cultural television programs with payment in Albania). Second, they may refer to outdated technology (‘BluRay player’ in Mexico) or are biased towards specific sensitivities (‘Violin/Cello’ in Hong Kong, ‘Piano or violin’ in Taipei and Macao, or a ‘piano’ in the Netherlands). Third, only in a few cases, they relate to the possessions of luxury goods (‘summer residence’ and ‘swimming pool’ in Malta), which produce extreme parameters. It is also possible to detect redundancy of those national questions with the common questions. For instance, many questions regarding electronics are repeated (e.g. ‘laptop’ in Moldova and Finland or ‘tablet’ in Norway, Spain, Switzerland and UK; ‘musical instruments’ in the United States; an ‘encyclopaedia’ in Colombia), while local dependencies and inconsistencies among answers are not explicitly assessed by PISA (Avvisati, 2020). Finally, it is possible to find important differences in terms of factor loadings among countries (OECD, 2017), which suggests room for improvement in terms of capturing wealth in families. Additionally, one of the tradeoffs of extending national items in HOMEPOS is the difficulty to address crosscountry comparability issues using fewer common items across countries. While many criticisms can be made to HOMEPOS highlighting limitations and challenges, there still are a relevant source to be used with caution to shed light on the role of socioeconomic differences in schools.
The complexity of measuring inequality based on categorical data
Measuring inequality based on ordinal or binomial data—or a mixture of both, portrays a set of methodological challenges. First, certain distributional statistics such as the mean or variance or standard deviation cannot be properly drawn (Cowell & Flachaire, 2017; Zheng, 2008). Proportions and modes will be appropriate tools to analyse this type of data. Second, in many cases, ordinal data depict an arbitrary scale or asymmetric intervals in their response alternatives, which may also bias the analysis. For instance, a 5point Likert scale question does not necessarily represent the same difference between pairs of options. I could either choose the category to ‘agree’ or ‘strongly agree’—both options are closer in my mind in this case—with an opinion regarding certain policy addressing inequality within schools, although I will never choose the middlepoint category—‘neither agree nor disagree’—because I understand as very far from the ‘agree’ I might have chosen.
One of the consequences of dealing with categorical data is that traditional inequality measures, such as the Gini coefficient and generalised entropy indexes—for example, Theil or Atkinson indexes, which refer to inequality as a deviation from the mean or are meannormalised, cannot be suitably employed to measure inequality using categorical raw data (Cowell & Flachaire, 2017; Zheng, 2011).
Recent research has been developing alternatives to develop inequality measurements based on categorical data. Allison and Foster (2004) suggest comparing onevariable cumulative distributions of Likerttype questions by ordering the data and identifying the distance from the median as an inequality measure. As they mention, their method only applies when each case’s median coincides among them. Additionally, this method does not meet a desirable characteristic of any inequality index—the normalization axiom, where a distribution of identical observations, where there is total equality, desirably portrays a zero value. Based on that seminal idea, Abul Naga and Yalcin (2008) introduce a family of inequality indices based on the analysis of one variable normalising different questions’ scales. Under their method, different Likertscale questions—portraying 3, 5 or 7 alternatives—can be compared in terms of inequality. Zheng (2011) extends the approach to measuring inequality based on two variables. However, if the median does not provide an adequate reference for inequality—for example, when there is skewness on data, all previously measures may not capture the extent of the inequality.
A second approach developed to address this limitation is proposed by Cowell and Flachaire (2012, 2017). Instead of using the median as a reference, they compute inequality relative to a reference status. They suggest counting ranking positions of all observations and expressing them as proportions of the population. The measure could be either ‘downwards’ or ‘upwards’ in terms of relative position on a scale. Although very suggestive, this method does not seem adequate for measuring assets inequality due to the multivariate nature of a continuous wealth trait. However, the idea of maintaining the ordinality of the scales and ranking them rather than measuring inequality remain concepts in my proposed approach.
A third approach that addresses multiple variables consists of computing inequality based on latent variable methods. For instance, Mckenzie suggests a relative inequality measure towards identifying subpopulations’ disparity based on a polychoric Principal Component Analysis index data (2005). His method computes each subpopulation’s standard deviations divided by the variance explained by the first principal component, which additionally allows the comparisons of subgroups to the overall population inequality. The idea of ratios and comparing to the overall inequality average are kept in my proposal. In this case, IRT is chosen over polychoric PCA as a specific approach to model categorical data.
Finally, at least three caveats can be drawn when assessing schoollevel inequality based on HOMEPOS. First, HOMEPOS is derived through a posterior weighted maximum likelihood estimation (WLE), which assumes a normal distribution (Warm, 1989). In the case of PISA 2015, significant differences between countries occur in terms of the mean of HOMEPOS while there are fewer variations in the distribution across countries (see Fig. 5 in Annex 2). Second, simulations show that WLE tends to overestimate withinschool variance (OECD, 2009). This is relevant for our case as schoollevel inequality is relative to the variance of school HOMEPOS. Third, WLE is sensitive to ceiling and floor effects if items are too easy or difficult, respectively. This contradicts another desired property of any inequality measure—scale invariance, where proportion changes to answers should not modify inequality. For example, if we add 10% of wealth to everyone, inequality remains the same as previous. Finally, as WLE are only a single possible realization of the estimation it does not addresses the uncertainty of the model, which could be adapted by using plausible values as independent variables (Pokropek, 2015). However, to address current limitations with measuring inequality based on WLE, I compute inequality based on the raw answers of family possessions rather than using the derivedindex HOMEPOS.
Alpha inequality: inequality based on an Item Response Theory paradigm
Item Response Theory models
The proposed inequality measure—hereafter, Alpha Inequality—builds upon the discrimination parameter from IRT models. IRT is a statistical family of latent construct analysis that focuses on categorical data and is mainly used in educational and psychological fields. IRT assumes that each person has a certain level—called individual trait—of an unobservable continuous construct (e.g., knowledge, competences, attitudes) that predict the probability of answering correctly or endorsing an observable item (e.g., cognitive questions). In this case, the higher the possession of the construct—family wealth, the higher the probability of answering the possession of an item—electronic good.
It is based on the notion that the probability of a correct response or endorsement to an item is a function of both the person’s trait and certain item parameters—such as difficulty, discrimination or pseudo guessing (Embretson & Yang, 2006). The item parameters determine the information offered by each item to any person’s trait level.
The simplest IRT model is often called the Rasch model (Rasch, 1960). According to the Rasch model, an individual’s response to a binary item (i.e., right/wrong, agree/disagree) is determined by the individual’s trait level and one item parameter—the difficulty of the item. Because this model uses the logistic density function and uses a single item parameter, it is called the oneparameter logistic model (1PL) (Fischer, 1995)—although there are some conceptual differences between Rasch and 1PL. Other IRT models have been developed covering ordinal and nominal data; adding parameters to the logistic function such as the discrimination or guessing parameters (Embretson & Yang, 2006); and also using distinct methods towards dichotomising data for the analytical modelling process.
For instance, in 2015, PISA uses two IRT models: the generalised partial credit model (GPCM) (Muraki, 1992) for multiitem questions and the twoparameter logistic model for dichotomous items. In both cases, it adds the item discrimination parameter \(\alpha_{i}\) to the function, which will be explained later. The GPCM presents the following notation:
which expresses the probability of an individual \(i\) correct response (or endorsement) \(X_{i}\) to an item \(j\) for the total number of categories K of each question. \(\theta_{j}\) represents the individual’s trait level, while \(\beta_{k}\) refers to the item difficulty or location. The parameter \(ak_{k}\) indicates the ordering of the categories from 0 to \(k  1\) (Chalmers, 2012).
The discrimination parameter \(\alpha_{i}\) represents the degree to which an item differentiates between respondents in different regions of the measured latent trait \(\theta_{j}\) (in this case, household possessions). The parameter defines the steepness of the slope when \(P\left( \theta \right) = 0.5\), where higher values suggest a better separation between individuals with higher and lower latent traits. Therefore, if \(\alpha_{i} \to \infty\), the item represents a perfect separation between those who respond correctly, in this case, have a specific possession, and those who do not have ownership of it. Figure 1 is a simulated example of item characteristics curve (ICC) for three items, where item 3 has a higher discrimination parameter than the other two items because it 3 shows a steeper curve than items 1 and 2. The item discrimination parameter \(\alpha_{i} { }\) reflects the sensitivity of the response probability to trait levels changes (Embretson & Yang, 2006) and gives information on the importance of the item to the individual trait—in this case, how relevant possessing certain good reflects family wealth.
Now I depart from the usual IRT parameter interpretation to turn into the consideration of inequalities. First, let us remember that inequality is an aggregated measure and not an individual condition. Therefore, we can think the latent trait as a continuum of equality (or inequality) of wealth for all respondents. In the hypothetical case that all respondents fall into the same value of \(\theta\), then the item represents an egalitarian condition—irrespective of the location in the xaxis of \(P\left( \theta \right) = 0.5\), where values in the left of the axis would represent poverty while in the right would represent richness). If the same occurred for all items, then there will be a status of full egalitarianism. Additionally, as the parameter defines the steepness of the ICC, larger item discrimination also means that the gap between those that are below the 50% probability of endorsing the item and those over that threshold has greater weight in terms of splitting individuals in the trait. The Alpha Inequality is based on this interpretation of the discrimination parameter.
Developing Alpha Inequality
The building process of Alpha Inequality, \(I_{j} \left( x \right)\), of any economic variable of interest—in this case, household assets possession—implies the following steps. First, the method involves modelling any IRT or latent variable model that considers the binary or ordinary nature of the responses—such as the graded response model, continuation ratio model, among many—and assumes the existence of a discrimination parameter that differs between items—which is not the case of a 1PL model. In this example, I use GPMC for polytomous questions and 2PL for binary items to coincide with the PISA 2015 modelling strategy.
The first step involves computing the IRT models for each item used in building the index and extracting the \(\alpha_{i}\) parameters. The second step consists of normalising all answers alternatives, \(\varsigma_{i} ,\) into the same range of values, in this case, from 0 to 1. This is done to give the same importance to polytomous and binomial questions in terms of a similar contribution to the inequality measure. The third step involves the sum of the product of each parameter \(\alpha_{i}\) and the observation score \(\varsigma_{ij}\) for each observation (person), \(j\), of the dataset. This is noted as follows:
In the case of missing data, I weight each observation \(j\) according to the number of questions answered, \(q_{j}\) to differentiate questions not answered from the absence of possession of an item, such as in:
The final step implies computing the inequality measure for each school, \(I_{\varphi }\), which allows comparing between school, as well as assessing if schools reach an egalitarian status, where \(I_{\varphi } = 0\). The inequality measure for each school \(\varphi\) is computed as the ratio between the standard deviation of \(\omega \xi_{j}\) by the standard deviation of the entire population \(c\), in this case, each country, \(\xi_{c}\), which can be expressed as:
Following McKenzie (2005), this provides additional information such as if \(I_{\varphi }\) is greater that one, the school displays more inequality than the country average inequality.
Every inequality measure has some properties to fulfil to provide reliable information regarding the distribution of any variable, in this case, wealth: scale and anonymity invariance, population independence, and binding the Pigou–Dalton transfer principle (Cowell, 2016). The Lemma containing how \(I_{\varphi }\) fulfills all main axioms and its proof can be found in Annex 1.
Methods
Data
I use the wealth index, HOMEPOS from PISA 2015 to exemplify and evaluate the performance of Alpha Inequality. PISA 2015 collects data from dichotomous and ordinal questions on 25 household indicators across 73 countries and economies. The target population and sampling strategy aimed to represent the universe of 15yearold students enrolled in each educational system. Students are sampled following a stratified design, where a minimum of 150 schools with proportional probabilities to the student population is initially selected. The minimum sample expected by a school is 20 students to ensure adequate accuracy in estimating between and within schools variance (OECD, 2017).
HOMEPOS is computed based on data collected from three student’s questions (ST011, ST012, ST013), with 25 questions covering different household assets and characteristics. Question ST011 displays two sets of dichotomic questions (possible answers: ‘yes’, ‘no’): thirteen that are common to all countries and three questions which differ by each country (called national items). Question ST012 displays eight 4response option questions (possible answers are: ‘none’, ‘one’, ‘two’ and ‘three or more’), common to all countries, while Question ST013 present one questions with six scales (with the following possible answers: ‘0–10 books’, ‘11–25 books’, ‘26–100 books’, ‘101–200 books’, ‘201–500 books’, and ‘More than 500 books’).
Following PISA’s criteria (OECD, 2017), I subset those observations with at least 3 answers on the HOMEPOS scale and no missing values for the computed HOMEPOS scale. I exclude observations from schools with less than 20 observations. Additionally, data from two USA states and Puerto Rico, which did not provide identification of schools, are also excluded. The sample was reduced from 519,334 to 454,734 observations belonging to 69 countries, administrative regions, and economies and 13,387 schools. Descriptive statistics per country used in this study are in Tables 1 and 7 in Annex 2 shows the frequency of observations per country.
PISA’s modelling strategy for HOMEPOS is a twostep process. First, a multiple group IRT twoparameter model is estimated (GPCM for ordinal questions and 2PL for dichotomous questions). Subsequently, HOMEPOS is computed based on the posterior weighted maximum likelihood estimation (WLE) (OECD, 2017). As HOMEPOS published parameters by PISA are estimated from a sample and do not reflect the observations used in this study (OECD, 2017), I replicate the first step of PISA’s modelling strategy to extract the \(\alpha\) discrimination parameters for each country and items. Following PISA, I estimate 22 common questions with equal parameters while 3 questions had parameters freely estimated per country. Correlations between PISA’s HOMEPOS and the replicated index are over 0.939 for each country (see Table 8 in Annex 2).
Great variability is seen in terms of discrimination across items (Table 2), where, for instance, the questions ‘book of poetry’ and ‘classic literature’ present lower values, and in the opposite side, ‘internet access’ and ‘computers’ present the highest values among the common parameters.
There is also large variability in the parameters of the nationalspecific items, shown in Table 3. For instance, some countries present higher values in all three items, such as the case of Thailand, while the opposite also occurs, such as in the case of the United Kingdom. Germany is the only case that presents a negative discrimination parameter for the question ‘A TV in your own room’. A negative discrimination parameter suggests the latent trait diminishes with the ownership of the good.
As the objective of the study is to exemplify the construction of the inequality measure, I do not address and evaluate model fit and invariance analysis. I rely on PISA’s item invariance analysis—named root mean square deviance (RMSD), which states that invariance of HOMEPOS items across countries was analysed and ‘unique parameters were assigned if necessary’ (OECD, 2017, p. 342). However, as I was previously mentioned, prior research reports dispute the reliability and validity of socioeconomic scales in PISA. I acknowledge those limitations and focus, on the present study, only on the methodological contribution of building an inequality measure.
Criteria to assess Alpha Inequality validity
The strategy chose to examine Alpha Inequality assessing its validity in comparison to prior evidence and comparing results to a wellknown inequality index based on HOMEPOS such as the Gini coefficient. The Gini coefficient is computed based on HOMEPOS applying a correction for finite populations (Nygärd & Sandström, 1985). HOMEPOS was transformed into a range of positive values [0, 15.457] to address a requirement of the Gini coefficient computation.
First, I compare crosscountries rankings statistics from both measures and exemplify the relevance of inequality on learning scores in the case of the USA by comparing schools at both extremes of the inequality continuum.
Second, I model a set of textbook regressions to examine how Alpha Inequality and the Gini coefficient are associated with Mathematics scores. For each country, I fit two sets of twolevel mixedeffects linear models, allowing random intercepts to vary at schoollevels. This addresses the hierarchical structure of PISA, where students are nested in schools. Formally, the equation of twolevel random intercept model reads as:
where \(Y_{ij}\) denotes the outcome variable for the \(i\)th observation (student) of group \(j\) (School), \(\beta_{0j}\) the school intercepts (which are random variables enabling the quantification of the differences between groups). \(\beta ^{\prime}{\text{s}}\) are regression parameters invariant across groups. The different inequality measures are denoted by \(x_{1ij}\), while \(u_{j}\) is the groupdependent deviation from the intercept mean and \(\in_{ij}\) represents the error term. HOMEPOS was included in the model due to the influence of the difficulty parameter on the posterior estimation of HOMEPOS, which may allow a better understanding of the role of an inequality measure independent from the wealth possessions.
There are three key methodological considerations which should be considered when modelling data from PISA. First, it is important to consider that PISA is based upon a twostage stratified sampling strategy to select schools and students. I address this using sampling weights to account for differences in the probabilities of students, classes and schools being selected in the sample (Rutkowski et al., 2010). Considering a multilevel analysis setting, I follow current PISA’s practice since 2012 (OECD, 2017) using weights both at the student and school levels in the regression analysis. For the student level, I scale student weights following RabeHesketh and Skrondal (2006), which adjusts students’ weights by the ratio of the school size and the sum of students’ weighs, as follows:
School level weights correspond to the sum of \({\text{W}}\_{\text{FSTUWT}}_{RH  S}\) for each school.
Secondly, due to PISA’s design, tests scores are estimated as plausible values, where each student has 10 different marks. To address this uncertainty, I apply Rubin’s rules for handling multiple imputations (Rubin, 1987) both in terms of computing schools averages and modelling regressions for each plausible value, where I compute adjusted sets of coefficients and standard error estimates and join them in a final estimate. Finally, due to the stratified multistage sampling design mentioned earlier, I estimate the uncertainty associated with the sampling using PISA’s approach—Fay’s modification of the balanced repeated replication (BRR) method, which allows computing the sampling variance.
Item parameters are estimated through an iterative marginal maximum likelihood approach (Bock & Aitkin, 1981), using the expectation–maximization algorithm provided by mirt package (Chalmers, 2012) in statistical software R (R Core Team, 2020) and statistical analysis was performed using package BIFIEsurvey (Robitzsch & Oberwimmer, 2015).
Results and discussion
Comparison between two schoollevel inequality measurements
Comparisons between countries are only feasible if we assume the existence of measurement invariance across countries, which allows further inferential analysis in the same metric. Conditionally to the assumption of measurement invariance claimed by PISA (OECD, 2017, p. 342). Table 3 presents the average inequality per country and the inequality coefficient of variation (CV) for both inequality measurements. While Alpha Inequality/Gini aims to assess the level of schoollevel inequality per country, CV provides a sense of the variability of inequality within the educational system.
Looking at the Alpha Inequality values, countries from Latin America and South Asia such as Peru, China (4 cities), Indonesia, Thailand and Colombia present the lowest values of Alpha Inequality and, at the same time, high values of CV. The opposite occurs with countries such as Iceland, Finland, Estonia, Poland, and Norway, which present Alpha Inequality close to 1 while having low values of CV. This suggests important differences between the two groups of countries. The first group of countries are characterised by educational systems with socioeconomically more homogeneous schools and larger degrees of segregation between schools, dividing poor and rich in different schools. The second group presents relatively smaller socioeconomic differences between schools while having larger within schools’ economic diversity. This coincides with recent research focused on the analysis of segregation on different waves of PISA (Gutiérrez et al., 2019). Additionally, Alpha Inequality allows comparisons between countries (Table 4). For instance, Iceland, Kosovo, Moldova, Montenegro, Iceland, New Zealand, and Qatar present more than 35% schools with schoolinequality above their national average, while Indonesia, Israel, Peru, China (4 cities) and Thailand only present less than 5% schools above the national average of inequality (see Table 9 in Annex 2).
Figure 2 shows the distribution of Alpha Inequality for each school by countries. Alpha Inequality presents different distributions across countries, as could be expected based on prior cross country analysis (Thomas et al., 2001). In some cases, they approximate to Gaussian functions, such as the case of Brazil, Indonesia, and Australia, while in other cases there are bimodal distributions such as in the case of Malta, Macedonia, and Trinidad and Tobago. In many cases, kurtosis and skewness are relevant features to be observed on the distributions and inferential analysis.
On the other hand, the Gini index presents, in general, very low coefficients across countries and schools. National averages range between 0.003 and 0.006; countries such as the Netherlands, Denmark and the Slovak Republic show the smallest values, while Trinidad and Tobago, Qatar and Algeria display the largest. However, countries like Denmark and the Slovak Republic present high coefficients of variation, which contradicts previous empirical evidence on segregation in schooling systems (Gutiérrez et al., 2019). Figure 3 shows school-level Gini density functions for each country; in general, they present heavy-tailed distributions, with Macedonia and Montenegro as bimodal exceptions.
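For reference, school-level Gini coefficients of the kind discussed here can be computed from a non-negative asset index with the standard rank-weighted formula. A minimal self-contained sketch in Python; the input values are illustrative, not PISA data:

```python
def gini(values):
    """Gini coefficient via the rank-weighted (mean absolute difference)
    formula; values must be non-negative."""
    x = sorted(float(v) for v in values)
    n = len(x)
    total = sum(x)
    if total == 0:
        return 0.0
    # sum of rank * value over the sorted sample
    rank_weighted = sum(i * v for i, v in enumerate(x, start=1))
    return 2.0 * rank_weighted / (n * total) - (n + 1.0) / n

# Nearly equal possessions yield a very small coefficient, consistent in
# magnitude with the national averages reported above
print(gini([9.9, 10.0, 10.1, 10.0]))
```

Because household possession counts within a school tend to be tightly clustered, coefficients of this magnitude (well below 0.01) are exactly what the national averages above reflect.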
Country-level correlations between both inequality measurements present an overall mean of \(0.612 \left( {SD \;0.131} \right)\), ranging from \(0.105\) (Israel) to \(0.846\) (Qatar) (Table 10 in Annex 2).
To examine the impact of differences between the two measurements, I turn to the case of the USA, for which there is more prior empirical analysis on segregation and inequality. The Gini coefficient reveals no difference in average school-level mathematics scores between schools in the top 20% and the bottom 20% of the Gini index. This contradicts prior estimations (Rutkowski et al., 2018) as well as cross-country studies that focus on segregation levels in USA schools and educational scores (Benito et al., 2014; OECD, 2018). By contrast, Fig. 4 shows that schools with the lowest Alpha Inequality outperform schools with the highest inequality by 0.57 standard deviations in average mathematics score, with statistically significant differences between groups, t(60.36) = − 7.01, p < 0.001. This represents about two more years of schooling according to PISA (OECD, 2009).
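The reported statistic, t(60.36) = −7.01, has the form of Welch's unequal-variance t-test with Welch–Satterthwaite degrees of freedom. A minimal stdlib-only sketch of that test; the input vectors are hypothetical placeholders, not the actual school means:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with unequal variances."""
    va, vb = variance(a), variance(b)   # sample variances (ddof = 1)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb             # squared SE of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical school-mean maths scores for low/high Alpha Inequality quintiles
low_ineq = [512.0, 498.0, 530.0, 505.0, 521.0]
high_ineq = [471.0, 455.0, 480.0, 462.0, 490.0]
t, df = welch_t(low_ineq, high_ineq)
print(f"t = {t:.2f}, df = {df:.2f}")
```

The fractional degrees of freedom (60.36 in the paper's comparison) are the signature of the Satterthwaite approximation rather than the pooled-variance test.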
Models’ coefficients
Results from country-level mixed-effects regression models with Alpha Inequality as a predictor of mathematics score can be seen in Table 5. I find that 67 out of 69 countries show statistically significant negative parameters, while for Indonesia and Vietnam the null hypothesis that the parameter equals 0 cannot be rejected under a standard cutoff of \(p < 0.05\).
On the other hand, Table 6 presents the estimations of regression parameters using the Gini coefficient for each country. In this case, the number of countries not showing a statistically significant association rises to five: Estonia, Iceland, Latvia, the United Kingdom, and the United States. The case of the United States, as previously discussed, raises concerns about the reliability of the Gini-based estimation, given that no statistically significant association is found despite the empirical evidence in the literature. Additionally, Luxembourg is the only case with a positive coefficient for the slope of school-level inequality on mathematics scores.
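For clarity, the country-level mixed-effects models behind Tables 5 and 6 can be written as a two-level random-intercept specification. The form below is a plausible minimal sketch (the paper's exact covariate set and weighting scheme are not reproduced here), with student \(i\) nested in school \(s\):

```latex
% Minimal random-intercept specification (illustrative)
y_{is} = \beta_0 + \beta_1 \,\mathrm{Ineq}_{s} + u_{s} + \varepsilon_{is},
\qquad u_{s} \sim N\!\left(0, \tau^2\right), \quad
\varepsilon_{is} \sim N\!\left(0, \sigma^2\right),
```

where \(y_{is}\) is the mathematics score, \(\mathrm{Ineq}_{s}\) is either Alpha Inequality or the school-level Gini coefficient, and a negative \(\beta_1\) corresponds to the negative associations reported in the tables.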
Conclusion
This paper has shown that a set of multivariate household possessions collected as categorical data can be used to derive a novel measure of inequality. The proposed measure is independent of the scale of wealth and fulfils the main properties of inequality measures. Additionally, Alpha Inequality allows for comparisons both between and within countries.
Computing school-level inequality using PISA 2015 data, I find a consistent, significant negative association between school-level inequality and mathematics scores across countries, the great exception being a majority of European countries. The results also suggest that the proposed inequality measure outperforms the Gini coefficient in assessing the association between school-level inequality and learning outcomes. This is consistent with previous research identifying different levels of inequality within and across countries. Where negative effects of inequality are documented, Alpha Inequality better captures the relevance of socioeconomic disparities between schools in terms of learning scores.
There are important limitations to be acknowledged. First, while improvements to socioeconomic scales such as HOMEPOS focus on updating items to represent wealth in current times, cross-compatibility and model fit become prerequisites for applying the measure and studying the effects of school-level inequality thoroughly. Further research could point in different directions, such as assessing inequality in cognitive and non-cognitive educational outcomes across different waves of PISA, as well as the interplay between inequality, segregation, and educational outcomes.
Second, there is a methodological debate regarding the inclusion of survey design weights in IRT scoring procedures to account for the complex sampling designs and nested structure of item response data in PISA and other ILSAs, for instance through multilevel item response models and different weighting strategies (Zheng & Yang, 2016).
Third, alternative sampling weight scaling methods at both levels were explored (Mang et al., 2021), addressing the complexity of using within- and between-cluster weights in multilevel clustered analysis. Although the number of statistically significant models varied, similar negative coefficients were found in all cases, and models with Alpha Inequality predictors were consistently more sensitive than those with Gini. However, in some weighting configurations, large standard errors suggested model identification or convergence issues.
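One widely used within-cluster scaling examined in this literature (often called cluster-size or "Method A" scaling; see Mang et al., 2021) rescales student weights so that they sum to the school's sample size. A minimal sketch with hypothetical weights:

```python
def scale_within_weights(weights):
    """Cluster-size scaling: rescale within-school student weights so that
    they sum to the number of sampled students in the school."""
    n = len(weights)
    factor = n / sum(weights)
    return [w * factor for w in weights]

# Three students in one school with unequal design weights
print(scale_within_weights([2.0, 2.0, 4.0]))  # -> [0.75, 0.75, 1.5], sums to 3
```

The relative weights within the school are preserved; only their total changes, which is what makes level-1 variance estimates comparable across schools of different sizes.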
This is relevant because the sample design in PISA is informed by school socioeconomic attributes, and the estimation of parameters (discrimination among them) could be affected by the lack of weights. Further research could address the relevance of weighting IRT models to account for socioeconomic sampling variance. In this paper, I mimic PISA's single-level IRT modelling strategy and address the stratified complex sampling design in the multilevel regression analysis by including replicate and scaled weights.
Availability of data and materials
The datasets used and analysed during the current study are available from the corresponding author on reasonable request.
References
Abul Naga, R. H., & Yalcin, T. (2008). Inequality measurement for ordered response health data. Journal of Health Economics, 27(6), 1614–1625
Allison, R. A., & Foster, J. E. (2004). Measuring health inequality using qualitative data. Journal of Health Economics, 23(3), 505–524
Avvisati, F. (2020). The measure of socio-economic status in PISA: A review and some suggested improvements. Large-scale Assessments in Education. https://doi.org/10.1186/s40536-020-00086-x
Benito, R., Alegre, M. À., & Gonzàlez-Balletbò, I. (2014). School segregation and its effects on educational equality and efficiency in 16 OECD comprehensive school systems. Comparative Education Review, 58(1), 104–134
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29
Coleman, J. (1966). Equality of educational opportunity.
Cowell, F. (2016). Inequality and poverty measures. In M. Adler & M. Fleurbaey (Eds.), The Oxford handbook of well-being and public policy. Oxford University Press.
Cowell, F. A., & Flachaire, E. (2012). Inequality with ordinal data. In European economic association and econometric society conference, Málaga. https://doi.org/10.1111/ecca.12232
Cowell, F. A., & Flachaire, E. (2017). Inequality with ordinal data. Economica, 84(334), 290–321
Deaton, A. (2003). Health, inequality, and economic development. Journal of Economic Literature, 41(1), 113–158
Del Bello, C. L., Patacchini, E. & Zenou, Y. (2015). Neighborhood effects in education. IZA Discussion Papers.
Embretson, S. E., & Yang, X. (2006). Item response theory. In J. Green, G. Camilli, P. Elmore, A. Skukauskaiti, & E. Grace (Eds.), Handbook of complementary methods in education research. (pp. 385–409). Routledge.
Fischer, G. H. (1995). The linear logistic test model. Rasch models. (pp. 131–155). Springer.
Gutiérrez, G., Jerrim, J., & Torres, R. (2019). School segregation across the world: Has any progress been made in reducing the separation of the rich from the poor? Journal of Economic Inequality, 18(2), 157–179
Hopfenbeck, T. N., Lenkeit, J., El Masri, Y., Cantrell, K., Ryan, J., & Baird, J. A. (2018). Lessons learned from PISA: A systematic review of peerreviewed articles on the programme for international student assessment. Scandinavian Journal of Educational Research, 62(3), 333–353
Lee, S. S., & Von Davier, M. (2020). Improving measurement properties of the PISA home possessions scale through partial invariance modeling 1. Psychological Test and Assessment Modeling, 62(1), 55–83
Mang, J., Küchenhoff, H., Meinck, S., & Prenzel, M. (2021). Sampling weights in multilevel modelling: An investigation using PISA sampling structures. Large-Scale Assessments in Education, 9(1), 1–39
Mayer, S. E. (1997). What money can’t buy: Family income and children’s life chances. Harvard University Press.
McKenzie, D. J. (2005). Measuring inequality with asset indicators. Journal of Population Economics, 18, 229–260
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Princeton.
Nygård, F., & Sandström, A. (1985). The estimation of the Gini and the entropy inequality parameters in finite populations. Journal of Official Statistics, 1(4), 399–426
OECD. (2009). PISA data analysis manual. (2nd ed.). OECD.
OECD. (2016). PISA 2015 results (volume I): Excellence and equity in education. OECD.
OECD. (2017). PISA 2015 technical report. OECD Publishing.
OECD. (2018). Equity in education: Breaking down barriers to social mobility. OECD Publishing. https://doi.org/10.1787/9789264073234-en
Pokropek, A. (2015). Phantom effects in multilevel compositional analysis: Problems and solutions. Sociological Methods & Research, 44(4), 677–705
R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society. Series A: Statistics in Society, 169(4), 805–827
Rasch, G. (1960). Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests. Danmarks Paedagogiske Institut.
Robitzsch, A. & Oberwimmer, K. (2015). BIFIEsurvey: Tools for survey statistics in educational assessment. R package version, pp. 1–2.
Rubin, D. (1987). Multiple imputation for nonresponse in sample surveys. Wiley.
Rutkowski, D., & Rutkowski, L. (2013). Measuring socioeconomic background in PISA: One size might not fit all. Research in Comparative and International Education, 8(3), 259–278
Rutkowski, D., Rutkowski, L., Wild, J., & Burroughs, N. (2018). Poverty and educational achievement in the US: A lessbiased estimate using PISA 2012 data. Journal of Children and Poverty, 24(1), 47–67
Rutkowski, L. (2011). The impact of missing background data on subpopulation estimation. Journal of Educational Measurement, 48(3), 293–312. https://doi.org/10.1111/j.1745-3984.2011.00144.x
Rutkowski, L. (2014). Applied measurement in education sensitivity of achievement estimation to conditioning model misclassification. Applied Measurement in Education, 27(2), 115–132
Rutkowski, L., Gonzalez, E., Joncas, M., & von Davier, M. (2010). International largescale assessment data: Issues in secondary analysis and reporting. Educational Researcher, 39(2), 142–151. https://doi.org/10.3102/0013189X10363170
Rutkowski, L., & Rutkowski, D. (2010). Getting it ‘better’: The importance of improving background questionnaires in international largescale assessment. Journal of Curriculum Studies, 42(3), 411–430
Rutkowski, L., & Zhou, Y. (2015). The impact of missing and errorprone auxiliary information on sparsematrix subpopulation parameter estimates. Methodology, 11(3), 89–99. https://doi.org/10.1027/16142241/a000095
Sandoval-Hernandez, A., Rutkowski, D., Matta, T., & Miranda, D. (2019). Back to the drawing board: Can we compare background scales? Revista de Educación, 383, 37–61
Schneider, S. M. (2016). Income inequality and subjective wellbeing: Trends, challenges, and research directions. Journal of Happiness Studies, 17(4), 1719–1739
Thomas, V., Wang, Y., & Fan, X. (2001). Measuring education inequality: Gini coefficients of education. World Bank Publications.
Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54(3), 427–450
Wilkinson, R. G., & Pickett, K. E. (2006). Income inequality and population health: A review and explanation of the evidence. Social Science and Medicine, 62(7), 1768–1784
Zheng, B. (2008). Measuring inequality with ordinal data: A note. Research on Economic Inequality. https://doi.org/10.1016/S1049-2585(08)16008-2
Zheng, B. (2011). A new approach to measure socioeconomic inequality in health. Journal of Economic Inequality, 9(4), 555–577
Zheng, X., & Yang, J. S. (2016). Using sample weights in item response data analysis under complex sample designs. In Springer proceedings in mathematics and statistics (pp. 123–137). Springer. https://doi.org/10.1007/978-3-319-38759-8_10
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or notforprofit sectors.
Contributions
As the sole author, LS conducted the analysis and wrote the whole manuscript. The author read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
The author gave consent for this article’s publication.
Competing interests
No potential conflict of interest is reported by the author.
Appendices
Annexes
Annex 1
Lemma 1
\(I_{\varphi }\) satisfies the main properties of an inequality measure.

\(I_{\varphi }\) is continuous on the domain of distributions \(I\).

\(I_{\varphi }\) is invariant to permutations of the measure among students in the same population (anonymity invariance).

\(I_{\varphi } { }\) is invariant to the multiplication of each student's score by any positive constant. The inequality measure is, therefore, independent of the aggregate level of wealth (scale invariance).

\(I_{\varphi }\) remains invariant to the size of the population, and therefore, to the replication of observation of the original population (population independence).

Redistributing benefits from richer to poorer individuals (without individuals’ reranking) reduces \(I_{\varphi }\), as the standard deviation at the numerator decreases while the denominator remains unchanged (Pigou–Dalton transfer).

\(I_{\varphi }\) takes a value of zero when all individuals hold identical assets (normalisation).
Proof of Lemma 1
(Continuity) Let \(x_{1}\) and \(x_{2}\) denote two asset distributions with inequality values \(I_{\varphi } \left( {x_{1} } \right)\) and \(I_{\varphi } \left( {x_{2} } \right)\). If \(x_{1} \approx x_{2}\), then \(I_{\varphi } \left( {x_{1} } \right) \approx I_{\varphi } \left( {x_{2} } \right)\): small changes in the distribution produce small changes in the measure.
(Anonymity) Let \(x\) denote any distribution of assets with elements \(\left\{ {x_{1,} x_{2} , \ldots } \right\}\). As \(I_{\varphi } \left( x \right)\) depends only on the set \(\left\{ {x_{1,} x_{2} , \ldots } \right\}\), any permutation of elements of \(x\) does not produce changes in \(I_{\varphi } ,\) so \(I_{\varphi } \left( {P\left( x \right)} \right) = I_{\varphi } \left( x \right)\).
(Scale invariance) For any \(I_{\varphi } \left( x \right),\) multiplying a constant \(\gamma > 0\) to all elements of the set \(\left\{ {x_{1,} x_{2} , \ldots } \right\}\) produces \(I^{\prime}_{\varphi } \left( {x\gamma } \right) = I_{\varphi } \left( x \right)\).
(Population invariance) For any \(x,\) replicating the population would produce \(\xi^{\prime}_{{_{l} }} = \alpha_{l1} \varsigma_{l1} + \alpha^{\prime}_{l1} \varsigma_{l1} + \alpha_{l2} \varsigma_{l2} + \alpha^{\prime}_{l2} \varsigma_{l2} + \cdots + \alpha_{ln} \varsigma_{ln} + \alpha^{\prime}_{ln} \varsigma_{ln}\). Then \(I^{\prime}_{\varphi } = \frac{{\xi^{\prime}_{l} }}{{\xi^{\prime}_{p} }} = I_{\varphi } \left( {x \cup x} \right) = I_{\varphi } .\)
(Pigou–Dalton transfer property) Let \(\xi_{l}\) and \(\xi_{m}\) denote the wealth scores of individuals \(l\) and \(m\), where \(\xi_{l} > \xi_{m}\). Let \(\hat{\xi }_{l} = \xi_{l} - \delta\) and \(\hat{\xi }_{m} = \xi_{m} + \delta\), where \(\delta > 0\) is transferred from \(l\) to \(m\). Let \(I_{\varphi }\) and \(\widehat{{I_{\varphi } }}\) represent the initial and transformed inequality measures. As \(\sigma_{j} > \widehat{{\sigma_{j} }}\), then \(I_{\varphi } > \widehat{{I_{\varphi } }}\).
(Normalisation) For any \(x\) where \(x_{1} = x_{2} = \cdots\), \(\sigma \left( \xi \right) = 0\), and therefore \(I_{\varphi } = 0.\)
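The scale-invariance and Pigou–Dalton arguments above can also be checked numerically. The sketch below uses a simplified stand-in for \(I_{\varphi }\): the ratio of a group's standard deviation to a population-level denominator (the mean is used as the denominator purely for illustration, since the paper's \(\xi\) terms involve the IRT discrimination parameters):

```python
from statistics import mean, pstdev

def ineq(x):
    """Simplified stand-in for I_phi: within-group standard deviation
    scaled by a population-level denominator (the mean, for illustration)."""
    return pstdev(x) / mean(x)

x = [2.0, 4.0, 6.0, 8.0]

# Scale invariance: multiplying every score by gamma > 0 leaves it unchanged
gamma = 3.5
assert abs(ineq([gamma * v for v in x]) - ineq(x)) < 1e-9

# Pigou-Dalton: moving 1.0 from the richest to the poorest (no re-ranking)
# lowers the numerator's standard deviation while the mean is unchanged
transferred = [3.0, 4.0, 6.0, 7.0]
assert ineq(transferred) < ineq(x)

print("scale invariance and Pigou-Dalton hold numerically")
```

Both assertions pass because a positive rescaling multiplies numerator and denominator by the same factor, while a progressive transfer reduces dispersion without changing the total.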
This section shows that the proposed measure fulfils the main properties customarily deemed desirable for an inequality measure.
Annex 2
Cite this article
Sempé, L. School-level inequality measurement based categorical data: a novel approach applied to PISA. Large-scale Assess Educ 9, 9 (2021). https://doi.org/10.1186/s40536-021-00103-7
Keywords
 PISA
 Item Response Theory
 Inequality
 Ordinal data
 School inequality
 HOMEPOS