Assessing the evidence for the comparability of socioeconomic status between students with and without immigrant background in Norway and Sweden

The prerequisite for meaningful comparisons of educational inequality indicators across immigration status is the comparability of socioeconomic status (SES) measures. The Programme for International Student Assessment (PISA) uses its index of economic, social, and cultural status (ESCS) to provide insights into the problems of inequality across students' socioeconomic and immigration statuses. However, the lack of evidence regarding the comparability of the ESCS index or its components across students with and without immigrant background challenges the accuracy of empirical inferences. Our study sheds light on the comparability of the index of household possessions (HOMEPOS) across immigration status in Norway and Sweden, the two largest recipients of immigration flows among the Nordic countries. We tested the PISA 2018 HOMEPOS scale for overall measurement invariance and possible differential item functioning (DIF) across three student groups with first-generation, second-generation, or no immigrant background. Several HOMEPOS items exhibited DIF within these countries. Moreover, we examined how four strategies for dealing with DIF items may affect inferences regarding educational inequalities across immigration status. The strength of the HOMEPOS–achievement association was sensitive to the choice of approach for 15-year-old immigrant students, while it remained stable and moderate for native students. Our findings encourage researchers using the HOMEPOS scale to conduct invariance testing to avoid measurement bias and to provide robust evidence characterizing immigrant achievement gaps.

The HOMEPOS index, however, is a broad indicator of economic status (Hannum et al., 2017) and may not truly capture family wealth across students with and without immigrant background within participating countries. For example, HOMEPOS includes the item 'number of books at home', which was previously found to be potentially biased against immigrant student groups in another cross-country education survey, the Trends in International Mathematics and Science Study (TIMSS) 2003 (Hansson & Gustafsson, 2013). Multiple factors may affect the item response patterns on a household possessions scale, e.g., culture, urban or rural place of residence, and consumption preferences (Currie et al., 2008; May, 2006). Such differences in item ownership are unproblematic as long as they do not become systematic, i.e., as long as item ownership does not largely reflect a student's membership in an immigrant group rather than the family's actual socioeconomic status. The more HOMEPOS items exhibit such a trend, the less information on the actual variability in family wealth may be derived and the lower the explanatory power of HOMEPOS for an immigrant achievement gap may be.
We approached the problem by investigating the comparability of the HOMEPOS scale across native and non-native students in Sweden and Norway. Of the three ESCS indicators, the HOMEPOS index has the strongest predictive power for reading achievement (Lee et al., 2019). The measurement and scaling procedures of HOMEPOS are continuously updated (Avvisati, 2020), with researchers focusing on the cross-country and cross-cycle comparability of HOMEPOS in the last decade (e.g., Lee & von Davier, 2020; Pokropek et al., 2017; Rolfe, 2021; Rutkowski & Rutkowski, 2013). We go a step further in the comparability analysis and unravel the complexities of HOMEPOS at the item level to understand whether each of the 22 international items works equally well for native, first-generation, and second-generation immigrant students in Norway and Sweden. Our findings on the comparability of HOMEPOS items are then used to test how four approaches to handling non-comparable items (Cho et al., 2016; Liu & Rogers, 2021) influence the strength of the HOMEPOS–reading achievement relationship for the three student groups. This may guide future research in finding adequate SES measures to identify educational inequalities in diverse student subpopulations and circumvent measurement bias.

Immigrant achievement gap, PISA, and immigration trend
The seventh cycle of the Programme for International Student Assessment (PISA) provided insights into the problems of inequality in reading literacy across students' socioeconomic and immigration statuses. Denmark, Finland, Iceland, Norway, and Sweden were among the 11 participating countries for which this problem was most pronounced for immigrant students. In Norway and Sweden, the largest recipients of immigrant families in the Nordic region, the score-point differences in reading performance associated with immigrant background (after accounting for gender and students' and schools' socioeconomic profiles) were higher than the OECD average (Table 1; OECD, 2019a). Furthermore, an immigrant achievement gap persists in many OECD countries (Andon et al., 2014), with immigrant students' low academic achievement usually explained by the low SES of their foreign-born parents (Ammermüller, 2007; Marks, 2005; Shapira, 2012). Nevertheless, several studies highlight a weaker association between SES and achievement for students with immigrant background compared to non-immigrant students (Elmeroth, 2006; Kingdon & Cassen, 2010; Strand, 2014). The shortcomings of common SES indicators and their potential non-equivalence when capturing SES across the heterogeneous body of children and adolescents with and without immigrant background have been discussed elsewhere (Braveman et al., 2005; Fekjaer, 2007; Modood, 2012; Rothon, 2007). However, few studies have evaluated the measurement invariance of SES across immigration status (Hansson & Gustafsson, 2013; Lenkeit et al., 2015), with no study addressing this problem with the PISA data.
Many secondary analyses of the PISA data use the ESCS indicators as control or predictor variables to investigate factors associated with achievement gaps between immigrant and native students (e.g., Areepattamannil et al., 2013; Gramațki, 2017; Martin et al., 2012; Marx et al., 2012; Schnepf, 2007). Additionally, the ESCS index has been central to the construct of academic resilience (Agasisti et al., 2017; Cerna et al., 2021; Cheung et al., 2014; Gabrielli et al., 2021; OECD, 2018). Not knowing how equally well the ESCS index or its components capture the SES of native and non-native students may impair the validity of findings and the effectiveness of policy recommendations. For instance, the non-comparability of the HOMEPOS index, one of the three ESCS components, may render it unsuitable for locating and explaining immigrant achievement gaps within countries, potentially compromising a just distribution of educational resources among schools with larger and smaller shares of immigrant students or inhibiting appropriate school budget allocations that are driven by findings of educational inequalities across immigration status.
This study is thus relevant for the OECD countries due to the need to understand the causes behind a persistent immigrant achievement gap (Andon et al., 2014) and people's increasing global mobility (United Nations Educational, Scientific and Cultural Organization [UNESCO], 2018), which may challenge the validity of using identical SES measures across immigration status to capture educational inequalities. It is further relevant for the Nordic countries in the face of the refugee crisis of 2015, with Sweden and Norway having received the largest proportions of asylum seekers in the Nordic region (Byström & Frohnert, 2017; Hagelund, 2020). By the end of 2015, Sweden had registered approximately 163,000 refugees (Adan & Antara, 2018), whereas Norway had accepted about 31,000 (Parveen, 2020). These are substantial numbers considering that in 2015 Sweden's population was 9.8 million and Norway's 5.2 million. The latest refugee crisis caused by Russia's invasion of Ukraine, with 5 million Ukrainians registered in Europe at the time of writing (United Nations High Commissioner for Refugees [UNHCR], 2022), suggests that the comparability testing of SES indicators across immigration status should be of a systematic nature. Our study is, hence, an attempt to facilitate such investigations, which may in turn improve our understanding of the challenges and successes that immigrant students experience in schools.
Mittal et al. Large-scale Assessments in Education (2022) 10:13

Comparability of the PISA SES measures across immigration status
Meaningful group comparisons are prerequisites for the validity of findings in cross-cultural studies (e.g., Rutkowski & Svetina, 2014; Van de Vijver, 2018). Such comparisons are valid if sufficient evidence exists that a scale and its items operate in the same way across populations (Bauer, 2017). For example, if we are to compare students' SES across different immigration statuses, the first step is to test for the equivalence or measurement invariance (MI) of this construct. We want to make sure that students' item responses depend solely on their level of SES and not on the effects of the group they belong to. The MI of PISA's ESCS index is therefore of great interest because it is one of the student background characteristics used to derive estimates of student achievement (von Davier et al., 2009). Hence, a systematic lack of invariance of the ESCS index, its subscales, or items across, for instance, students with or without an immigrant background may bias the proficiency scores and thus the subsequent policy decisions. Furthermore, MI is a question of fairness and equity (Meredith, 1993). If the differences in the ESCS index, subscales, or items depend on certain student characteristics and not on differences in the students' level of SES, then the measure is biased against one group of students (Bauer, 2017; He & Van de Vijver, 2013). To the best of our knowledge, two empirical studies have investigated the invariance of SES measures across immigration status (Hansson & Gustafsson, 2013; Lenkeit et al., 2015). For instance, using TIMSS 2003 data, Hansson and Gustafsson (2013) operationalised SES by the mother's and father's educational level, the number of books at home, and the student's study aspirations. They concluded that the reflective latent variable SES had the same meaning across the eighth-grade students with Swedish and non-Swedish backgrounds.
Conversely, large group differences in the probability of endorsing the 'number of books at home' and 'mother's education' item categories indicated a potential bias against first- and second-generation immigrant students. The authors further suggested testing the comparability of family income and parental occupation 'to obtain a valid measurement model' of SES for the diverse groups of students (Hansson & Gustafsson, 2013, p. 163). The second known study used data from the Children of Immigrants Longitudinal Survey in Four European Countries (CILS4EU) in England and found that SES measures were 'not equivalent representations of family SES across different groups' with and without immigrant background (Lenkeit et al., 2015, p. 77). The authors advised researchers to exercise caution when comparing SES and its associations with educational attainment across immigration status.
Since PISA 2015, the scaling procedures for HOMEPOS, one of the three ESCS subscales, have partially addressed the problem of cross-cultural comparability (OECD, 2017). These procedures included country-by-language invariance testing for countries that administered the PISA test in more than two languages and in which the weighted sample size of each language group exceeded 300 (OECD, PISA 2018 Technical Report, Chapter 16, https://www.oecd.org/pisa/data/pisa2018technicalreport/). The invariance tests were conducted across language groups for the Norwegian student sample (language groups: Nynorsk and Bokmål) but not for the Swedish one. However, the two language groups in Norway do not represent the immigrant backgrounds in the Norwegian student population. Evidence on the comparability of the HOMEPOS index across native and non-native student subpopulations in Norway and Sweden is still lacking.

PISA's household possessions index and its comparability
The household possessions-based aspect of SES is a reliable data source on family wealth that is less prone to error due to high parent-student agreement and response rates (e.g., Andersen et al., 2008). This asset index captures "the presence or absence of various consumer durable goods and home construction features in their [students'] primary dwelling" (Traynor & Raykov, 2013, p. 664). This aspect may be particularly important for students with immigrant background since their parents' occupational status may not be indicative of their educational level in the country of origin (Lenkeit et al., 2015). Moreover, household possessions may capture contributions to the family income by older employed sons and daughters of, e.g., South Asian immigrant background (Basit, 1997). In PISA, the HOMEPOS index, one of the three indicators used in the computation of the ESCS index, represents this aspect of SES and is known to be a strong predictor of academic achievement (Lee et al., 2019). The measurement and scaling procedures of HOMEPOS have been continuously updated, and the scale has a low percentage of missing item responses compared to other ESCS components (Avvisati, 2020). Previously, researchers argued that individual SES components are stronger predictors of inequalities than composite SES indices (Watermann et al., 2016; White, 1982). As a subscale of ESCS, the HOMEPOS index may hence reveal the complex nature of equity mechanisms better than the composite ESCS index, which keeps the equity profiles of the countries implicit (Keskpaik & Rocher, 2011). Besides, HOMEPOS is the only ESCS scale that reflects trends within the social, technical, and economic contexts of the participating countries (OECD, 2017).
Despite that, PISA's challenge of keeping the HOMEPOS items comparable across countries remains, with several studies having established full or partial cross-cultural non-invariance of the construct (e.g., Lee & von Davier, 2020; Pokropek et al., 2017; Rutkowski & Rutkowski, 2013). This may happen because possessions do not have the same meaning across developed and developing countries (e.g., Kim et al., 2019). For example, Keskpaik and Rocher (2011) used PISA 2009 data and concluded that most HOMEPOS items were strong predictors of achievement for many non-OECD countries, whereas only two items regarding the availability of 'books of poetry' and 'classic literature' had a higher-than-average correlation with achievement across OECD countries. This finding points to possible cross-cultural differences or construct biases (e.g., He et al., 2019). Rapidly changing immigration patterns, with at least one out of five 15-year-old students in the OECD having an immigrant background (UNESCO, 2018), may further hamper the detection of inequalities across immigrant populations within OECD countries. Despite or possibly due to this trend, the household possessions scale may still have great potential to detect inequalities within and across diverse immigrant student sub-populations, compared to the indicators of occupational and educational status that may not be equivalent across immigration statuses (Lenkeit et al., 2015; Modood, 2012; Rothon, 2007). However, this potential can be tapped only when we learn to fully utilize the scale. By full utilization we mean examining the scale's comparability at the item level across the groups of interest prior to any further analysis. This is especially essential when the aim is to detect inequity or inequalities in schools.

The present study
The present study examines the comparability of the PISA 2018 HOMEPOS scale, an indicator of SES, across immigration status in Norway and Sweden. Specifically, we evaluate (a) the overall measurement invariance of the HOMEPOS scale; (b) the differential functioning of the HOMEPOS items; and (c) the relationship between HOMEPOS and reading achievement across immigration status. We further provide recommendations for the use of the HOMEPOS scale when comparisons across immigration status are of interest. Our study addresses the following research questions (RQs):

RQ1
To what extent does the measure of students' HOMEPOS demonstrate overall invariance across three student groups, namely native students and first- and second-generation immigrant students, in Norway and Sweden?

RQ2
To what extent do individual HOMEPOS items exhibit differential item functioning across these three student groups in Norway and Sweden?

RQ3
How do different strategies of handling non-comparable items affect the relationship between HOMEPOS and reading achievement across immigration status?

Addressing the last question, we examine the following strategies: (a) ignoring the existence of non-comparable items; (b) deleting non-comparable items; (c) deleting only non-uniform DIF items; and (d) accounting for non-comparable items in the HOMEPOS measurement model.

Sample
The present study draws on the PISA 2018 data from nationally representative samples of 15-year-old students in Norway and Sweden (see Table 2). PISA 2018 followed a two-stage stratified sampling design with a sample of 35 or 42 students per sampled school (OECD, PISA 2018 Technical Report, Chapter 16, https://www.oecd.org/pisa/data/pisa2018technicalreport/). Each country had a unique list of stratification variables that indicated school characteristics and was used to aggregate schools into mutually exclusive groups prior to school sampling (see Table 3). This information is essential for understanding differences in the mechanism of sampling students with an immigrant background. Given these unique country features, we report the within-country findings.

Household possessions scale
The HOMEPOS scale is one of the three components of the PISA 2018 ESCS index. It captures four aspects of family wealth: cultural possessions, home educational resources, information and communication technology (ICT) resources, and the number of books at home. The scale includes 22 indicators common across the participating countries and economies and up to three country-specific items. In Norway, three national indicators on the availability of tablet computers, smart telephones, and e-book readers were added; in Sweden, students indicated the availability of a piano, cleaning services, and an espresso machine (OECD, PISA 2018 Technical Report, Annex E, https://www.oecd.org/pisa/data/pisa2018technicalreport/). In our study, we used the 22 international indicators that the PISA team used to compute the HOMEPOS index. By using only the international indicators, we aimed to show the same phenomenon for the two countries, although we do not compare them directly. The corresponding internal consistencies were α = 0.76 for Norway and α = 0.75 for Sweden (OECD, PISA 2018 Technical Report, Chapter 16, https://www.oecd.org/pisa/data/pisa2018technicalreport/). Of the 22 items, 13 were scored dichotomously (0 = no, 1 = yes) and indicated the possession of 'a desk', 'a room of one's own', 'a quiet place to study', 'a computer one can use for school work', 'educational software', 'a link to the Internet', 'classic literature', 'books of poetry', 'works of art', 'books to help with school work', 'technical reference books', 'a dictionary', and 'books on art, music or design'. Eight polytomous items indicated the number of 'televisions', 'cars', 'rooms with a bath or shower', 'cell phones with Internet access', 'computers', 'tablet computers', 'e-book readers', and 'musical instruments' (0 = none, 1 = one, 2 = two, 3 = three or more).
The item 'books' had six categories: 0 to 10, 11 to 25, 26 to 100, 101 to 200, 201 to 500, and more than 500 books.
To illustrate the properties of the items composing the HOMEPOS scale, we provide item parameter estimates and response distributions for the three immigration status groups in Norway and Sweden in Additional file 1: Appendix A.

Immigration status
We used the index of immigrant background (IMMIG) provided in the PISA 2018 dataset to distinguish three groups of students within each country: (a) native students (i.e., students with at least one parent born in the country of assessment), (b) second-generation immigrant students (i.e., students born in the country of assessment with both parents born in another country), and (c) first-generation immigrant students (i.e., students who, like their parents, were born outside the country of assessment; see Table 2; OECD, PISA 2018 Technical Report, Chapter 16, https://www.oecd.org/pisa/data/pisa2018technicalreport/). In our analyses, we refer to these categories as 'native', '2ndGEN', and '1stGEN' students, respectively.
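The IMMIG classification logic described above can be sketched as follows (a minimal illustration in Python; the function and its boolean arguments are ours and are not variables in the PISA dataset):

```python
def immig_status(student_born_in_country: bool,
                 mother_born_in_country: bool,
                 father_born_in_country: bool) -> str:
    """Classify a student into the three IMMIG groups (illustrative sketch)."""
    if mother_born_in_country or father_born_in_country:
        return "native"   # at least one parent born in the country of assessment
    if student_born_in_country:
        return "2ndGEN"   # student born in the country, both parents born abroad
    return "1stGEN"       # student and both parents born outside the country
```

Note that, under this definition, a student born abroad still counts as 'native' if at least one parent was born in the country of assessment.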

Reading achievement
Reading literacy was the focal domain in PISA 2018 and was defined as 'understanding, using, evaluating, reflecting on and engaging with texts in order to achieve one's goals, to develop one's knowledge and potential and to participate in society' (OECD, 2019b, p. 28). This concept involves cognitive and metacognitive processes of navigating the plural realm of reading, effectively synthesising and integrating information from multiple sources, and being 'active, purposeful, and functional' in one's application of reading strategies in any given life scenario (see OECD, 2019b, p. 28). Three major components of reading literacy were further defined: texts (classified according to their source, structure, format, and type), cognitive processes (i.e., locating information, understanding, evaluating, reflecting, and reading fluently), and scenarios (see OECD, 2019b).
The computer-based reading literacy assessment contained 245 items (45 units) that were delivered to students in three adaptive stages. The response formats included selected-response, short-constructed, and open-response items. Each student responded to 33 to 40 items in 7 units within 60 min. Sixty-five reading fluency items were administered prior to the main test to better capture students' reading proficiency at the lower levels of achievement. The multistage adaptive testing design was a new feature used for the reading domain in PISA 2018. Test reliabilities were 0.94 for both Norway and Sweden. The proficiency distribution in reading literacy is represented by 10 plausible values that account for the measurement uncertainty and ensure reliable achievement estimates in the population. In our analyses, we used all 10 plausible values.

Testing for measurement invariance and differential item functioning
The scaling procedures for the HOMEPOS items were based on the two-parameter logistic model (2PLM) for dichotomously scored responses and the generalised partial credit model (GPCM) for polytomous responses (OECD, 2017). Both models belong to the item response theory tradition of estimating the item response probability as a nonlinear relationship between categorical item responses and the latent trait theta, with the probability bounded between 0 and 1 (De Ayala, 2009). The 2PLM describes the probability that a student v responds in category 1 (e.g., checking the specific home possession) to an item i as a function of the student's trait level θ_v, the item difficulty b_i, and the item discrimination a_i (with a scaling constant D = 1.7; e.g., Desjardins & Bulut, 2018):

P(X_vi = 1 | θ_v) = exp(D a_i (θ_v - b_i)) / [1 + exp(D a_i (θ_v - b_i))]

This model extends the popular Rasch 1PL model by relaxing the equality constraint on the item discriminations a_i, that is, allowing for item-specific relations between the item and the latent trait. In polytomously scored items, students can respond in several categories k = 0, ..., m_i. The GPCM describes the probability of responding in category k as a function of the student's trait level θ_v, the item difficulty b_i, the item discrimination a_i, and the item threshold parameters d_ij between categories (with a scaling constant D = 1.7 and a zero sum of all threshold parameters for each item; see the OECD's PISA 2018 Technical Report, Chapter 9, https://www.oecd.org/pisa/data/pisa2018technicalreport/):

P(X_vi = k | θ_v) = exp(Σ_{j=0}^{k} D a_i (θ_v - b_i + d_ij)) / Σ_{u=0}^{m_i} exp(Σ_{j=0}^{u} D a_i (θ_v - b_i + d_ij))

Similar to the 2PLM, this model allows for item-specific discriminations and is therefore more flexible than the PCM, in which these parameters are equal across items. In the present study, we adhered to the PISA procedure and implemented these models as reflective measurement models of HOMEPOS.
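As an illustration, the 2PLM and GPCM response probabilities can be computed as follows (a minimal sketch in Python with NumPy; the function names are ours, and the GPCM is written in the common parameterisation where the cumulative term for category 0 is zero, which yields the same probabilities as the summation starting at j = 0):

```python
import numpy as np

D = 1.7  # scaling constant used in the PISA item response models

def p_2pl(theta, a, b):
    """2PLM: probability that a student with trait theta endorses item i."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def p_gpcm(theta, a, b, d):
    """GPCM category probabilities for a polytomous item.

    d holds the threshold parameters between adjacent categories 1..m;
    category 0 serves as the reference. Returns an array of probabilities
    over categories 0..m that sums to 1.
    """
    steps = D * a * (theta - b + np.asarray(d, dtype=float))
    z = np.concatenate(([0.0], np.cumsum(steps)))  # cumulative logit per category
    num = np.exp(z - z.max())                      # shift by max for stability
    return num / num.sum()
```

For instance, with a_i = 1 and b_i = 0, a student at θ_v = 0 endorses a dichotomous item with probability 0.5; for a polytomous item with all thresholds at zero, all categories are equally likely at θ_v = b_i.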
We took two approaches to test the equivalence of the HOMEPOS scale across immigration status: MI testing via multigroup item response theory (MG-IRT) modelling and testing for item-specific differential item functioning (DIF; see Bauer, 2017; Millsap, 2011). MG-IRT invariance testing allows for testing the scale's overall comparability ('scale functioning') but has limited sensitivity for identifying non-invariant items; conversely, DIF testing identifies such non-invariant items (Bauer, 2017). Both approaches were implemented with the IRT treatment of the HOMEPOS scale in the framework of confirmatory factor analysis (CFA) using Mplus (Muthén & Muthén, 1998). Several researchers have compared the IRT- and CFA-based MI testing and DIF detection (Kim & Yoon, 2011; Stark et al., 2006) and proposed an integrated IRT- and CFA-based approach (see, for instance, Dimitrov, 2017). The respective Mplus input files are provided in Additional file 4: Appendix D.
Multigroup Item Response Theory Invariance Testing
We estimated and compared three MG-IRT invariance models: the configural, metric, and scalar invariance models (Millsap, 2011). The configural invariance model tests the cross-group equivalence of the factor structure, assuming the same number of factors and item-factor patterns while freely estimating all model parameters. The metric or weak invariance model constrains the item discriminations to be equal across groups. This model establishes that the relationships between the latent variable and the manifest item responses are the same. Deviations from metric invariance indicate the presence of non-uniform DIF items. The test for scalar or strong invariance is a prerequisite for factor mean comparisons across groups. It builds upon metric invariance and additionally constrains the item difficulties/thresholds to be equal across groups (Bialosiewicz et al., 2013). The absence of scalar invariance indicates the presence of uniform DIF items. As a final step, the metric and scalar invariance models are compared to the configural model via likelihood-ratio tests, differences in information criteria, or other fit indices to examine the extent to which the additional model constraints deteriorate the model fit (Putnick & Bornstein, 2016).
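The logic of comparing nested invariance models can be illustrated with a naive likelihood-ratio test (a sketch in Python with SciPy; the log-likelihood values and the number of constrained parameters are hypothetical, and note that with the MLR estimator a scaled, e.g., Satorra-Bentler-type, difference test is used in practice rather than this simple version):

```python
from scipy.stats import chi2

def lr_test(loglik_restricted, loglik_general, df_diff):
    """Naive likelihood-ratio test between nested invariance models.

    loglik_general:    log-likelihood of the less constrained model
                       (e.g., the configural model)
    loglik_restricted: log-likelihood of the more constrained model
                       (e.g., the metric or scalar model)
    df_diff:           number of parameters fixed by the added constraints
    """
    lr = 2.0 * (loglik_general - loglik_restricted)
    p = chi2.sf(lr, df_diff)  # upper-tail chi-square probability
    return lr, p

# Hypothetical log-likelihoods and df for illustration only
lr, p = lr_test(loglik_restricted=-15180.4, loglik_general=-15162.1, df_diff=21)
```

A small p-value indicates that the constrained (e.g., metric) model fits significantly worse than the configural model, i.e., that the invariance constraints are not tenable.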

Multiple-Indicators-Multiple-Causes Differential Item Functioning Testing
With their tests of global model fit, multi-group models have low sensitivity to detect item-specific DIF across groups (Bauer, 2017). DIF occurs when the probability of endorsing an item varies for respondents with the same amount of the latent trait depending on the group to which they belong (Stark et al., 2006). Two types of measurement non-invariance can be identified at the item level: uniform and non-uniform DIF (De Ayala, 2009). Uniform DIF is associated with group differences in item difficulties/thresholds (Stark et al., 2006). It occurs when the probability of answering an item correctly or selecting a higher response category differs for one subgroup over the entire range of the latent trait (Fig. 1a; Woods, 2009). Non-uniform DIF is associated with situations in which item discriminations (factor loadings) and possibly item difficulties differ between groups. With regard to the HOMEPOS scale, this means that, for instance, the probability of endorsing the item 'books of poetry' may be equal across the subgroups at a HOMEPOS score of '0' on the latent continuum but may be systematically higher or lower for one subgroup at a HOMEPOS score of '1' (Fig. 1b). Hence, an expected item response is a function of both group membership and the level of the HOMEPOS latent trait.
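The distinction between uniform and non-uniform DIF can be made concrete by comparing item characteristic curves for a reference and a focal group (a sketch in Python with NumPy; all item parameters are invented for illustration):

```python
import numpy as np

D = 1.7  # scaling constant, as in the PISA 2PLM

def p2pl(theta, a, b):
    """2PL endorsement probability for trait theta."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

theta = np.linspace(-3.0, 3.0, 121)

# Uniform DIF: the groups differ only in item difficulty, so the focal
# group's endorsement probability is lower across the whole trait range.
gap_uniform = p2pl(theta, a=1.0, b=0.0) - p2pl(theta, a=1.0, b=0.5)

# Non-uniform DIF: the groups also differ in item discrimination, so the
# item characteristic curves cross and the gap changes sign along theta.
gap_nonuniform = p2pl(theta, a=1.5, b=0.0) - p2pl(theta, a=0.7, b=0.0)
```

In the first case the probability gap keeps the same sign everywhere (one group is favoured at every trait level); in the second the gap is negative at low trait levels and positive at high ones, which is why non-uniform DIF compromises group comparisons of the trait-item relationship itself.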
In the present study, we tested for uniform and non-uniform DIF via Multiple-Indicators-Multiple-Causes (MIMIC)-DIF modelling (e.g., Chun et al., 2016; Woods et al., 2009). MIMIC-DIF models introduce the grouping variable as a covariate in the measurement model (Bauer, 2017). To test for DIF with the MIMIC approach, we implemented the constrained baseline method. This method begins by estimating a baseline model in which the two dummy-coded grouping variables, 2ndGEN and 1stGEN, are related only to the latent variable HOMEPOS; all other possible effects on the items are constrained to 0 (Chun et al., 2016; see Fig. 1a). The following steps include tests for uniform and non-uniform DIF. To detect uniform DIF, the baseline model is extended by two paths connecting the grouping variables with an individual item (γ_b1i and γ_b2i for an item i; see Fig. 1b). If the model fit improves significantly relative to the baseline, the item is flagged with uniform DIF. This procedure is then repeated for all other items. To test for non-uniform DIF, we added two variables that represented the interactions between the latent variable HOMEPOS and the two grouping variables (specified via the 'XWITH' command in Mplus; see Fig. 1c). In the subsequent testing, one item at a time was regressed on both the grouping and interaction variables, with the latter reflecting the moderating effect of group membership on the latent variable HOMEPOS (paths γ_a1i and γ_a2i for an item i; see Chun et al., 2016). We compared models assuming non-uniform DIF to the corresponding uniform DIF models to detect potential between-group differences in item discriminations in addition to item difficulties. The constrained baseline approach implements an all-other-item method in which all items except the one studied are constrained to have equal parameters across groups and are assumed to be DIF-free (Wang et al., 2009). Given that this approach involves multiple significance tests, we adjusted the p-values using the Benjamini-Hochberg procedure at the 1% significance level. As part of our sensitivity analyses, we present the results obtained from an alternative method, the sequential-free baseline method, in Additional file 3: Appendix C (for details, please refer to Chun et al., 2016).

Fig. 1 Constrained baseline approach for DIF detection: a the baseline model with two covariates, 2ndGEN and 1stGEN (second-generation and first-generation immigrant student group variables); b the augmented model to test for uniform (threshold) DIF in the HOMEPOS items; c the augmented model to test for non-uniform (loading and threshold) DIF with two additional variables "HOMEPOS × 2ndGEN" and "HOMEPOS × 1stGEN" that represent the interactions between the latent variable HOMEPOS and the two covariates
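The Benjamini-Hochberg adjustment applied to the item-level DIF tests can be sketched as follows (Python with NumPy; the p-values are invented for illustration, and the function is a simple implementation of the step-up procedure at a 1% level):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.01):
    """Return a boolean mask of tests flagged by the BH step-up procedure.

    Sort the m p-values, find the largest rank k with p_(k) <= (k/m)*alpha,
    and flag every test whose p-value is at or below that cutoff.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds
    if not below.any():
        return np.zeros(m, dtype=bool)        # nothing survives the correction
    k = np.max(np.nonzero(below)[0])          # largest rank satisfying the bound
    cutoff = p[order][k]
    return p <= cutoff

# Hypothetical p-values from 22 item-level DIF tests (one per HOMEPOS item)
pvals = [0.0001, 0.0004, 0.002, 0.008, 0.03, 0.2] + [0.5] * 16
flags = benjamini_hochberg(pvals, alpha=0.01)
```

With these invented p-values, only the two smallest survive the correction, whereas a naive per-test comparison against 0.01 would flag four items; this is the multiplicity control referred to above.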

Quantifying the household possessions-achievement relation via different strategies
To address the problem of DIF items in the HOMEPOS measurement model, we examined how four different approaches to treating DIF items affected the relation between HOMEPOS and reading achievement across immigration status (Fig. 2): Ignoring DIF In this approach, the latent variable HOMEPOS was represented by all items, irrespective of the evidence on DIF. Ignoring DIF can be a feasible approach when the interpretation is made at the population level (Cho et al., 2016). However, the parameter estimates may not be accurate if many items show DIF (Liu & Rogers, 2021).
Deleting DIF items In this approach, the latent variable HOMEPOS is represented only by items that exhibited neither uniform nor non-uniform DIF. In a recent simulation study by Liu and Rogers (2021), this strategy resulted in the largest average standard error and performed the worst under most conditions. Deleting DIF items may also reduce scale reliability and content validity due to the loss of information (Liu & Rogers, 2021).
Deleting non-uniform DIF items In this approach, only non-uniform DIF items are deleted from the HOMEPOS scale. This ensures the validity of group comparisons of the HOMEPOS-reading achievement relation, because non-uniform DIF items have different item discriminations (factor loadings) across groups.
Accounting for DIF In this approach, we accounted for uniform and non-uniform DIF items by allowing the corresponding item parameters to vary between the reference and focal groups, with the other items constrained to be equal across groups.
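The four strategies differ only in which items enter the HOMEPOS measurement model and which parameters are freed. As a minimal sketch (using hypothetical item names, not the actual PISA items), the item sets for each strategy could be derived as:

```python
# Hypothetical HOMEPOS item names; the actual flagged items are listed in
# Appendix B of the study.
all_items = {"desk", "own_room", "books", "cars", "dictionary", "ebook_reader"}
uniform_dif = {"books", "cars", "dictionary"}
nonuniform_dif = {"ebook_reader"}

strategies = {
    "ignore_dif": all_items,
    "delete_all_dif": all_items - uniform_dif - nonuniform_dif,
    "delete_nonuniform_dif": all_items - nonuniform_dif,
    # "account_for_dif" keeps all items but frees the flagged items'
    # parameters across groups instead of dropping them:
    "account_for_dif": all_items,
}
```

The first and fourth strategies share the same item set; they differ in whether the flagged items' thresholds and loadings are constrained equal across groups or estimated freely.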

Analytic setup
The PISA 2018 data have a clustered structure with students nested in schools, which may have been purposefully over- or under-sampled in a specific region and may vary in size and non-response rates. This leads to unequal selection probabilities of students. To minimise this potential source of bias, we incorporated the final student weight (W_FSTUWT) in our analyses (OECD, 2017).
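To illustrate why the weight matters, a survey-weighted statistic reweights each student's contribution by the final student weight; a minimal sketch (illustrative only, not the estimation routine used in the study):

```python
def weighted_mean(values, weights):
    """Survey-weighted mean; with PISA data, weights would be W_FSTUWT."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

Students from over-sampled strata receive smaller weights, so the weighted estimate approximates the population value rather than the sample composition.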
Addressing RQ3, we included the 10 plausible values by estimating each model with achievement 10 times and combining the resultant model parameters via Rubin's combination rules. This procedure is available in Mplus via the TYPE = IMPUTATION command. We performed all analyses using Mplus 8.5 (see Additional file 4: Appendix D for the inputs). All models were based on maximum likelihood estimation with robust standard errors (the MLR estimator) with a built-in expectation-maximization algorithm to handle missing data.
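Rubin's combination rules (as applied by Mplus under TYPE = IMPUTATION) pool the point estimates and sampling variances obtained from the runs over the plausible values; a minimal sketch following the standard multiple-imputation formulas:

```python
def pool_estimates(estimates, variances):
    """Pool a parameter estimated once per plausible value (Rubin's rules).

    estimates: point estimates from the M runs
    variances: squared standard errors (within-imputation variances)
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = sum(estimates) / m                        # pooled point estimate
    within = sum(variances) / m                        # within-imputation variance
    between = sum((e - pooled) ** 2 for e in estimates) / (m - 1)
    total_var = within + (1 + 1 / m) * between         # total sampling variance
    return pooled, total_var ** 0.5
```

The between-imputation component inflates the standard error to reflect the measurement uncertainty carried by the plausible values.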

Testing for the measurement invariance of the overall scale (RQ1)
Prior to testing for the invariance of the overall HOMEPOS scale, we fit the single-factor GPCM for HOMEPOS to the total student sample and to each of the three subsamples in each country. Next, we estimated the configural invariance models for each country as baseline models (see Tables 4 and 5).
Further constraining the item discriminations to be equal across groups in Norway (metric invariance) resulted in a significant loss of model fit, χ2(42) = 107.2, p < 0.001. In contrast, the BIC and aBIC information criteria indicated an improvement in model fit, while the AIC suggested a deteriorated fit (see Table 4). This potentially indicates partial metric invariance, which could be checked with further item-level DIF testing to identify items with non-uniform DIF. Similarly, the scalar invariance model showed a significantly deteriorated fit compared to the configural model, χ2(138) = 12,483.8, p < 0.001. All three information criteria also suggested a deteriorated model fit.

For the Swedish data, the metric invariance model fit significantly worse than the baseline model, χ2(42) = 284.6, p < 0.001. This result was supported by the AIC and aBIC and contradicted by the BIC, which indicated a minor improvement in model fit. Hence, the lack of metric invariance set the stage for further item-level DIF detection to flag items with non-uniform DIF. Analogous to the metric model, constraining the thresholds (scalar invariance) resulted in a substantial loss of model fit, χ2(138) = 12,568.9, p < 0.001. This time, all three information criteria indicated a deteriorated model fit compared to the configural model. Consequently, we proceeded with identifying potential DIF items in the scale.
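The chi-square difference tests reported above compare nested invariance models. A simplified sketch of such a test (ignoring the scaling correction that MLR chi-square differences strictly require), using the closed-form chi-square tail that holds for even degrees of freedom:

```python
import math

def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with even df, via the
    Poisson-tail identity: sum over i < df/2 of exp(-x/2) * (x/2)**i / i!."""
    term, total = 1.0, 0.0
    for i in range(df // 2):
        if i > 0:
            term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

def lr_test(chi2_restricted, chi2_free, df_diff):
    """Chi-square difference between two nested models and its p-value."""
    delta = chi2_restricted - chi2_free
    return delta, chi2_sf_even_df(delta, df_diff)
```

With the 42-df differences reported here (e.g., 107.2 and 284.6), the resulting p-values fall far below 0.001, matching the significant loss of fit under the metric constraints.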

Uniform DIF
As noted earlier, we compared the baseline model for uniform DIF detection to models with direct paths from the two grouping variables, 2ndGEN and 1stGEN, to one item at a time. Significant likelihood-ratio tests and uniform DIF effects for either focal group would point to significant between-group differences in item thresholds and hence the presence of uniform DIF. Negative uniform DIF effects on certain items indicate that the reference native group had a higher expected score on those items after controlling for the level of the HOMEPOS latent trait. Conversely, positive values indicate that the 2ndGEN group, the 1stGEN group, or both had a higher probability of endorsing the items flagged for significant uniform DIF effects. In both cases, the difference in item response probabilities is assumed to be constant over the entire latent continuum.
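To make the "constant over the entire latent continuum" point concrete, the sketch below uses a binary 2PL-style item (a simplification of the GPCM actually used), in which a uniform DIF effect gamma shifts the focal group's log-odds by the same amount at every level of the latent trait; all parameter values are hypothetical.

```python
import math

def endorse_prob(theta, a, b, gamma=0.0):
    """P(endorse) for a binary 2PL-style item; gamma is a uniform DIF
    effect shifting the threshold for a focal group."""
    return 1 / (1 + math.exp(-(a * (theta - b) + gamma)))

def logit(p):
    return math.log(p / (1 - p))

# Under uniform DIF, the focal-native gap in log-odds is identical at any theta:
gap_low = logit(endorse_prob(-2, 1.0, 0.0, gamma=0.5)) - logit(endorse_prob(-2, 1.0, 0.0))
gap_high = logit(endorse_prob(2, 1.0, 0.0, gamma=0.5)) - logit(endorse_prob(2, 1.0, 0.0))
```

Both gaps equal the DIF effect itself (0.5 on the log-odds scale), which is what makes uniform DIF a pure threshold difference.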
For the Norwegian data, 14 HOMEPOS items demonstrated uniform DIF (Additional file 2: Appendix B: Table B1). Seven of these items had a significant difference in item thresholds in favour of the reference group and seven in favour of the focal groups. Both 2ndGEN and 1stGEN reported the availability of 'books to help with school work', 'a dictionary', and 'e-book readers' at a significantly higher frequency than the reference group did. Having the same HOMEPOS score, first-generation immigrant students consistently reported more often than native students that they have 'classic literature', 'books of poetry', and 'books on art, music or design'. Furthermore, it was more likely for 2ndGEN students than native students to answer that they have their own desk. Conversely, the two focal groups with the same amount of the HOMEPOS latent trait as the reference group were less likely to endorse the items indicating the availability of 'a room of one's own' and the number of 'televisions', 'cars', 'musical instruments', and 'books' at home. In addition, the 1stGEN group had a significantly lower probability than the native group of endorsing the items regarding the number of 'rooms with a bath or shower' or 'tablet computers'.
In Sweden, 19 HOMEPOS items exhibited uniform DIF (Additional file 2: Appendix B: Table B3); 10 of these favoured the native group, and nine indicated that the focal groups had a significantly higher probability of endorsing the items after controlling for the HOMEPOS score. Both the 2ndGEN and 1stGEN groups reported the availability of 'books to help with school work', 'technical reference books', 'a dictionary', 'educational software', 'a desk', 'books of poetry', and 'e-book readers' at a significantly higher frequency. Additionally, the 1stGEN group had a significantly higher response probability for the 'books on art, music or design' item, and the 2ndGEN group had a consistently higher expected value on the item indicating the availability of 'a computer one can use for school work' than the native group did. The pattern of uniform DIF showed that the focal groups endorsed the items indicating educational resources and cultural possessions significantly more often than the reference group with the same level of the HOMEPOS trait did. In contrast, the native group had a significantly higher expected value on eight items indicating family wealth, one cultural possession item ('musical instruments'), and the number of books at home.

Non-uniform DIF
To test for non-uniform DIF, we compared the models with interaction effects to the corresponding uniform DIF models. Significant likelihood-ratio test statistics and interaction effects of HOMEPOS × 2ndGEN or HOMEPOS × 1stGEN would indicate the presence of non-uniform DIF. A positive interaction effect indicates that an item is less discriminating for the reference group. Different item discrimination parameters for each group imply that the between-group difference in endorsing an item is not constant over the latent continuum.

For the Norwegian data, only one item (i.e., the availability of classic literature at home) exhibited non-uniform DIF between the 1stGEN and native groups (Additional file 2: Appendix B: Table B2). A significant negative interaction effect indicated that first-generation immigrant students who were average on the HOMEPOS latent trait were more likely to endorse the item than native students were. For the Swedish data, eight items were flagged for differences in discrimination parameters (Additional file 2: Appendix B: Table B4), two of which (i.e., the number of 'televisions' and 'e-book readers') exhibited non-uniform DIF between the reference group and both focal groups. The other six items differed significantly in their ability to discriminate between the native and 1stGEN groups. First-generation immigrant students with an average level of the HOMEPOS latent trait were more likely to endorse the items on the availability of 'books of poetry', 'books to help with school work', and 'a dictionary'. Native students who were average on the HOMEPOS latent trait were more likely to select a higher category for the items regarding the number of 'televisions', 'cars', 'tablet computers', 'e-book readers', and 'books'.
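By contrast with uniform DIF, under non-uniform DIF the groups' discrimination parameters differ, so the between-group gap changes in size (and can change in sign) along the latent trait. Again a binary 2PL-style sketch with hypothetical parameter values, not the study's estimates:

```python
import math

def prob(theta, a, b):
    """Binary 2PL-style endorsement probability (illustrative only)."""
    return 1 / (1 + math.exp(-a * (theta - b)))

def logit(p):
    return math.log(p / (1 - p))

# Non-uniform DIF: the focal group discriminates at a = 1.2, the reference
# group at a = 0.8, with the same threshold. The log-odds gap between the
# groups now depends on where a student sits on the latent continuum:
gap_low = logit(prob(-2, 1.2, 0.0)) - logit(prob(-2, 0.8, 0.0))   # at theta = -2
gap_high = logit(prob(2, 1.2, 0.0)) - logit(prob(2, 0.8, 0.0))    # at theta = +2
```

Here the gap is negative at low trait levels and positive at high levels, which is exactly why a single threshold shift cannot absorb non-uniform DIF.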

Relations to reading achievement (RQ3)
To address RQ3, we investigated how the four approaches to handling DIF items influenced the strength of the relationship between reading achievement and HOMEPOS across immigration status. This influence was compared across groups within each approach and across approaches for each group separately (see Table 6; Fig. 3). We conducted pairwise significance testing using the slopes, standard errors, and sample sizes.
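A common way to carry out such pairwise comparisons is a z statistic for the difference between two independent regression slopes; the sketch below is an assumption about the form of the test, since the exact formula is not reported here.

```python
def slope_diff_z(b1, se1, b2, se2):
    """z statistic for the difference between two independent regression
    slopes, using their standard errors (illustrative sketch)."""
    return (b1 - b2) / (se1 ** 2 + se2 ** 2) ** 0.5
```

Values of |z| above roughly 1.96 would indicate a difference significant at the 5% level.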
In Norway, the second approach of deleting 14 DIF items stood out when compared both across the groups and across the approaches within one group. For example, compared to ignoring DIF in the regression analysis, deleting the 14 DIF items (see Fig. 4a; Additional file 2: Appendix B: Table B1 for the DIF item names) increased the strength of the relationship between HOMEPOS and reading achievement by 0.151 points for 2ndGEN and by 0.071 points for 1stGEN. None of the four DIF treatment approaches made a significant difference to the strength of the relationship in the native group.

For the Swedish data, the correlations were stronger for the native group than for the 2ndGEN and 1stGEN groups across the four approaches, except for the 1stGEN group in Approach 3. When the eight non-uniform DIF items were deleted (see Additional file 2: Appendix B: Table B4 for the DIF item names), the relationship between HOMEPOS and achievement for 1stGEN increased by 0.083 compared to the 'ignore DIF' approach. In Approach 2, only three comparable items remained to represent the HOMEPOS latent variable (StuPlace, ClassLit, ArtWorks; see Fig. 4b; Additional file 2: Appendix B: Table B3). This left the correlations largely unchanged for the two immigrant groups, whereas the correlation slightly increased for the native group, which differed significantly from the 1stGEN group within this approach.

Measuring household possessions across immigration status
Previous research focused on the comparability of the HOMEPOS scale across countries and cycles (e.g., Lee & von Davier, 2020; Pokropek et al., 2017). Our study took a step further and examined the (non-)comparability of the HOMEPOS scale and its consequences across immigration status within Norway and Sweden in three steps (see Additional file 4: Appendix D for Mplus inputs).
First, we examined the overall invariance of the HOMEPOS measurement model scaled according to the PISA procedure (OECD, 2017) and found no support for full metric invariance across immigration status within the Norwegian and Swedish PISA samples. Similar challenges were identified in earlier studies (e.g., Rutkowski & Rutkowski, 2013; Sandoval-Hernandez et al., 2019). This finding may imply (1) a potential difference in the sociocultural value of certain items for students with and without immigrant background (Brese & Mirazchiyski, 2013; Yang & Gustafsson, 2004); or (2) a systematic failure to capture actual differences in SES across the student groups. The latter means that item ownership systematically depends on culture, geography (May, 2006), i.e., the part of the country one lives in, or consumption preferences (Currie et al., 2008). The item response patterns will certainly vary due to these factors motivating the relevance of specific item ownership; however, this variation should reflect true variability in wealth rather than belonging to an immigrant or non-immigrant student group. In practice, full or partial metric non-invariance suggests that two students with the same actual level of SES but different immigration statuses will have different SES scores, or vice versa (Lenkeit et al., 2015). This questions the valid use of the HOMEPOS scale scores for cross-immigrant group comparisons. Additionally, the lack of invariance for HOMEPOS may constrain meaningful comparisons across immigration status with the ESCS index, which comprises three indicators, namely, household possessions, highest parental education, and occupation. In PISA 2018, the ESCS index was constructed as the arithmetic mean of these three indicators, which were given equal arbitrary factor loadings (see the OECD's PISA 2018 Technical Report, Chapter 16, https://www.oecd.org/pisa/data/pisa2018technicalreport/). At this step, we found no sufficient evidence for the comparability of HOMEPOS. Hence, transferring the inferences drawn from the HOMEPOS scale to subgroups of students with different immigration statuses can create bias and misinform policymaking (Hansson & Gustafsson, 2013; Lenkeit et al., 2015).

Table 6 Regression coefficients reflecting the relationship between HOMEPOS and reading achievement across approaches and groups. # The regression coefficient of the 1stGEN group was found to be significantly different from that of the native group within this approach (p < .05)

Fig. 4 Quantifying the HOMEPOS-reading achievement relationship after deleting non-comparable items: a the HOMEPOS measurement model in the Norwegian sample is represented by comparable items (items that do not exhibit uniform or non-uniform DIF); b the HOMEPOS measurement model in the Swedish sample is represented by comparable items (items that do not exhibit uniform or non-uniform DIF)

For the second RQ, we identified items that functioned differently across immigration status and found several items exhibiting DIF. However, the findings varied in terms of the number and type of non-comparable items (i.e., uniform and non-uniform DIF) in the Norwegian and Swedish samples, the student group (i.e., certain items exhibited DIF only for the first- or second-generation immigrant students), and the relation of an item to the HOMEPOS scale (i.e., which group the item was biased against) (for details, see Additional file 2: Appendix B). A possible explanation for our finding may be the variation in the ethnicities that the second- and first-generation immigrant groups belong to in Norway and Sweden (Fekjaer, 2007; Heath et al., 2008; Lundahl & Lindblad, 2018).
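Under the equal-loading construction described above, a student's ESCS reduces to a simple average of the three standardised components; a minimal sketch (the operational PISA procedure also handles missing components, which is omitted here):

```python
def escs(homepos, parental_education, parental_occupation):
    """PISA 2018-style ESCS as the arithmetic mean of the three
    standardised component indicators with equal weights (simplified)."""
    return (homepos + parental_education + parental_occupation) / 3
```

Because each component enters with weight one third, any bias in HOMEPOS propagates directly, and undiluted by model-based weighting, into the composite ESCS score.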
Furthermore, non-uniform DIF was mainly observed between the first-generation immigrant and native students, although to a varying degree in the two countries (Additional file 2: Appendix B). Since a potentially greater assimilation level is observed among second-generation immigrant students (Alba et al., 2011; Drouhot & Nee, 2019; Heath et al., 2008; Hermansen, 2016; Jonsson & Rudolphi, 2011), our finding may imply that possessions hold a more equivalent value and relevance to the household circumstances of native and second-generation immigrant students than to those of the first-generation immigrant group. Hence, treating these two immigrant groups as one against the native group when comparing educational inequalities is of questionable value. Overall, we found a tendency for wealth possession items (e.g., the number of televisions, cars, and bathrooms) and the number of books at home to be biased against students with immigrant background. Previously, Rutkowski and Rutkowski (2018) found the PISA 2012 wealth possessions scale to be non-comparable overall across the Nordic countries and reported low parent-student agreement on the number of books at home. Besides, bringing books from the country of origin may be challenging and impractical, even if they were collected over generations (Elmeroth, 2006; Hansson & Gustafsson, 2013; Lenkeit et al., 2015). Conversely, immigrant students were more likely to endorse e-book readers and home educational resources regardless of their HOMEPOS level. Since immigrant parents commonly have high aspirations for their children (e.g., Basit, 2012; Drouhot & Nee, 2019; Fekjaer & Leirvik, 2011; Lauglo, 1999), their priority may be mobilizing capital to provide a study-motivating environment (Modood, 2005).
Finally, we examined how four approaches to adjusting for DIF items influenced the strength of the HOMEPOS-reading achievement relationship to provide recommendations for the use of the scale. None of the approaches had any effect on the HOMEPOS-achievement association for the native students in both countries or for the second-generation immigrant students in Sweden. The HOMEPOS-achievement relationship remained moderate and stable even after deleting as many as 14 and 19 non-comparable items in Norway and Sweden, respectively. Conversely, two approaches for deleting non-comparable items considerably strengthened the HOMEPOS-achievement relationship for the two immigrant student groups in Norway (after deleting all DIF items) and for the first-generation immigrant students in Sweden (after deleting non-uniform DIF items; see Figs. 3 and 4 and Table 6; see Additional file 2: Appendix B for the full list of uniform and non-uniform DIF items).
Several practical implications for the use of the HOMEPOS scale arise from our findings. First, the non-comparability of multiple items limits the inferences we may draw about immigrant and non-immigrant student subpopulations with regard to their success or failure in schools. This necessitates invariance testing of the HOMEPOS measurement model to ensure that it reflects true variability in family wealth across all three student groups. The second implication concerns the deletion of non-comparable items, which did not affect the strength of the HOMEPOS-achievement association for native students in both countries or for 2ndGEN students in Sweden. Household possessions indices usually have strong predictive power for academic achievement (Hannum et al., 2017; Lee et al., 2019). Hence, two conclusions may arise from our finding: (1) the HOMEPOS scale truly has a lower explanatory power for the achievement of the groups specified above; or (2) the deleted DIF items are potentially non-effective for capturing the SES of those groups. The latter is a common problem among higher-income countries (Avvisati, 2020), or countries with higher levels of wealth equality, since it is difficult to develop items that adequately discriminate among groups with different SES levels (Traynor & Raykov, 2013). Further analysis of the HOMEPOS item properties may give insight into how well each item discriminates among advantaged and disadvantaged students across immigration status. Third, ignoring non-comparable items by using the HOMEPOS index potentially masks the high importance of SES for immigrant student achievement. Certain items (e.g., wealth possessions) may be negatively associated with reading achievement (Brese & Mirazchiyski, 2013; Traynor & Raykov, 2013), hence concealing SES effects.
Fourth, the approach of accounting for DIF items, which is usually preferred to deleting or ignoring DIF (Cho et al., 2016; Liu & Rogers, 2021), had no effect on the strength of the HOMEPOS-achievement relationship for any group. The effectiveness of this approach is thus questionable. Fifth, we suggest caution in using the ESCS index to capture SES or to interpret educational inequalities across immigration status, since HOMEPOS may compromise its adequate functioning. We further advise invariance testing for parental occupational status and educational level, since several researchers have indicated potential problems with the equivalence of these socioeconomic status indicators (Lenkeit et al., 2015; Modood, 2005; Rothon, 2007). To conclude, several studies have questioned the assumption that all items function in the same way across different countries or groups (Rutkowski & Rutkowski, 2018), introducing new methods, such as partial invariance constraints, to improve cross-country comparability (Lee & von Davier, 2020). Our study, however, took a more conservative approach and illustrated that, even after accounting for non-comparable items, we risk misinterpreting the SES-achievement relationship for immigrant student groups.

Limitations
Our study has some limitations that suggest future research directions. First, PISA's IMMIG index does not allow us to generalise our findings to specific ethnic groups due to the vague distinction between the native and second-generation immigrant categories (Basarkod et al., 2022), which may have assigned students with one or two parents of the same ethnicity born outside the country of assessment to different categories.