An IERI – International Educational Research Institute Journal

The potential of international large-scale assessments for meta-analyses in education


Meta-analyses and international large-scale assessments (ILSA) are key sources for informing educational policy, research, and practice. While many critical research questions could be addressed by drawing evidence from both of these sources, meta-analysts seldom integrate ILSAs, and current integration practices lack methodological guidance. The aim of this methodological review is therefore to synthesize and illustrate the principles and practices of including ILSA data in meta-analyses. Specifically, we (a) review four ILSA data inclusion approaches (analytic steps, potential, challenges); (b) examine whether and how existing meta-analyses included ILSA data; and (c) provide a hands-on illustrative example of how to implement the four approaches. Seeing the need for meta-analyses on educational inequalities, we situated the review and illustration in the context of gender differences and socioeconomic gaps in student achievement. Ultimately, we outline the steps meta-analysts could take to utilize the potential and address the challenges of ILSA data for meta-analyses in education.

Evidence-based decision-making is key to educational policy and practice. To facilitate this, researchers synthesize the body of evidence on, for instance, the effectiveness of educational programs, the factors related to desirable educational outcomes, and possible sources of variation or inequalities in education via meta-analyses (Hattie et al., 2014; Oh, 2020). These quantitative research syntheses must provide reliable, meaningful, and unbiased evidence so that valid inferences can be drawn by researchers, practitioners, and policymakers (Slavin, 2008). However, meta-analyses in education and other disciplines face several challenges compromising their validity (e.g., Ahn et al., 2012; Rios et al., 2020; Sharpe, 1997): small-sample primary studies (e.g., low power to detect practically relevant effect sizes, high uncertainty, risk of invalid generalizations to student populations), study characteristics that may affect the quality and magnitude of effects (e.g., convenience samples, lack of stratification, matching, or control groups), and insufficient psychometric quality of the outcome measures (e.g., low reliability, limited construct coverage)—just to name a few. International large-scale assessments (ILSAs), such as ICILS (International Computer and Information Literacy Study), TIMSS (Trends in International Mathematics and Science Study), and PISA (Programme for International Student Assessment), address many of these issues (Braun & Singer, 2019; Klieme, 2020; Rutkowski et al., 2010; Wagemaker, 2016).

Despite this potential, it is not a common practice to include ILSA data in meta-analyses on key educational research questions. For instance, in their meta-analyses of the relation between socioeconomic status and student achievement, Kim et al. (2019) and Scherer and Siddiq (2019) included ILSA and non-ILSA data side-by-side, while Sirin (2005) and Harwell et al. (2017) based their meta-analyses solely on non-ILSA data, although ILSA data would have been eligible for inclusion. Similarly, some meta-analyses of the gender differences in student achievement included both ILSA and non-ILSA data (Lietz, 2006; Siddiq & Scherer, 2019), while others focused only on non-ILSA (Lindberg et al., 2010) or ILSA data (Else-Quest et al., 2010). The complexities of analyzing primary ILSA data and the resultant meta-analytic data may partly explain these varying practices. These complexities include the multi-stage cluster sampling designs that need to be represented when estimating effect sizes, the availability of multiple effect sizes per ILSA, ILSA cycle, or country, and the lack of analytic approaches guiding the integration of ILSAs in meta-analyses (e.g., Hedges, 2007; Rutkowski et al., 2010).

To this end, the inclusion of ILSA data in meta-analyses has faced two key challenges: varying inclusion practices, likely due to the lack of methodological guidance, and the complex structures of ILSA and meta-analytic data that demand non-standard effect size computation and advanced meta-analyses. Our methodological review addresses these challenges by (a) describing an analytic framework that comprises four inclusion approaches; (b) reviewing systematically whether and how existing meta-analyses included ILSA data; and (c) illustrating these approaches with an example meta-analysis. Drawing from the results of our review, we offer recommendations for researchers on how to include ILSA data in their meta-analyses to inform evidence-based practice and policymaking.

International large-scale assessments (ILSAs) informing meta-analyses in education

Purposes and contribution

Meta-analyses and ILSAs have similar purposes. Oh (2020) identified three evidence-based uses of meta-analyses: (a) informing the design of empirical studies; (b) informing the interpretation of the effect sizes resulting from primary studies by creating context and providing benchmarks; and (c) informing educational practice and the development of professional guidelines for research. Besides, meta-analyses have several theoretical uses, such as providing information about population effect sizes, quantifying heterogeneity, and identifying the extent to which sample, study, and measurement characteristics could explain this heterogeneity (Borenstein et al., 2009). Ultimately, meta-analyses are aimed at supporting research, practice, and policy in drawing robust conclusions about key educational issues and explaining how and why specific findings may fit together or deviate (Glass, 1976; Siddaway et al., 2019).

Similarly, ILSAs provide large-scale, representative, and international data to (a) increase the understanding of key factors influencing teaching and learning, including contextual factors; (b) identify key educational issues; (c) inform national strategies for monitoring and improvement, including evaluating the effectiveness of curricula, instruction, and policies; (d) contribute to the research community to facilitate educational evaluation and document progress in research; (e) create de facto benchmarking, providing context for small-scale research and tracing student achievement across nations and over time (Braun & Singer, 2019; Hopfenbeck et al., 2018; Wagemaker, 2016). The next sections demonstrate that ILSAs have a value in their own right for meta-analyses in education and how they may address some of the challenges meta-analyses are facing.

Potential, challenges, and limitations of including ILSA data in meta-analyses in education

Key educational issues and constructs

ILSAs contain rich indicators of educational achievement, oftentimes in several domains and sub-domains, motivational and affective constructs, background characteristics, and contextual factors, which are measured across the different levels of educational systems and over time. These indicators are documented transparently and allow researchers to assess and monitor key educational issues, such as equity and equality, trends and profiles of student achievement, and the link between school practices and educational outcomes (e.g., Klieme, 2020; Lenkeit et al., 2015). Despite the rich set of constructs and indicators, the feasibility and time constraints in which ILSAs operate allow for including only a selection of constructs, types of tasks, and scales (Gustafsson, 2018; Kuger & Klieme, 2016), so that the selection of educationally relevant constructs is by no means exhaustive. Hence, ILSA data do not automatically qualify for inclusion in every meta-analysis in education but need to undergo a rigorous eligibility check.

Country selection

ILSAs follow rigorous sampling designs with multiple stages of quality assurance (Musu et al., 2020; Wagemaker, 2020). While much emphasis has been placed on the random sampling within countries (e.g., random sampling of schools and students or teachers within schools in PISA and TALIS, respectively), the sampling of countries participating in ILSAs is not random. In fact, countries decide to participate in ILSAs and ILSA cycles and, essentially, self-select depending on their needs and capabilities. This self-selection has several consequences, such as the varying participation across ILSAs and ILSA cycles with countries remaining, dropping out, or joining in, and the possible underrepresentation of cultures or world regions (e.g., Rutkowski & Rutkowski, 2021). The varying participation of countries challenges the study of educational trends due to the lack of consistent longitudinal data at the level of countries (e.g., Lohmann et al., 2022).

Given that ILSAs include a broad range of countries, cultures, and educational systems, including ILSA data in meta-analyses can balance the representation of cultural and language groups; in fact, possible cultural and language bias may be reduced in meta-analyses (Morrison et al., 2012). While meta-analysts oftentimes exclude studies and reports that were not published in English, the information on the various ILSA samples, assessments, and results is made available in English, irrespective of the language of origin in the countries. For instance, in their meta-analysis of the relation between academic achievement and self-concept, Möller et al. (2020) included multiple PISA samples from around the world. Cultural balance was of particular importance in this study, due to the cultural differences in students’ self-concept.

Availability and comparability of ILSA data

Finally, we would like to highlight the availability of the primary data along with rich documentation as another strength of ILSAs—specifically, for the most part, ILSA data are freely available to meta-analysts through open-access platforms of the respective organizations (primarily the IEA and OECD). Given the availability of these so-called individual-participant data (IPD), meta-analysts may not have to rely on the reporting from secondary sources, but can extract or estimate the relevant effect sizes themselves (Riley et al., 2021). For instance, if researchers are interested in the achievement differences between private and public schools after controlling for schools’ socioeconomic composition and individual differences in socioeconomic status, they can specify and estimate a multilevel regression model with the variables of interest, utilizing multiple ILSA data sets. Across these data sets, the model generating the effect sizes is the same, and comparability of the type of effects is given. However, if the researchers extracted the achievement differences from secondary reports which were based on different multilevel regression models (e.g., with different predictors), the resultant effect sizes would no longer be comparable, and the validity of the meta-analytic results would be in question (Becker & Wu, 2007). This is a key issue in meta-analyses that are based on aggregated data (AD) and effect sizes generated from different analytic models (e.g., Polanin et al., 2020; Riley et al., 2021). In this sense, ILSA data allow meta-analysts to control the specification and estimation of the statistical models used to generate the effect sizes (Cheung & Jak, 2016).

Measurement invariance

Besides the comparability of the data setup, many ILSAs, ILSA cycles, and samples are based on the same or linked measures of constructs. However, although this design ensures some degree of comparability or, more precisely, a similar exposure to items and tasks, it does not ensure measurement invariance (the comparability of the measurement models underlying reflectively defined constructs) per se. Researchers and large-scale data analysts still have to provide evidence that the measurement models representing specific constructs are sufficiently invariant (van de Vijver et al., 2019). Moreover, as the range of participating countries and educational systems extends, population heterogeneity in ILSAs can become problematic, because deficits in invariance may undermine the comparability of measures (Rutkowski et al., 2019). This issue does not only concern the measurement of student achievement, which has received most of the attention in ILSA-related research, but also the measures taken via the accompanying questionnaires (Rutkowski & Rutkowski, 2018). For instance, in some ILSAs, the constructs measured via the background questionnaires are not fully aligned with the core achievement domains, so that obtaining evidence on convergent validity is hardly possible.

Correlational nature of the data

The correlational nature of the ILSA data, resulting from the cross-sectional study design, may be another issue that could exclude these data from meta-analyses in education, especially when effectiveness questions are addressed that require (quasi-)experimental designs (Klieme, 2013). In fact, ILSAs offer only limited opportunities to draw causal inferences (Rutkowski & Delandshere, 2016), and may inform meta-analysis primarily by group differences (e.g., gender differences in student achievement) or relations among constructs (e.g., relation between self-concept and student achievement). For instance, research questions on the effectiveness of instruction can hardly be addressed directly, and randomized-controlled trials would obviously be the gold standard to inform such questions. Given their design, ILSA data would not be eligible for inclusion in meta-analyses of the effectiveness of instruction. Yet, ILSAs could still provide information about the distribution of relevant variables and their relations to educational achievement reported (Braun & Singer, 2019; Klieme, 2020).

Complex survey designs and large samples

Another challenge associated with the use of ILSA data in meta-analysis is the extraction of the correct effect sizes. ILSA data follow a complex survey design with multiple stages of sampling that require advanced methods to estimate effects (Rust, 2014)—among others, the key elements include the multilevel data structure (e.g., students hierarchically nested in schools), the use of sampling weights (e.g., student- and school-level weights), the correct variance estimation (e.g., via jackknifing techniques and replicate weights), and the achievement estimation (e.g., via plausible value techniques). For instance, if meta-analysts are interested in the relation between measures of instructional quality in classrooms and student achievement, multilevel modeling is required to account for the nested structure of the primary ILSA data (“primary clustering”) and obtain the contextual effect. While these elements have been discussed and presented in the extant literature extensively (Rutkowski et al., 2010), we suspect that addressing them to extract the relevant effect sizes from secondary data analyses can pose barriers for meta-analysts. Associated with the complex survey design are the large sample sizes within ILSA data. Large-sample studies may well increase precision and reduce sampling error, yet may also influence the effect size estimate and its variance components substantially due to large weights in the meta-analytic data set (Turner et al., 2013).
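To make the plausible-value step more concrete, the following minimal sketch combines an effect size that was estimated once per plausible value, using Rubin's combination rules. It is an illustration under stated assumptions rather than the procedure of any particular ILSA: the function name `combine_plausible_values` and the example numbers are hypothetical, and the per-plausible-value sampling variances are assumed to have been obtained beforehand (e.g., from replicate weights).

```python
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Combine per-plausible-value results with Rubin's rules.

    estimates: one effect size estimate per plausible value.
    sampling_variances: the sampling variance of each estimate
        (assumed to come from a replicate-weight procedure).
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("Need at least two estimates and matching variances.")
    pooled = mean(estimates)
    within = mean(sampling_variances)   # average sampling variance
    between = variance(estimates)       # variance across plausible values (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical example: three plausible values for one country sample.
pooled, total = combine_plausible_values([0.10, 0.12, 0.14], [0.0004, 0.0004, 0.0004])
```

The square root of the total variance would serve as the standard error of the pooled effect size entering the meta-analysis.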

Complex meta-analytic data structures

Besides the primary clustering of the ILSA data, including them in meta-analyses can create a nested structure of the meta-analytic data (“secondary clustering”), with multiple effect sizes extracted from the ILSAs, ILSA cycles, or countries (e.g., Pigott & Polanin, 2020). Such structures can violate the independence assumption in meta-analysis and require meta-analysts to address them, for instance, via multilevel meta-analysis, robust variance estimation, or pooling approaches (Cheung, 2019; Pustejovsky & Tipton, 2021; Scammacca et al., 2014). Notice that the primary clustering represents a different analytic problem than the secondary clustering: While the former describes the structure of the primary study data with, for instance, students nested in classrooms, the latter describes the structure of the meta-analytic data with multiple effect sizes nested in, for instance, ILSAs. Addressing the secondary clustering still requires from meta-analysts the knowledge and skills to engage in advanced meta-analytic techniques.

The present study

Given the diversity of the ways in which meta-analysts include or ignore ILSA data and, at the same time, the current lack of guidelines informing this inclusion, our methodological review describes and illustrates the principles and practices of including ILSA data in meta-analyses. Specifically, we address the following three research questions:

  1. Which analytic approaches can meta-analysts take to include ILSA alongside non-ILSA data, and what are their advantages and disadvantages? (Inclusion approaches)

  2. To what extent have ILSA data been included in existing meta-analyses in education, and how? (Inclusion status)

  3. How can the different inclusion approaches be implemented in meta-analyses? (Inclusion implementation)

We constrained our review of existing meta-analyses and the presentation of an illustrative example to the context of equality in education, given the otherwise unmanageably large body of meta-analyses in education (Ahn et al., 2012) and given the need for meta-analyses of issues related to educational equality (Broer et al., 2019).

Approaches of including ILSA data in meta-analyses in education

The meta-analytic literature on multilevel meta-analysis (Fernández-Castilla et al., 2020), meta-analysis with individual participant data (Burke et al., 2017), and Bayesian meta-analysis (Röver, 2020) offers a plethora of approaches to synthesize effect sizes from small- and large-scale studies, with or without complex data structures, including one- and two-stage procedures. On the basis of these approaches and the knowledge gained from the systematic reviews addressing our first research question, we propose an analytic framework that contains four approaches to include ILSA data in existing meta-analyses at the level of effect sizes:

  1. Separate meta-analyses: ILSA and non-ILSA data are meta-analyzed separately.

  2. Indirect inclusion via Bayesian meta-analysis: In a first step, the multiple effect sizes per ILSA are meta-analyzed, yielding estimates of the weighted average effect size and heterogeneity. In a second step, one or more of these estimates inform the prior distribution of the weighted average effect size and the heterogeneity for the non-ILSA data.

  3. One-stage direct inclusion: ILSA and non-ILSA data (i.e., effect sizes) are included in a meta-analysis side-by-side and at the level of the effect sizes. For ILSA data, multiple effect sizes (e.g., for multiple countries or domains) are extracted.

  4. Two-stage direct inclusion: In the first stage, the multiple effect sizes per ILSA are meta-analyzed or aggregated following some aggregation rules (e.g., Borenstein et al., 2009). In the second stage, the resultant, aggregated effect sizes for each ILSA are included in the meta-analysis next to the non-ILSA data.

In the following, we review the analytic steps, advantages, and challenges associated with each of these four approaches (see also Table 1).

Table 1 Overview of the Different Inclusion Approaches

Separate meta-analyses

Separate meta-analyses of ILSA and non-ILSA data do not directly integrate the two data sources. Ultimately, they result in separate estimates of the weighted average effect sizes and variance components (Table 1), which could inform alternative approaches, such as the two-stage direct and indirect inclusion approaches, or serve the purpose of benchmarking (e.g., ILSA effect sizes as benchmarks for non-ILSA effect sizes, or vice versa). Nevertheless, if the same meta-analytic models are specified for the two data sources, direct comparisons of the overall effect sizes are possible utilizing mixed-effects models and Wald tests even under heteroscedasticity (Rubio-Aparicio et al., 2020). Meta-analysts can examine moderator effects separately and compare the results qualitatively. If comparisons of effect sizes are not the main focus, conducting separate meta-analyses further allows researchers to specify different meta-analytic models for the ILSA and non-ILSA data, addressing their individual complexities (e.g., non-nested structure of the non-ILSA data, nested structure of the ILSA data; Table 1).
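As a rough illustration of this approach, the sketch below pools each data source with a simple random-effects model and then compares the two pooled estimates with a Wald-type z test. It is a simplified stand-in, not the mixed-effects procedure discussed by Rubio-Aparicio et al. (2020): it assumes one independent effect size per study, uses the DerSimonian-Laird estimator for the between-study variance, and treats the two pooled estimates as independent. The function names are hypothetical.

```python
import math


def random_effects_dl(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("Need at least two effect sizes and matching variances.")
    w = [1 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]
    mu = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return mu, 1 / sum(w_star), tau2


def wald_z(mu_a, var_a, mu_b, var_b):
    """Two-sided Wald test for the difference between two pooled estimates."""
    z = (mu_a - mu_b) / math.sqrt(var_a + var_b)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p


# Hypothetical effect sizes: ILSA samples versus non-ILSA studies.
mu_ilsa, var_ilsa, _ = random_effects_dl([0.10, 0.14, 0.12], [0.001, 0.001, 0.001])
mu_non, var_non, _ = random_effects_dl([0.25, 0.05, 0.30], [0.02, 0.03, 0.02])
z, p = wald_z(mu_ilsa, var_ilsa, mu_non, var_non)
```

Because the ILSA effect sizes in such a sketch are treated as independent, any nesting within ILSAs or ILSA cycles would still need to be handled with the more advanced models described in the following sections.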

Indirect inclusion via Bayesian meta-analysis

If the primary interest of the meta-analysts lies in the meta-analysis of non-ILSA data, information from the meta-analysis of ILSA data could be incorporated indirectly via Bayesian meta-analysis. In this approach, the weighted average effect size and/or variance components derived from ILSA data can inform the distributions of the respective estimates for non-ILSA data (see Table 1 and Additional file 5: S5). While a general discussion of Bayesian meta-analysis is beyond the scope of this study, one key advantage lies in the possibilities for researchers to incorporate some prior knowledge in their meta-analysis, even when only few effect sizes are available (Röver, 2020). This inclusion approach is similar to Bayesian (historical) borrowing, in which prior information about distributions or effect sizes from previous ILSAs or ILSA cycles is used to inform the data analysis of new ILSAs, ILSA cycles, or other studies (Kaplan et al., 2023). However, specifying informative priors and random-effects models in the Bayesian framework requires some understanding of the possible parameter distributions and may thus not be easily accessible to meta-analysts. Moreover, the meta-analytic outcomes for the non-ILSA data may depend on the choice of priors, thus necessitating additional sensitivity analyses.
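The core idea of letting ILSA results inform the prior can be sketched with a deliberately simplified normal-normal model. The sketch assumes that the ILSA meta-analysis yields a normal prior for the average effect (mean and standard deviation) and, unlike a full Bayesian meta-analysis, it fixes the between-study variance `tau2` at a chosen value instead of placing a prior on it. The function name and all numbers are hypothetical.

```python
def posterior_mean_effect(effects, variances, prior_mean, prior_sd, tau2=0.0):
    """Conjugate normal update of the average effect for non-ILSA data.

    effects, variances: non-ILSA effect sizes and their sampling variances.
    prior_mean, prior_sd: normal prior for the average effect, e.g. taken
        from a separate meta-analysis of ILSA data (assumption).
    tau2: between-study variance, treated as known (simplifying assumption).
    Returns the posterior mean and posterior standard deviation.
    """
    prior_precision = 1 / prior_sd ** 2
    weights = [1 / (v + tau2) for v in variances]
    post_precision = prior_precision + sum(weights)
    weighted_sum = sum(w * y for w, y in zip(weights, effects))
    post_mean = (prior_precision * prior_mean + weighted_sum) / post_precision
    return post_mean, (1 / post_precision) ** 0.5


# Sensitivity check: the same non-ILSA data under a tight and a vague prior.
data = ([0.30, 0.22], [0.04, 0.05])
tight = posterior_mean_effect(*data, prior_mean=0.10, prior_sd=0.05, tau2=0.01)
vague = posterior_mean_effect(*data, prior_mean=0.10, prior_sd=1.00, tau2=0.01)
```

Comparing the tight and vague results gives a first impression of how strongly the conclusions for the non-ILSA data depend on the ILSA-informed prior.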

One-stage direct inclusion

The one-stage direct inclusion approach combines the ILSA and non-ILSA data directly at the level of effect sizes. For each ILSA study or wave (e.g., PISA 2006, PISA 2015), each country or cohort sample contributes an effect size (see Fig. 1a). This inclusion is comparable to the one-stage meta-analysis of individual participant data, in which multiple data sets are combined directly (Burke et al., 2017). If meta-analysts allow for including multiple countries or cohort samples, this direct inclusion ultimately results in a complex meta-analytic structure with multiple effect sizes per ILSA. Such a structure violates the basic assumption of the independence of effect sizes, because effect sizes from the same ILSA may be more homogeneous than effect sizes from different ILSAs (Borenstein et al., 2009). As a consequence, meta-analysts must determine the structure of the meta-analytic data set and choose among suitable approaches to estimate the overall effect sizes and/or moderation effects that represent this structure (Cheung, 2019). Figure 2 illustrates the possible structures meta-analysts may encounter in this situation: Given the availability of multiple effect sizes per ILSA, the “ideal” structure with only one effect size per ILSA no longer applies (Fig. 2a). Instead, a hierarchical structure with multiple effect sizes nested in ILSAs (Fig. 2b) or a non-hierarchical cross-classified structure with multiple effect sizes nested in ILSAs and countries (Fig. 2c) may better represent the meta-analytic data. The latter may be especially useful when including multiple ILSAs or ILSA cycles. If, however, only one ILSA or ILSA cycle is included, the country-specific effect sizes are considered independent, and the non-ILSA data contribute one effect per study, the structure may be simplified to Fig. 2a.

Fig. 1 Overview of the (a) one-stage and (b) two-stage direct inclusion approaches. Note. ILSA = International large-scale assessment, K = Number of effect sizes extracted from ILSAs, L = Number of effect sizes extracted from the non-ILSA studies

Fig. 2 Meta-analytic data structures: (a) Common two-level hierarchical structure; (b) Three-level hierarchical structure; and (c) Cross-classified non-hierarchical structure

Having identified the data structure, meta-analysts can then choose how to handle such dependencies (see Table 1). The described structures can be modelled explicitly via multilevel meta-analysis, a random-effects modeling approach that quantifies variation at the respective levels of analysis (e.g., within and between studies), or considered implicitly via robust variance estimation (RVE; e.g., Fernández-Castilla et al., 2020; Hedges et al., 2010). Given the variety of approaches to handling multiple effect sizes, meta-analysts may consider conducting sensitivity analyses, varying these approaches and examining the possible differences in the resultant estimates (Table 1). Later in the data-analytic process, the possible differences between the effect sizes extracted from ILSA and non-ILSA studies can be examined and the effects of including ILSA data quantified. While the direct inclusion may require advanced meta-analytic models, meta-analysts can obtain information on different variance components, examine moderator effects at different levels of analysis, and gain precision in the effect size and variance estimates due to the increased sample size (see Table 1).
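The implicit handling of dependencies via RVE can be illustrated for the simplest case, an intercept-only model. The sketch below computes a weighted average effect and a cluster-robust (sandwich) standard error in which residuals are summed within each cluster, here an ILSA or a non-ILSA study, before being squared. It is a basic version without the small-sample corrections that dedicated software typically applies, and the inverse-variance weights, function name, and data are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def rve_pooled_effect(effects, variances, clusters):
    """Weighted mean effect with a cluster-robust standard error.

    clusters: one label per effect size naming the ILSA or study it
        belongs to; effect sizes sharing a label may be dependent.
    """
    weights = [1 / v for v in variances]  # inverse-variance weights (assumption)
    sw = sum(weights)
    mu = sum(w * y for w, y in zip(weights, effects)) / sw
    # Sum the weighted residuals within each cluster before squaring,
    # so that within-cluster dependence is reflected in the variance.
    cluster_sums = defaultdict(float)
    for w, y, c in zip(weights, effects, clusters):
        cluster_sums[c] += w * (y - mu)
    robust_var = sum(s ** 2 for s in cluster_sums.values()) / sw ** 2
    return mu, math.sqrt(robust_var)


# Hypothetical data: two PISA country samples, one TIMSS sample, one small study.
mu, se = rve_pooled_effect(
    effects=[0.1, 0.3, 0.2, 0.4],
    variances=[0.01, 0.01, 0.01, 0.01],
    clusters=["PISA", "PISA", "TIMSS", "study_1"],
)
```

With only a handful of clusters, as in this toy example, uncorrected robust standard errors tend to be too small, which is one reason to run the sensitivity analyses mentioned above.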

In situations where ILSA studies provide the individual-participant data, and non-ILSA studies provide aggregated data, the possible differences in effect sizes between them may point to “availability bias”—a form of bias that occurs when the availability of IPD is associated with the quality of the primary study or its effect size (Riley et al., 2021). Although incorporating IPD from ILSAs can reduce publication bias due to the possible inclusion of unpublished data sets and studies, IPD may not be available for every primary study, for instance, due to issues related to data protection or accessibility. Hence, we consider the sensitivity analyses and testing for possible differences between ILSA and non-ILSA data to be important for meta-analyses combining these data.

Two-stage direct inclusion

Unlike the one-stage approach, the two-stage approach handles the multiple effect sizes per ILSA or ILSA cycle by pooling them first and submitting the resultant, pooled effect size and sampling variance to the meta-analysis with non-ILSA data (see Fig. 1b). This approach is similar to the two-stage meta-analysis of individual participant data (Burke et al., 2017). To perform the first stage, meta-analysts may rely on, for instance, the formula by Borenstein et al. (2009) to pool the effect sizes (to the average effect size) and the respective sampling variances (to a pooled variance which includes correlations between the effect sizes within a study). Alternatively, the pooled effect size may also be derived via separate meta-analyses for each of the ILSAs or ILSA cycles (see Table 1). This first stage can simplify the meta-analytic data structure in the second stage, because only one effect size per ILSA or ILSA cycle is included; ultimately, this may result in more robust variance estimates (Declercq et al., 2020). At the same time, the first pooling stage discards the within-ILSA variation (e.g., across countries within ILSA cycles; Fig. 1b), an important source of variation and heterogeneity (Van den Noortgate et al., 2013). Moreover, meta-analysts may face the challenge of including effect sizes that are based on very large ILSA samples which ultimately receive larger weights (Borenstein et al., 2009). Examining the sensitivity of the meta-analytic results with respect to including such effect sizes and diagnosing influential effect sizes become key steps in this approach (see Table 1; e.g., Scherer & Siddiq, 2019).
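The first stage can be sketched as follows, based on the aggregation rule in Borenstein et al. (2009) for the mean of several correlated effect sizes: the effect sizes are averaged, and the variance of that average adds the covariance terms implied by an assumed correlation. The common correlation `r` is rarely reported and is an assumption the analyst has to supply; the function name and example values are hypothetical.

```python
import math


def aggregate_effects(effects, variances, r):
    """Average m effect sizes and compute the variance of that average.

    r: assumed common correlation between the effect sizes within one
       ILSA or ILSA cycle (0 treats them as independent).
    """
    m = len(effects)
    if m == 0 or m != len(variances):
        raise ValueError("Need matching, non-empty effects and variances.")
    mean_effect = sum(effects) / m
    sds = [math.sqrt(v) for v in variances]
    total = sum(variances)
    for i in range(m):
        for j in range(m):
            if i != j:
                total += r * sds[i] * sds[j]  # covariance between effects i and j
    return mean_effect, total / m ** 2


# Hypothetical example: two country-level effect sizes from one ILSA cycle.
pooled_effect, pooled_variance = aggregate_effects([0.2, 0.4], [0.01, 0.01], r=0.5)
```

Varying `r` between 0 and 1 is a simple way to check how sensitive the pooled variance, and hence the weight of the ILSA in the second stage, is to this assumption.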

Effect size measures

The four presented approaches are all based on the assumption that the correct effect sizes have been extracted from the ILSA and the non-ILSA data. In this context, “correct” refers to effect size and sampling (co-)variance estimates in which the complex survey design was accounted for, especially the hierarchical structure of the ILSA data (Lai & Kwok, 2016; Tymms, 2004). For instance, when meta-analysts are interested in deriving the correct effect size measures for gender differences in achievement, the standardized mean difference (\(SMD\)) may be the effect size of their choice (Borenstein et al., 2009). When computing \(SMD\), the pooled standard deviation can incorporate information about the nesting of the primary study data (e.g., students nested in classrooms or schools; Brunner et al., 2022). Hedges (2007) proposed several ways to incorporate the intraclass correlation \({ICC}_{1}\) into the estimate of the pooled standard deviation. While such adjustments are available, they depend on the authors’ reporting of the relevant statistics, especially the intraclass correlation.

Besides accounting for the nesting of the primary data, further elements may inform the estimation of the effect sizes, such as the use of sampling weights or performance assessment designs that draw from a set of plausible values (Rutkowski et al., 2010). Given that the raw primary data are oftentimes not available, meta-analysts may have to trust the estimation and reporting of the effect sizes in the publication and have hardly any chance to perform further adjustments. However, such adjustments are possible for most ILSA data. In fact, if the raw data of primary studies are available, meta-analysts are in full control of the effect size estimation and can estimate the effect sizes and the respective sampling (co-)variances from analytic models that incorporate the complex survey design features of ILSAs, such as multilevel models with sampling weights, stratifying variables, plausible values, and multi-group structures (Campos et al., 2023). Overall, meta-analysts have at least two options to address the complex survey design, especially the nested data structure, in primary studies: (a) adjust the reported effect sizes by the \({ICC}_{1}\) (for details, please see Hedges, 2007); or (b) analyze the raw data (if available) via multilevel modeling (Kim et al., 2012).
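Option (a) can be illustrated with one commonly used correction for an \(SMD\) that was computed as if students were sampled independently. The sketch assumes equal cluster sizes and a known \({ICC}_{1}\) and covers only the point estimate; it is one of several estimators in this line of work, so the exact formula and the matching sampling variance should be checked against Hedges (2007) before use. The function and parameter names are hypothetical.

```python
import math


def adjust_smd_for_clustering(d, icc, cluster_size, total_n):
    """Adjust an SMD that ignored clustering (point estimate only).

    d: standardized mean difference computed without regard to nesting.
    icc: intraclass correlation (ICC_1) of the outcome.
    cluster_size: (average) number of students per cluster; equal
        cluster sizes are assumed.
    total_n: total number of students across both groups.
    """
    if not 0 <= icc <= 1:
        raise ValueError("The ICC must lie between 0 and 1.")
    if total_n <= 2:
        raise ValueError("The total sample size must exceed 2.")
    factor = 1 - (2 * (cluster_size - 1) * icc) / (total_n - 2)
    return d * math.sqrt(factor)


# Hypothetical example: 202 students in classes of 21, ICC of 0.25.
d_adjusted = adjust_smd_for_clustering(d=0.5, icc=0.25, cluster_size=21, total_n=202)
```

In large ILSA samples the correction of the point estimate is usually small; the more consequential change concerns the sampling variance, which this sketch does not cover.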

The status of including ILSA data in meta-analyses of gender differences and SES gaps in student achievement

Substantive background

Educational research has long been concerned with examining and ultimately reducing gaps in educational outcomes between groups of students. Much of the discussion has centered around equity and equality in general (Espinoza, 2007), and the educational gaps associated with gender and socioeconomic status in particular (Berkowitz et al., 2017; Else-Quest et al., 2010). For instance, describing the SES-achievement relation in the domain of reading, PISA 2018 identified substantial variation in this relation across more than 70 educational systems (OECD, 2019). This ILSA also revealed cross-country variation in the gender gaps in reading achievement, yet with girls consistently outperforming boys. Similarly, other PISA cycles and ILSAs have mapped such gaps in student achievement across educational systems, age groups, subject domains, and over time and thus provide a rich data source for exploring their effect sizes, heterogeneity, and possible explanatory mechanisms (Broer et al., 2019; Gray et al., 2019).

To examine the extent to which ILSA data have been utilized to inform the meta-analytic body of knowledge and which approaches to including these data meta-analysts have taken, we systematically reviewed existing meta-analyses of the gender differences and SES gaps in student achievement. In this sense, the following two systematic reviews showcase the status of inclusion and inclusion approaches.


We used the systematic review methodology to identify the relevant studies within the scope of this paper, and followed the recommended steps, including predefining research questions, developing the search strategy, defining inclusion and exclusion criteria, screening, data extraction, appraisal, and synthesis (Higgins et al., 2019). In the following sections, we describe the application of these steps.

Search strategy

To retrieve the relevant meta-analyses, we developed a search strategy by first identifying the key terms for addressing the aims of this study and then identifying the most commonly used synonyms for each term. We then performed two independent searches in the databases ERIC (Education Resources Information Center) and PsycINFO, combining search terms related to (a) the study design: meta-analysis or meta-analytic; (b) the outcome variable: achievement or performance or literacy or numeracy or reading or math* or science; and (c) the independent variable. For the latter, we used the search terms “gender difference* or sex difference* or gender gap” for meta-analyses of gender differences and “SES or socioeconomic status or socio-economic status or number of books or parent* education or parent* occupation or income or ESCS or HISEI or ISEI or possession* or capital” for meta-analyses reporting the relation between SES and student achievement. We extended these searches by hand-searching publications in key journals in the field (Educational Research Review, Review of Educational Research, Psychological Bulletin, Journal of Educational Psychology, Large-scale Assessments in Education) and the database PsyArXiv to identify possible preprint publications eligible for inclusion. Additional file 6: S6 contains the full search strategies, including the specific search terms. After removing duplicates, these searches yielded 318 publications for the gender meta-analyses and 271 publications for the SES meta-analyses (see Fig. 3).

Fig. 3 PRISMA flow diagram of the search, screening, and inclusion processes of the meta-analyses. ILSA International large-scale assessment, SES Socioeconomic status

Screening and coding

The retrieved publications were then screened in two steps: First, we reviewed the abstracts for their topic fit, considering meta-analyses that were published in English between 1995 and 2020. In addition, the full texts of these publications had to be available, the topic had to relate to the designated content areas (i.e., gender differences in student achievement or relations between SES and student achievement), and the authors had to have performed a meta-analysis; theoretical reviews, comments, methodological papers, and errata were excluded. This first step resulted in 19 published meta-analyses eligible for further screening for the gender meta-analyses and 36 publications for the SES meta-analyses (see Fig. 3). Second, we reviewed the full texts according to the following criteria:

  • Type of research question and data: The research questions concerning gender differences in student achievement or the relation to SES were of a correlational nature, and the data were observational. Exclude: Meta-analyses on the effectiveness of interventions.

  • Sample: ILSAs contain the student samples the meta-analyses focused on. Exclude: Meta-analyses focusing on children younger than primary school students, children with medical conditions or disorders, and children that were selected according to some criterion that could not be found in ILSAs (e.g., executive function scores).

  • Content and constructs: The constructs and contents of the meta-analyses were included in ILSAs. Exclude: Meta-analyses utilizing achievement or SES measures that were not assessed in ILSAs (e.g., working memory measures, school grades, parents’ income).

  • Direct relations: Direct relations between the constructs (i.e., gender or SES and student achievement) were reported. Exclude: Meta-analyses that use the key constructs as moderators (e.g., Peng et al., 2019).

  • Reported statistics: Independent of their inclusion, ILSA data could provide the statistics and effect sizes needed for the meta-analysis.

  • Inclusion criteria: Irrespective of their inclusion, ILSA studies fulfilled the inclusion criteria of the meta-analysis (i.e., would be eligible for inclusion). Exclude: Meta-analyses that focused on national large-scale assessments (e.g., Petersen, 2018).

These screening steps yielded eight gender meta-analyses and ten SES meta-analyses whose inclusion and exclusion criteria were set so that ILSA studies were eligible for inclusion (Fig. 3). A flowchart describing the screening decisions is shown in Additional file 6: S6.

The coding of these meta-analyses included key characteristics of the studies (i.e., publication year and status, number of studies and effect sizes, context), the measures (i.e., achievement domain, SES dimension(s), SES source(s), SES metric), the meta-analytic models (i.e., type of model(s), addressing the dependence structure, pooled effect size(s)), and the extent to which ILSA data were included (i.e., inclusion of ILSA data [yes/no], data sources, type(s) of ILSAs, cycle(s), inclusion approach, sensitivity analyses). Additional file 1: S1 and Additional file 2: S2 contain the detailed coding of the gender and SES meta-analyses, along with their pooled effects.


Meta-analyses of gender differences in student achievement

Overall, the m = 8 meta-analyses examining gender differences in student achievement included 448 studies, oftentimes operationalized as independent study samples, and yielded 6428 effect sizes in total (see Table 2). These meta-analyses covered the domains of reading (m = 5), mathematics (m = 6), science (m = 3), and digital literacy (m = 1). One meta-analysis was based only on non-ILSA primary studies to avoid redundancies with other meta-analyses (Lindberg et al., 2010), four only on ILSA data (Baye & Monseur, 2016; Else-Quest et al., 2010; Gray et al., 2019; Keller et al., 2022), and the remaining two meta-analyses included both ILSA and non-ILSA data (Lietz, 2006; Siddiq & Scherer, 2019). To a large extent, PISA and TIMSS data were included in the six meta-analyses that extracted information from ILSA data, followed by PIRLS, SACMEQ, and ICILS data. The two meta-analyses that included ILSA and non-ILSA data side-by-side took a one-stage direct inclusion approach, that is, the authors considered the participating countries and/or ILSAs to be studies yielding multiple effect sizes. None of these meta-analyses considered meta-analytic models with dependency structures—in fact, only one of the eight meta-analyses addressed such structures explicitly via multilevel meta-analysis (Keller et al., 2022). Only Siddiq and Scherer (2019) performed sensitivity analyses comparing the one-stage inclusion approach with a two-stage inclusion approach. The latter was based on two steps of meta-analysis: First, ILSA data were meta-analyzed, and the resultant weighted average effect size was extracted as a representative of the effects from ILSA studies. Second, this effect size was combined with the non-ILSA data and then meta-analyzed. Finally, the gender meta-analyses reported mainly standardized mean differences as effect sizes (m = 7), along with variance ratios (m = 2). 
In sum, six of the eight meta-analyses utilized ILSA data, only two of which directly included ILSA and non-ILSA data.

Table 2 Characteristics of the meta-analyses on the gender differences in student achievement

Meta-analyses of the relation between SES and student achievement

The sample of m = 10 meta-analyses describing the relation between SES and student achievement yielded 1631 effect sizes based on 556 studies (see Table 3). These effect sizes were mainly reported as correlations (m = 9) and in only one meta-analysis as a standardized mean difference. The meta-analyses covered a broad range of achievement domains, including literacy (m = 7), mathematics (m = 6), science (m = 6), general cognitive skills (m = 6), social sciences (m = 1), and digital literacy (m = 1), some of which were assessed not only by achievement tests but also by school grades. In all meta-analyses, the SES measures covered multiple dimensions, including parents’ income, occupation, and education. Four meta-analyses were based on non-ILSA data and did not provide any reason for this exclusion (Harwell et al., 2017; Letourneau et al., 2013; Rodríguez-Hernández et al., 2020; Sirin, 2005), while six included both ILSA and non-ILSA data (Kim et al., 2019; Liu et al., 2020; Scherer & Siddiq, 2019; Tan, 2017; Tan et al., 2019; van Ewijk & Sleegers, 2010). None of the meta-analyses were based only on ILSA data. Primarily, the meta-analysts chose the PISA, TIMSS, ICILS, and SACMEQ data to inform their meta-analyses and consistently took a one-stage direct inclusion approach, considering the countries or ILSA cycles as separate studies. Two meta-analyses reported sensitivity analyses: Scherer and Siddiq (2019) compared the one-stage direct inclusion with the two-stage direct inclusion and examined the effects of excluding ILSA data; van Ewijk and Sleegers (2010) also examined the effects of excluding ILSA data. Accounting for the dependencies among multiple effect sizes per study, Liu et al. (2020) performed robust variance estimation, Scherer and Siddiq (2019) conducted three-level meta-analysis, and van Ewijk and Sleegers (2010) modified the weights in a meta-regression model similar to the robust variance estimation. 
In sum, six of the ten SES meta-analyses included ILSA data next to non-ILSA data utilizing mainly the one-stage direct inclusion.

Table 3 Characteristics of the meta-analyses on the relation between socioeconomic status and student achievement

Summary of key findings

Our systematic review of meta-analyses on gender differences in student achievement and the relation between SES and achievement indicated that (a) ILSA data were not eligible for all meta-analyses on these topics, for instance, due to misfit of the target samples, types of achievement measures, or the focus on national rather than international assessment data; (b) several meta-analyses included ILSA data, yet to different degrees (i.e., ILSA data only, ILSA and non-ILSA data side-by-side); (c) meta-analysts mostly took the one-stage direct inclusion approach, yet hardly considered alternative approaches and sensitivity analyses; (d) the structure of the meta-analytic data sets with multiple effect sizes per study was hardly considered.

Illustrative example: Gender differences in digital literacy

In the following, we illustrate the application of the inclusion approaches and show how to implement them. Additional file 4: S4 and Additional file 5: S5 contain the R code, the detailed analytic steps (see also Table 1), and the respective results.

Meta-analytic data set and aims

Siddiq and Scherer’s (2019) original meta-analysis contained 23 primary studies yielding 46 standardized mean differences and included the data from ICILS (International Computer and Information Literacy Study) 2013. We updated this meta-analysis by adding the openly available data from ICILS 2018 (Fraillon et al., 2020). We performed this update for several reasons: First, it increased the number of effect sizes and, ultimately, the statistical power to detect gender differences and possible moderator effects. Second, ICILS 2018 contained a different set of participating countries, and, by including it, we extended the range of educational systems, cultures, and languages to test some hypotheses on the moderating effects of cultural orientation (i.e., power distance index) and innovation (i.e., global innovation index). Third, meta-analysts may be able to include several ILSAs or ILSA cycles rather than only one. In this sense, our illustrative example mimics, to some extent, a typical inclusion scenario, and we use it to showcase the resultant complexities of the meta-analytic data.

Ultimately, this data set contained 24 primary studies and 59 effect sizes. Hedges’ \(g\) represented the standardized mean differences between girls and boys with positive effect sizes indicating higher performance scores for girls. In the original study, the authors aimed to quantify an overall effect size (\(\overline{g }\)), the between-study heterogeneity (\({\tau }^{2}\)), and the moderator effects of, for instance, test fairness (0 = test fairness was not examined, 1 = test fairness was examined) and publication type (0 = published, 1 = grey literature). Illustrating the inclusion approaches, we addressed these aims and further examined whether two country-level variables, the Power Distance Index (PDI; see Hofstede, 2001) and the Global Innovation Index (GII; see Cornell University et al., 2020), were additional moderators. Given that we relied on an updated data set with ICILS 2013 and 2018 data included, the meta-analytic models we used to estimate the weighted average effect size were more complex than the ones used in the original publication. Specifically, we used multilevel meta-analytic models and quantified multiple sources of heterogeneity; hence, we report multiple variance estimates at different levels of analysis (e.g., within studies \({\tau }_{(2)}^{2}\), between studies \({\tau }_{(3)}^{2}\), between countries \({\tau }_{(4)}^{2}\)). We represented the proportion of the total variation in effect sizes that is due to heterogeneity rather than sampling error by the \({I}^{2}\) value and the degree of inconsistency by Cochran’s \(Q\) statistic (Borenstein et al., 2009). Additional file 1: S1 contains the data and describes how these variables were derived.
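The effect size and heterogeneity statistics named above can be sketched as follows. This is a minimal illustration under the assumption of simple random sampling; for ILSA data, the means, standard deviations, and sampling variances would instead have to come from analyses that respect weights and plausible values. The function names and example values are hypothetical and are not taken from the authors' R code.

```python
import math


def hedges_g(mean_girls, mean_boys, sd_girls, sd_boys, n_girls, n_boys):
    """Return (g, var_g) for the girls-minus-boys standardized mean difference."""
    df = n_girls + n_boys - 2
    pooled_sd = math.sqrt(
        ((n_girls - 1) * sd_girls ** 2 + (n_boys - 1) * sd_boys ** 2) / df
    )
    d = (mean_girls - mean_boys) / pooled_sd
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)  # small-sample correction factor
    var_d = (n_girls + n_boys) / (n_girls * n_boys) + d ** 2 / (2.0 * (n_girls + n_boys))
    return correction * d, correction ** 2 * var_d


def heterogeneity(effects, variances):
    """Return Cochran's Q and I^2 (in percent) based on inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * g for w, g in zip(weights, effects)) / sum(weights)
    q = sum(w * (g - pooled) ** 2 for w, g in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2


# Hypothetical sample: girls score 510, boys 500, SD = 100, 200 students per group
g, var_g = hedges_g(510, 500, 100, 100, 200, 200)
q, i2 = heterogeneity([0.0, 0.2], [0.01, 0.01])
```

A positive `g` follows the coding used here, that is, girls outperforming boys.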

Separate meta-analysis

Performing separate meta-analyses via random-effects modeling, we obtained estimates of the weighted average effect sizes for the ICILS 2013, ICILS 2018, the combined ICILS 2013 and 2018, and the non-ILSA data sets. Table 4 shows these estimates, which ranged between \(\overline{g }\) = 0.12 and 0.21 and exhibited heterogeneity between the samples within these data sets. Notably, the overall effect size of the ICILS 2013 data was comparable to that of the non-ILSA data (z = − 0.3, p = 0.76); yet, the ICILS 2018 data tended to show a higher overall effect, although this difference did not reach statistical significance at the 5% level (z = − 1.8, p = 0.07). The degree of heterogeneity varied between these data sets (see also Fig. 4): While the non-ILSA effect sizes varied substantially (\({\tau }_{(2)}^{2}\) = 0.033, \({I}^{2}\) = 95.4%), the effect sizes for the ICILS 2018 varied less (\({\tau }_{(2)}^{2}\) = 0.012, \({I}^{2}\) = 91.2%), and varied the least for the ICILS 2013 data (\({\tau }_{(2)}^{2}\) = 0.005, \({I}^{2}\) = 78.2%). We extended the random-effects models by adding the moderator variables test fairness, publication status, power distance, and global innovation. For the non-ILSA data, publication status negatively moderated the gender differences, with grey literature exhibiting smaller effects, and test fairness positively moderated these differences, with larger effects for studies examining test fairness (Table 5). For the ICILS 2018 and the combined ICILS data, more innovative countries exhibited significantly larger gender effects; this moderation effect was not apparent for ICILS 2013.
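Comparisons between separately pooled effects, such as the ones reported above, can be sketched as a z test for the difference between two independent estimates. The function name is hypothetical, and the standard errors would come from the respective random-effects models; the numbers in the usage line are illustrative only and are not the values from Table 4.

```python
import math


def z_difference(est_a, se_a, est_b, se_b):
    """Two-sided z test for the difference between two independent estimates.

    Returns (z, p). Independence of the two data sets is an assumption.
    """
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    if se_diff == 0.0:
        raise ValueError("at least one standard error must be positive")
    z = (est_a - est_b) / se_diff
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return z, p


# Illustrative values: pooled effects of 0.12 and 0.21 with hypothetical SEs
z, p = z_difference(0.12, 0.04, 0.21, 0.03)
```

A negative `z` here simply indicates that the first data set yielded the smaller pooled effect.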

Table 4 Results of the random-effects meta-analyses of the gender differences in digital literacy
Fig. 4 Forest plots of the effect sizes for the ILSA and non-ILSA data. Note. The weighted average effect sizes were based on common (two-level) random-effects (RE) models. Positive standardized mean differences (Hedges’ g) suggested that girls performed better than boys

Table 5 Results of the mixed-effects meta-regression analyses of the standardized mean differences across gender including study- and country-level moderators

Indirect inclusion via Bayesian meta-analysis

Utilizing the information from the separate meta-analyses, we conducted Bayesian meta-analysis for the non-ILSA data with informative priors on the weighted average effect and the heterogeneity estimates. These priors were based on the effect size and variance estimate of the combined ICILS 2013 and 2018 data (for the detailed specification of the priors, see Additional file 5: S5). The overall effect size was \(\overline{g }\) = 0.12, with a 95% credible interval similar to the confidence interval of the effect for the non-ILSA data and a between-sample variance of \({\tau }_{(2)}^{2}\) = 0.036 (Table 4). The Potential Scale Reduction Factors \(\widehat{R}\) of the model parameters were all below 1.01, and the simulated distributions were similar to the observed distributions in the posterior predictive checks (see Additional file 5: S5). Moreover, the Markov chain Monte Carlo (MCMC) chains showed a stable pattern without any clear trends or systematic changes over time and scattered around the model parameter estimates (see the trace plots in Additional file 5: S5). These observations supported that the meta-analytic model had converged and that stable estimates were obtained (Harrer et al., 2022). Moreover, varying the prior distributions did not show substantial sensitivity of the Bayesian effect size and variance estimates. Similar to the separate meta-analysis of the non-ILSA data, publication status moderated the gender differences in digital literacy, whereas test fairness did not (Table 5).
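To make the idea of indirect inclusion more concrete, the sketch below approximates the posterior of a normal-normal random-effects model on a grid, with a normal prior on the mean effect and a half-normal prior on the heterogeneity standard deviation. It is an assumption-laden illustration: the authors' actual prior specification and MCMC-based estimation are documented in Additional file 5: S5, and all names, prior values, and effect sizes below are hypothetical.

```python
import math


def normal_logpdf(x, mean, sd):
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def bayes_meta_grid(effects, variances, prior_mu_mean, prior_mu_sd,
                    prior_tau_scale, mu_range=(-1.0, 1.0), tau_max=1.0, steps=201):
    """Grid approximation; returns posterior means of mu and tau^2."""
    mus = [mu_range[0] + i * (mu_range[1] - mu_range[0]) / (steps - 1)
           for i in range(steps)]
    taus = [j * tau_max / (steps - 1) for j in range(steps)]
    grid = []
    for mu in mus:
        for tau in taus:
            # Informative priors (e.g., derived from pooled ILSA estimates)
            lp = normal_logpdf(mu, prior_mu_mean, prior_mu_sd)
            lp += normal_logpdf(tau, 0.0, prior_tau_scale)  # half-normal, up to a constant
            # Marginal likelihood of each non-ILSA effect size
            for y, v in zip(effects, variances):
                lp += normal_logpdf(y, mu, math.sqrt(v + tau * tau))
            grid.append((mu, tau, lp))
    max_lp = max(lp for _, _, lp in grid)
    weights = [(mu, tau, math.exp(lp - max_lp)) for mu, tau, lp in grid]
    total = sum(w for _, _, w in weights)
    post_mu = sum(mu * w for mu, _, w in weights) / total
    post_tau2 = sum(tau * tau * w for _, tau, w in weights) / total
    return post_mu, post_tau2


# Hypothetical non-ILSA effect sizes and ILSA-informed priors
post_mu, post_tau2 = bayes_meta_grid(
    effects=[0.05, 0.30, 0.10, 0.25], variances=[0.02, 0.03, 0.01, 0.02],
    prior_mu_mean=0.15, prior_mu_sd=0.05, prior_tau_scale=0.10)
```

Rerunning the function with wider or narrower priors is a simple way to mimic the prior sensitivity check mentioned above.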

One-stage direct inclusion

Directly combining the effect sizes obtained from ILSA and non-ILSA data resulted in a nested structure with multiple effect sizes per study. We therefore specified a three-level random-effects model addressing this structure (see Fig. 2b). This model exhibited a significantly better fit to the meta-analytic data than a model ignoring the nesting (see Fig. 2a), \({\upchi }^{2}\)(1) = 10.0, p = 0.002. Moreover, the three-level model exhibited substantial within-study variation in addition to the between-study variation (see Table 4). The respective overall effect size was \(\overline{g }\) = 0.13 (95% CI [0.05, 0.21]) and showed significant heterogeneity (\({Q}_{E}\)[58] = 592.5, p < 0.001). Adding the potential moderator variables resulted in a significant effect of publication status (B = − 0.21, SE = 0.10, p = 0.04) and global innovation (B = 0.01, SE = 0.00, p < 0.001; see Table 5). Overall, about 46% of the between-sample and 2% of the between-study variation could be explained. Moreover, the difference in gender effects between ILSA and non-ILSA data was not statistically significant, B = 0.05, SE = 0.13, p = 0.72. In our example, random-effects models with robust variance estimation (RVE) only identified the moderating effect of the publication type (Additional file 5: S5).

Given that some countries in the samples contributed multiple effect sizes (e.g., to the ICILS 2013, ICILS 2018, and non-ILSA data), an additional level of nesting may exist. To examine the degree of possible between-country variation in the effect sizes, we extended the three-level model to a four-level cross-classified random-effects model (see Fig. 2c). This model exhibited a better fit than the three-level model (\({\upchi }^{2}\)[1] = 4.2, p = 0.04) and showed that between-country variation existed, in addition to within- and between-study variation (see Table 4). The corresponding effect size was \(\overline{g }\) = 0.10, 95% CI [0.01, 0.18]. Similar to the three-level model, the effects of publication status (B = − 0.21, SE = 0.10, p = 0.04) and global innovation existed (B = 0.01, SE = 0.00, p < 0.001; see Table 5). However, this model showed that most variance could be explained at the country level (49.1%), yet not the study level (2.9%).
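The model comparisons reported for the three- and four-level models are likelihood-ratio tests with one degree of freedom. The following sketch shows how the p value follows from the test statistic; the function names are hypothetical, and the log-likelihoods themselves would come from the fitted meta-analytic models.

```python
import math


def lrt_statistic(loglik_restricted: float, loglik_full: float) -> float:
    """Likelihood-ratio statistic for two nested models."""
    return 2.0 * (loglik_full - loglik_restricted)


def lrt_pvalue_df1(chi_square: float) -> float:
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    if chi_square < 0:
        raise ValueError("the test statistic must be non-negative")
    return math.erfc(math.sqrt(chi_square / 2.0))


# Test statistics reported in the text
p_three_level = lrt_pvalue_df1(10.0)  # three-level vs. two-level model
p_four_level = lrt_pvalue_df1(4.2)    # cross-classified vs. three-level model
```

After rounding, these values correspond to the reported p = 0.002 and p = 0.04.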

Two-stage direct inclusion

Utilizing the weighted average effect size and variance estimates of the separate meta-analyses, we combined the non-ILSA effect sizes with one overall ICILS 2013 and one overall ICILS 2018 effect size. Estimating the random-effects model without a nested structure, we obtained an overall gender effect of \(\overline{g }\) = 0.12 (95% CI [0.05, 0.19]; see Table 4), and the moderation effect of publication status (B = − 0.22, SE = 0.09, p = 0.01; see Table 5). The effect of test fairness was statistically significant, B = 0.17, SE = 0.08, p = 0.05 (see Table 5). This additional moderation effect suggested that larger effects were exhibited for studies that examined test fairness, after controlling for the interactivity of the assessment tasks and the publication status. Finally, pooling the ILSA effect sizes via Borenstein et al.’s (2009) procedure in the first stage did not show any different results: The weighted average effect size was \(\overline{g }\) = 0.12 (95% CI [0.05, 0.19]), and the two moderator effects persisted (publication status: B = − 0.22, SE = 0.08, p = 0.01; test fairness: B = 0.17, SE = 0.08, p = 0.05). Further analyses did not flag the large ILSA-data effect sizes as influential (see Additional file 5: S5).
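The two stages can be sketched as follows. The sketch uses the DerSimonian-Laird estimator for the between-sample variance because it is easy to write down; this is an assumption for illustration and may differ from the estimator used in the authors' R code. All names and data values are hypothetical.

```python
def dl_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling.

    Returns (pooled effect, variance of the pooled effect, tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return pooled, 1.0 / sum(w_re), tau2


def two_stage(non_ilsa, ilsa_cycles):
    """Stage 1: pool each ILSA cycle. Stage 2: pool with the non-ILSA effects.

    non_ilsa is (effects, variances); ilsa_cycles is a list of such pairs.
    """
    effects, variances = list(non_ilsa[0]), list(non_ilsa[1])
    for cycle_effects, cycle_variances in ilsa_cycles:
        pooled, pooled_var, _ = dl_pool(cycle_effects, cycle_variances)
        effects.append(pooled)        # one representative effect per cycle
        variances.append(pooled_var)  # with its own sampling variance
    return dl_pool(effects, variances)


# Hypothetical data: three non-ILSA samples and two ILSA cycles
overall, overall_var, tau2 = two_stage(
    ([0.05, 0.30, 0.10], [0.02, 0.03, 0.01]),
    [([0.15, 0.20, 0.10], [0.002, 0.003, 0.002]),
     ([0.25, 0.18, 0.22], [0.002, 0.002, 0.003])])
```

Because each cycle enters the second stage as a single effect size, between-country information is no longer available at that stage.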

Summary of key findings

Across the direct inclusion approaches, the overall effect sizes were consistently small and positive. Notably, these gender differences favored girls and tended to be smaller than in more curriculum-oriented domains such as mathematics, science, and reading (for specific ranges of effect sizes, please see Additional file 1: S1). All approaches revealed the heterogeneity of the gender effects. The cross-classified model represented the data best for the one-stage direct inclusion and highlighted three additional sources of heterogeneity (next to sampling variation): samples within studies, studies, and countries. Next to the consistency of the fixed effects, the moderator effects of publication status were almost identical in direction and magnitude. Some differences, however, existed for test fairness and global innovation: The one-stage inclusion approach identified the GII moderation effect and located it at the country level, whereas this effect did not exist when synthesizing only the non-ILSA or ICILS 2013 data. The two-stage inclusion approach and the separate meta-analysis of the non-ILSA data further indicated moderation by test fairness.


Including ILSA data in meta-analyses in education

Our systematic review of the extent to which ILSA data were included in existing meta-analyses of gender differences or SES gaps in student achievement showed that ILSA data were not eligible for all meta-analyses. This may have been the main reason why their inclusion was limited. For instance, the seminal meta-analysis of gender differences in student achievement by Voyer and Voyer (2014) focused solely on teacher-assigned grades as achievement measures and thus excluded ILSA data. Evaluating the eligibility of studies for inclusion also applies to ILSA data, and meta-analysts should carefully evaluate whether the ILSA samples, constructs, and study designs fit their inclusion criteria and, ultimately, research purposes. Irrespective of the outcome of this evaluation, communicating the reasons for excluding ILSA data should be an integral part of the methodological rigor of meta-analyses in education (Pigott & Polanin, 2020). Moreover, given our review of the potential and the analytic opportunities associated with the inclusion of ILSA data in meta-analyses, we argue that searching the existing ILSA databases should become part of the meta-analytic standard procedures in education.

As noted earlier, one key issue of including ILSA data in meta-analyses lies in the methodological complexities these large-scale data may impose. As we have showcased while presenting the one-stage direct inclusion approach, the meta-analytic structure of the data that include ILSA and non-ILSA effect sizes can become complex, with hierarchical or even cross-classified structures. While modeling such structures may shed light on the possible sources of variation and the level at which moderators operate, the underlying meta-analytic models are advanced (Fernández-Castilla et al., 2020). This may have been one reason why most meta-analysts refrained from addressing such complex data structures in their meta-analyses of the gender differences and SES gaps in student achievement. Our extension of the meta-analysis of the gender differences in digital literacy included two ILSAs and thus required meta-analytic models accounting for the multiple effect sizes per study and country. Meta-analysts should be aware of which structure their meta-analytic data set including ILSA and non-ILSA data exhibits to obtain accurate estimates of fixed and random effects (Fernández-Castilla et al., 2020).

Another complexity is associated with the decision of which type of ILSA data to include: primary or secondary data. Given the availability of most ILSA data, meta-analysts do not need to rely on the results reported in secondary ILSA data analyses but can compute the effect sizes themselves. Although appealing, this opportunity requires that meta-analysts are aware of the methodological complexities of the primary ILSA data and can address them analytically (Rutkowski et al., 2010). Hence, we see the need for training meta-analysts in both the analysis of primary ILSA data to derive the correct effect size estimates and the inclusion approaches for meta-analyses.

Concerning the four inclusion approaches, our illustrative example showed consistently small estimates of the weighted average gender effect size. With the exception of the separate meta-analysis of the ICILS 2018 data, the estimates were comparable and did not lead to different conclusions. Nonetheless, we refrain from generalizing this result: In other contexts, with other measures and effect sizes, and for a different set of ILSA or non-ILSA data, the fixed effects may indeed vary considerably, especially when meta-analyzed separately (Gray et al., 2019). At the same time, some specifications within the inclusion approaches yielded homogeneous results. For instance, the overall gender effects were identical for the separate meta-analysis of the non-ILSA data and the indirect inclusion in our study. In fact, both approaches focused on the non-ILSA data and differed only in the extent to which information from the ILSA data was incorporated (e.g., Röver, 2020). Similarly, the different direct inclusion approaches agreed on the size of the pooled effect.

Recommendations for including ILSA data in meta-analyses

Considering the marginal differences in meta-analytic findings in our illustrative example, meta-analysts may well argue that the choice of the specific approaches may not matter for the reporting of the overall effects. However, some of these approaches are more useful than others, especially for quantifying the sources of variation and the moderator effects (Fernández-Castilla et al., 2020), and we recommend that meta-analysts choose an approach in light of the goals of their study.

First, we recommend that meta-analysts who wish to compare the effects obtained from non-ILSA studies to ILSA data conduct separate meta-analyses of these two types of data. This approach facilitates the benchmarking and interpreting of effect sizes from non-ILSA data (Wagemaker, 2016). Moreover, we argue that conducting separate meta-analyses could also provide initial insights into the potential similarities and differences of effects across data sources and may, at the least, serve as a form of robustness check for the other approaches.

Second, if the purpose of a meta-analysis is to synthesize evidence from non-ILSA data sources (e.g., due to some substantively motivated inclusion criterion), we recommend considering an indirect inclusion approach. Without influencing the core meta-analytic findings or choices of data, such an approach can inform and potentially improve the heterogeneity estimates by incorporating the knowledge about such parameters in ILSAs (Brunner et al., 2018).

Third, in situations where the heterogeneity and possible moderator effects for ILSA and non-ILSA data are the primary interest, we recommend taking a direct inclusion approach. Both the one- and two-stage direct inclusion can shed light on between-study heterogeneity and moderation by study-level features. The one-stage approach can further include between-country heterogeneity and country-level moderation effects (see also Cheng et al., 2018). Via direct inclusion, meta-analysts can test specific hypotheses on which factors at which levels of analyses may explain the heterogeneity of the effects. Moreover, they can compare directly via subgroup or moderator analyses to what extent the type of data (i.e., ILSA vs. non-ILSA data) also explains heterogeneity. In this sense, the direct inclusion approaches offer several analytic possibilities to quantify and explore heterogeneity, which is why we considered them to be the preferred choice in meta-analyses in education.

Each of the steps within the inclusion approaches should be documented, and the analytic decisions within justified (Pigott & Polanin, 2020). Once the eligible primary studies and ILSA data sets have been identified, the following analytic aspects are key when meta-analyzing non-ILSA and ILSA data side-by-side:

  • Generate the effect sizes from the primary data incorporating the complex survey design features. As noted earlier, the correct effect size and sampling variance estimates must be derived from both the non-ILSA and ILSA data. For the latter, both adjustments of effect sizes and the re-analysis of the raw data are largely feasible, and the official ILSA reports already contain some effect sizes that are based on the complex survey design (e.g., gender differences, relations between SES and achievement). Meta-analysts should clearly communicate the ways in which they derived the effect size measures, their sampling variances and covariances, and how they dealt with the complex survey design features of ILSAs, such as weighting, multi-stage and cluster sampling, rotated questionnaire designs, and stratification. Moreover, if IPD sets are analyzed and model-based effect sizes are estimated, the analytic modeling procedures should be replicated across ILSAs or ILSA cycles, so that effect sizes are comparable and have the same meaning.

  • Indicate the structure of the meta-analytic data. In addition to the nested structure of the primary data (e.g., students nested in classrooms or schools), meta-analytic data can also follow complex structures (e.g., multiple effect sizes nested in studies or ILSAs; see Fig. 2). To derive overall estimates of a weighted average effect, meta-analytic models that account for this structure are needed (Fernández-Castilla et al., 2020). Meta-analysts should identify the structure of their data and select the respective meta-analytic models (e.g., multilevel meta-analysis, robust variance estimation). Selecting one effect size per ILSA is not recommended.

  • Choose an inclusion approach based on the research questions and goals. In reviewing the inclusion approaches in our framework, we identified both their strengths and weaknesses. Meta-analysts should carefully consider them and decide on an approach in light of their research questions and purposes. For instance, if only small-scale primary studies are in the focus, ILSA data may only inform the meta-analysis via an indirect inclusion approach. If differences between studies with random versus convenience samples are in the focus, both non-ILSA and ILSA data may inform the meta-analysis via a direct inclusion approach.

  • Conduct sensitivity analyses. Sensitivity analyses can shed light on the impact the inclusion of ILSA data in the meta-analysis of non-ILSA data may have on the substantive findings and estimates. Moreover, they indicate the robustness of the specific inclusion approach researchers have taken.

  • Report the analytic steps and decisions transparently. We encourage meta-analysts to document each of the analytic steps and decisions and share their analytic code to facilitate transparency and possible updates of their meta-analyses. This is especially relevant for replicating the model-based generation of effect sizes accounting for the complexities of the primary ILSA data (IPD) and the meta-analytic models accounting for the complexities of the secondary (meta-analytic) data.
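As one example of handling a complex survey design feature when generating effect sizes from primary ILSA data, the sketch below combines an effect size that was estimated separately for each plausible value using Rubin's combination rules. The function name and the values are hypothetical, and the per-plausible-value sampling variances are assumed to come from the replication procedure documented for the respective ILSA.

```python
import statistics


def combine_plausible_values(estimates, sampling_variances):
    """Combine per-plausible-value estimates via Rubin's rules.

    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need matching lists with at least two plausible values")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(sampling_variances)  # average sampling variance
    between = statistics.variance(estimates)       # variance across plausible values
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical gender effect estimated with three plausible values
pooled_g, total_var = combine_plausible_values(
    [0.10, 0.12, 0.14], [0.001, 0.001, 0.001])
```

The pooled estimate and its total variance would then enter the meta-analytic data set as the effect size and sampling variance for that country or cycle.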

Limitations and future directions

The present study has several limitations: First, the two systematic reviews provide information about the inclusion of ILSA data in meta-analyses for the two selected topics (i.e., gender differences and the relation between SES and achievement). Although these topics concern key issues in education (e.g., OECD, 2016), especially in the context of equity and equality, the respective findings may not be fully generalizable. In this sense, we encourage researchers to consider extending these reviews to other educationally relevant topics.

Second, our study reviewed the advantages and challenges associated with the application of four inclusion approaches, yet did not examine their performance in large-scale meta-analyses and simulations. Knowledge about their performance, especially their efficiency, bias, and the precision of the meta-analytic estimates, could further guide the choice among the approaches.

Third, our review focused on situations in which ILSA and non-ILSA data are combined. However, in practice, meta-analysts may also face situations in which only ILSA data are combined meta-analytically, for example, from multiple ILSAs and ILSA cycles. Such situations offer the possibility to generate effect sizes and sampling (co-)variances from the same kind of analytic model. Recently, some ways to meta-analyze only ILSA data have been proposed (Brunner et al., 2022; Campos et al., 2023) with respective examples (e.g., Blömeke et al., 2021; Keller et al., 2022).


Overall, we argue that ILSA data hold great potential for informing meta-analyses in education, especially due to their rigorous study and sampling designs, the availability of indicators describing educational systems at multiple levels, and their focus on key issues and constructs in education. This potential may not only assist meta-analysts in expanding their data sets and ultimately improving the precision of the meta-analytic estimates, but also reduce possible publication, cultural, and methodological bias. Another key advantage is that the primary ILSA data are almost entirely available to meta-analysts, who can define and implement the analytic models themselves and thus directly obtain effect sizes that account for the complex survey design. At the same time, including ILSA data requires a careful choice of an appropriate methodological approach and may extend the analytic steps involved in a meta-analysis by further sensitivity and moderator analyses. Moreover, the complex structure of both the primary ILSA data and the resultant meta-analytic ILSA and non-ILSA data must be addressed.

Our paper describes four ILSA data inclusion approaches, outlines the steps meta-analysts may take to examine the possible effects of including ILSA data in their meta-analyses, and provides information on their potential, challenges, and fit to the specific research purposes. We believe that this framework of approaches informs and stimulates the inclusion of ILSA data in meta-analyses on key issues in education to ultimately improve the quality, precision, and informativeness of research evidence.

Availability of data and materials

The IPD analysed during the current study are available in the IEA ICILS data repository. The meta-analytic dataset can be accessed via the Open Science Framework (OSF) and is made available as supplementary material. The manuscript has been made available as a preprint via MetaArXiv.



Abbreviations

ERIC: Education resources information center

ESCS: Economic, social and cultural status

GII: Global innovation index

HISEI: Highest international socio-economic index of occupational status

ICC: Intraclass correlation coefficient

ICILS: International computer and information literacy study

IEA: International association for the evaluation of educational achievement

ILSA: International large-scale assessment

IPD: Individual-participant data

Non-ILSA: Studies other than international large-scale assessments

OECD: Organisation for economic co-operation and development

PDI: Power distance index

PIRLS: Progress in international reading literacy study

PISA: Programme for international student assessment

SACMEQ: The Southern and Eastern Africa Consortium for Monitoring Educational Quality

SES: Socioeconomic status

SMD: Standardized mean difference

TALIS: Teaching and learning international survey

TIMSS: Trends in international mathematics and science study

UNESCO: United Nations Educational, Scientific and Cultural Organization



Not applicable.


This work was supported by the Centres of Excellence scheme, funded by the Research Council of Norway, project number 331640.

Author information

Authors and Affiliations



RS: Conceptualization, methodology, software, formal analysis, investigation, resources, data curation, writing-original draft, writing-review & editing, visualization; TN: Writing-original draft, writing-review & editing; FS: Writing-original draft, writing-review & editing.

Corresponding author

Correspondence to Ronny Scherer.

Ethics declarations

Ethics approval and consent to participate

This review article utilizes the primary study data of the IEA ICILS and the secondary study data published in the eligible articles. Ethics approval and consent to participate concerning ICILS were organized and given by the national ICILS centres in the participating countries and conformed to the IEA ethical standards. For more details, please refer to the respective ICILS documentation (e.g., the ICILS 2013 and 2018 technical reports). The secondary study data were based on summary data of the primary study data sets, so that no additional ethics approval or consent was required.

Consent for publication

We provide our consent to publish this manuscript upon acceptance in the Springer open-access journal “Large-scale Assessments in Education”. No further consent is required.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. Gender Differences in Academic Achievement.

Additional file 2. The Relationship between Socioeconomic Status and Academic Achievement.

Additional file 3. Primary Study Data.

Additional file 4. Meta-Analysis of Gender Differences in Digital Skills: Separate Meta-Analyses.

Additional file 5. Meta-Analysis of Gender Differences in Digital Skills: Direct and Indirect Inclusion Approaches.

Additional file 6. Search Strategies, Screening, and Included References.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Scherer, R., Siddiq, F. & Nilsen, T. The potential of international large-scale assessments for meta-analyses in education. Large-scale Assess Educ 12, 4 (2024).

