Teacher-centered analysis with TIMSS and PIRLS data: weighting approaches, accuracy, and precision

This paper extends existing work on teacher weighting in student-centered surveys by examining the practical implementation of deriving and using weights for teacher-centered analysis in the Trends in International Mathematics and Science Study (TIMSS) and the Progress in International Reading Literacy Study (PIRLS). The formal conditions to compute teacher-centered weights are detailed, including mathematical equations. We propose how to define the targeted populations and how to collect the data that are needed to derive teacher-centered weights but are currently unavailable. We also address teacher nonresponse by proposing a corresponding adjustment factor, and we discuss the challenge of multiple selection probabilities when teachers teach in multiple schools. The core part of the paper focuses on the level of accuracy that can be expected when estimating teacher population characteristics. We use TIMSS 2019 data and simulate likely scenarios regarding the variance in weights. The results show that (i) the different weighting scenarios lead to relatively similar estimates; however, the differences between the scenarios are sufficient to justify the recommendation to use correctly derived teacher weights; (ii) differences between estimated standard errors based on complex sampling and corresponding estimates based on simple random sampling are sufficiently consistent to support use of a procedure to estimate standard errors that accounts for both sample weights and the complex sampling design; (iii) sample sizes and variance in weights significantly limit estimate precision, so that total population estimates with sufficient precision are available in the majority of countries but subpopulation features are generally not sufficiently precise. To provide a critical evaluation of our results, we recommend implementation of the proposed method in one or more countries. This recommended study will permit examination of logistical considerations in implementation of required changes in data acquisition and will provide data to replicate the analysis with teacher-centered weights.


Introduction
Many contemporary international large-scale assessments (ILSAs), for example the Trends in International Mathematics and Science Study (TIMSS, Martin et al., 2020), the Progress in International Reading Literacy Study (PIRLS, Martin et al., 2017), and the Programme for International Student Assessment (PISA, OECD, 2019a), investigate student populations. Others cover teachers; the most prominent is the Teaching and Learning International Survey (TALIS, OECD, 2019b). A third type of ILSA attempts to cover both teacher and student populations within one study, which requires compromises in the optimization of the sampling designs. Examples of such studies are the International Civic and Citizenship Education Study (ICCS, Schulz et al., 2018) and the International Computer and Information Literacy Study (ICILS, Fraillon et al., 2020), which both target eighth-grade students and their teachers and aim for fully representative samples of both groups. While this solution sounds intriguing and cost-efficient, it comes with a severe disadvantage: there is no direct linkage between teachers and students, so that, for example, teachers' attitudes and teaching styles cannot be related directly to their students' characteristics and outcomes.
TIMSS and PIRLS are among the most well-known ILSAs in the world, with more than 50 participating countries and educational systems. Since 1995, TIMSS has investigated, every four years, the attainment in mathematics and science of students in the fourth and eighth grades. Since 2001, PIRLS has studied, every five years, the reading literacy of students in the fourth grade. A rich array of contextual information is gathered in both studies from the students themselves and from individuals involved in students' learning: school principals, parents, and teachers of the sampled students. Even though TIMSS and PIRLS are designed to provide information on student learning, and analyzing teacher-level characteristics is not part of the studies' analytical objectives, scholars are interested in using the information that is collected from teachers. However, analyzing teacher data from these studies is not straightforward. In this paper, we consider TIMSS 2019. This survey provides summary results for teachers on variables ranging from years of experience to job satisfaction for different educational systems, subjects taught, and grades taught. For example, the average years of experience in Albania of a student's mathematics teacher in Grade 4 is estimated to be 22 (Mullis et al., 2020, page 390). This average does not necessarily estimate the mean years of experience of Grade 4 mathematics teachers in Albania. Instead, the reported average estimates a weighted mean of years of experience. For a given teacher, the weight is a sum over students taught in a given grade and subject of the fraction of instruction provided. In the Albanian example, each student has only one mathematics instructor, so that the weight is proportional to the number of students taught. The TIMSS 2019 User Guide (Fishbein et al., 2021) warns users of the TIMSS 2019 database of the difference between these two averages:

"The teachers in the TIMSS 2019 International Database do not constitute representative samples of teachers in the participating countries. Rather, they are the teachers of nationally representative samples of students. Therefore, analyses with teacher data should be made with students as the units of analysis and reported in terms of students who are taught by teachers with a particular attribute." (Fishbein et al., 2021, p. 13)

This warning reflects two distinct issues. The sampling design does not ensure that sampled teachers are a representative sample of all teachers in an educational system, and the data collection does not permit a weighting adjustment to allow use of the sampled teachers to estimate mean characteristics of the population of teachers.
Although TIMSS emphasizes assessment of achievement of students, in line with Hooper et al. (2022), we argue that simple modifications of forms provided by participating schools permit development of teacher-centered sampling weights that allow use of the sample of teachers in TIMSS and PIRLS for estimation of means of characteristics of the teacher populations of participating educational systems.
In this paper, we will start by proposing a teacher population definition for the surveyed grades and subjects in TIMSS. Next, we will briefly review weighting in TIMSS and current inferences that implicitly use sample student weights to provide sample teacher weights. We refer to these weights hereafter as student-centered teacher weights (s-tchwgt). They are useful for research questions dealing with the relationship between teachers and students. By revisiting the results from Hooper et al. (2022), we will then introduce sample teacher-centered teacher weights (t-tchwgt) that can be used if the interest is in teachers themselves rather than in their students.
Thereafter, we will apply the findings of Hooper et al. (2022) to determine how to obtain the information needed to derive teacher-centered weights and how to examine the accuracy of estimates based on t-tchwgt. Because the current data from TIMSS and PIRLS do not permit application of the approaches that Hooper et al. (2022) proposed from a theoretical perspective, results of a simulation study will be presented that examine the expected precision of the proposed teacher-centered estimates. To inform this simulation, we considered existing data from TIMSS 2019. Complications such as weight adjustments for non-response and multiple chances of selection when teachers teach in multiple schools will be considered. The paper will close with conclusions concerning the feasibility in practice of teacher-centered estimates and with recommendations concerning implementation of such estimates.
Because TIMSS and PIRLS use the same sampling design (Joncas and Foy, 2012), the findings of this research are fully applicable to other iterations of TIMSS, and to PIRLS. The notation we use in this paper can be found in Table 14.

Defining international target populations of teachers for TIMSS and PIRLS
The introduction of revised teacher weights in TIMSS will facilitate analyses on the teacher level without the need to use students as units of analysis and reporting. To draw direct conclusions about a teacher population with equally weighted teachers, it is important to agree on an unambiguous definition of this population. This section proposes such a definition in line with the assumptions in the remainder of this paper. To the authors' knowledge, there is no explicit definition of the population of teachers in either TIMSS or PIRLS. However, as specified in the TIMSS technical documentation (LaRoche et al., 2020), TIMSS invites all mathematics and science teachers of the selected classes to participate. The same applies to reading/language teachers of the participating PIRLS classes (Martin et al., 2017). To allow the current selection mechanism to align with the procedures proposed in this paper, we suggest including all mathematics and science teachers who instruct students in the target grade, i.e., fourth and/or eighth grade for TIMSS, and all reading/language teachers of fourth-graders for PIRLS. The proposed definition corresponds to the following TIMSS and PIRLS international target population definition of students:

Fourth grade (TIMSS and PIRLS)
All students enrolled in the grade that represents four years of schooling counting from the first year of ISCED Level 1, providing the mean age at the time of testing is at least 9.5 years (LaRoche et al., 2020, sect. 3.4)

Eighth grade (TIMSS only)
All students enrolled in the grade that represents eight years of schooling counting from the first year of ISCED Level 1, providing the mean age at the time of testing is at least 13.5 years (LaRoche et al., 2020, sect. 3.4)

To these student target populations correspond four distinct teacher target populations in TIMSS (mathematics teachers of fourth-grade classes, science teachers of fourth-grade classes, mathematics teachers of eighth-grade classes, and science teachers of eighth-grade classes) and one teacher target population in PIRLS (reading/language teachers of fourth-grade classes), as follows:

Fourth grade (TIMSS and PIRLS; mathematics, science, and reading/language teachers)
All teachers teaching mathematics [science, reading/language] to students enrolled in the grade that represents four years of schooling counting from the first year of ISCED Level 1, providing the student mean age at the time of testing is at least 9.5 years (LaRoche et al., 2020, sect. 3.4)

Eighth grade (TIMSS only; mathematics and science teachers)
All teachers teaching mathematics [science] to students enrolled in the grade that represents eight years of schooling counting from the first year of ISCED Level 1, providing the student mean age at the time of testing is at least 13.5 years (LaRoche et al., 2020, sect. 3.4)

It is important to note that the teacher target populations are not mutually exclusive; e.g., a mathematics teacher of fourth-grade students can also be a science teacher of eighth-grade students, or a teacher might teach multiple subjects to the same class. Moreover, teachers can teach at different schools. All teachers are considered equally, regardless of the hours taught. We further suggest defining the subjects science and mathematics based on the content domains of the assessment. Thus, subjects related to mathematics must cover at least one of the following content domains: number, measurement, geometry, algebra, data, or probability (Lindquist et al., 2017). Subjects related to science must cover at least one of the following content domains: life science, biology, chemistry, physical science, physics, or earth science (Centurino and Jones, 2017). Even though we have tried to give as accurate a definition as possible, there may still be contested cases. For example, if several teachers teach the same subject to the same class, the general rule is that all teachers are part of the target population. We propose that a teacher associated with a class is excluded from the target population only if one of the following conditions applies: the teacher is not at all involved in instructing the students, the teacher clearly only has a supporting role, the teacher is in training, or the teacher's role in delivering instruction is otherwise very limited. Furthermore, in accordance with the proposed definition, teachers who do not teach the respective target grade and/or subject during the TIMSS testing period are not considered part of the target population.
Due to the multistage sampling procedure of TIMSS and PIRLS, the listing of teachers is inter-related with the sampling of schools and classes. In order not to jeopardize the core objectives of the studies and to keep procedures simple and cost-efficient, exclusion criteria for teachers must align with the exclusion criteria for schools and classes. Thus, teachers are excluded if they only instruct students in excluded schools or excluded classes. For instance, to a limited extent, TIMSS and PIRLS permit countries to exclude very small schools. At the class level, participating countries are allowed to exclude classes in which all students are either non-native speakers or have functional or intellectual disabilities.

Weighting in TIMSS
In this section, we will summarize the usual sampling procedures applied in TIMSS (Joncas and Foy, 2012), as this knowledge is built upon in the following sections.
In TIMSS, multistage sampling is used to obtain student samples for assessment of achievement in mathematics and science in the fourth and eighth grades (LaRoche et al., 2020). This procedure is not designed to facilitate sampling of teachers. To consider procedural changes to facilitate inferences on teachers, we examine the sampling procedure used in TIMSS for an educational system with $N$ schools, $H$ strata, $C$ classes, and $S$ students in the target grade. At the initial stage, within stratum $h$, schools are sampled with probability proportional to size (PPS), where ideally the size measure for a school $i$ is defined as the number of students $S_{hi}$ in the target grade. A school and two replacement schools are selected simultaneously from the $N_h$ schools in the stratum. The original school is used if it participates. The first replacement school is used if the original school does not participate but the first replacement school does. The second replacement school is used if neither the original school nor the first replacement school participates but the second replacement school does. After adjustments for nonresponse, participating sampled school $i$ from explicit stratum $h$ has a sampling weight $F_{hi1} = A_{h1} M_h/(n_h m_i)$. This weight involves the size measure $m_i$ for sampled school $i$, the sum $M_h$ of size measures for all schools in stratum $h$, and the school non-participation adjustment $A_{h1}$ for stratum $h$. For stratum $h$, the adjustment $A_{h1}$ depends on the number $n_h$ of participating sampled schools and the number $n_{hnr}$ of cases in which neither the originally sampled school nor its two replacement schools participated. The adjustment is $A_{h1} = (n_h + n_{hnr})/n_h$. If schools in stratum $h$ are certain to participate, then $A_{h1}$ is always 1 and the inverse of $F_{hi1}$ is the exact probability that school $i$ participates. The mechanisms used in TIMSS for adjusting nonresponse are based on the assumption that observations are missing at random within the adjustment cells. However, since this assumption cannot be definitively proven, strict requirements on participation rates are enforced (Meinck, 2015a).
Within a school, classes are usually randomly drawn with equal probability of selection, and the class then has a weight inversely proportional to its probability of selection. As in the case of sampled schools, adjustment is made for non-participation. Let $\delta_i$ be the number of participating classes in school $i$ out of the number $c_i$ of sampled classes. Let $C_i$ be the total number of eligible classes in school $i$. Let $A_{h2}$, the class non-participation adjustment for stratum $h$, be $n_h$ divided by the sum over participating schools $i$ in the stratum of the class participation fractions $\delta_i/c_i$. The class weight component for sampled class $j$ of sampled school $i$ is then $F_{hij2} = A_{h2} C_i/c_i$. The overall weight of class $j$ of school $i$ is $G_{hij2} = F_{hi1} F_{hij2}$. The inverse of $G_{hij2}$ estimates the joint probability that school $i$ and class $j$ are both sampled and participate.
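The school and class weight components described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names and the numbers for the example stratum are hypothetical and are not taken from TIMSS data.

```python
def school_weight(M_h, m_i, n_h, n_hnr):
    """School weight F_hi1 = A_h1 * M_h / (n_h * m_i)."""
    # School non-participation adjustment A_h1 = (n_h + n_hnr) / n_h
    A_h1 = (n_h + n_hnr) / n_h
    return A_h1 * M_h / (n_h * m_i)


def class_adjustment(n_h, participation_fractions):
    """Class adjustment A_h2: n_h divided by the sum of delta_i / c_i
    over the participating schools of the stratum."""
    return n_h / sum(participation_fractions)


def class_weight(A_h2, C_i, c_i):
    """Class weight component F_hij2 = A_h2 * C_i / c_i."""
    return A_h2 * C_i / c_i


# Hypothetical stratum: total size measure 5000, 10 participating schools,
# no lost schools, and a sampled school with size measure 100.
F_hi1 = school_weight(M_h=5000, m_i=100, n_h=10, n_hnr=0)

# All sampled classes participate in all 10 schools, so each fraction is 1.
A_h2 = class_adjustment(n_h=10, participation_fractions=[1.0] * 10)

# The school has 4 eligible classes, of which 2 are sampled.
F_hij2 = class_weight(A_h2, C_i=4, c_i=2)

# Overall class weight G_hij2 = F_hi1 * F_hij2
G_hij2 = F_hi1 * F_hij2
print(F_hi1, F_hij2, G_hij2)
```

In this example the sampled school represents 5 schools' worth of size measure and each sampled class represents 2 classes of its school, so each sampled class carries an overall weight of 10.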
In some cases, classes within schools are divided into strata, and classes are randomly selected within strata. This approach could be used, for example, if schools have classes with different languages of instruction and aim for a specific sample size for each language. Such stratification of classes within schools is used by some countries in recent TIMSS and PIRLS studies. Simple changes in arguments must then be made.
Within classes, let $n_{ij}$ be the number of students in the class, let $n_{ij1}$ be the number of selected students in the class, let $n_{ij3}$ be the number of selected students in the class who participate, and let $n_{ij2}$ be the number of students sampled who might have participated. (It is possible due to class changes that $n_{ij2}$ and $n_{ij1}$ differ.) Students who are selected and participate receive a weight component $F_{hij3}$ that is determined by these counts, and the overall student weight is $G_{hij3} = G_{hij2} F_{hij3}$. The inverse of $G_{hij3}$ is the estimated joint probability that student $k$ is a sampled and participating member of sampled and participating class $j$ from sampled and participating school $i$. If non-participation does not exist for schools and classes and all students in a class are sampled, then the student weight $G_{hij3}$ reduces to $M_h C_i/(n_h m_i c_i)$. TIMSS also allows subsampling of students within classes. In this case, classes are sampled with PPS and students within classes are sampled with systematic simple random sampling (systematic SRS). This procedure was, however, used exclusively for Singapore during the last cycles of the studies. For simplicity we do not extend the paper to this special case; however, such an extension is straightforward.
Let $Y$ be a real student measurement variable with value $Y_{ijk}$ for student $k$ from class $j$ of school $i$, and let $\bar{Y}$ be the mean of the $S$ values of $Y$. The estimated mean $\bar{Y}_s$ is then the ratio estimate with numerator equal to the sum of $G_{hij3} Y_{ijk}$ over observed students $k$, classes $j$, and schools $i$ for which $Y_{ijk}$ is available and denominator equal to the corresponding sum of $G_{hij3}$ over observed students $k$, classes $j$, and schools $i$ for which $Y_{ijk}$ is available (Hájek, 1971).
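The ratio estimate of a student mean can be written in a few lines of Python. This is a minimal sketch: the function name `weighted_mean`, the use of `None` to mark an unavailable value, and the example weights and scores are assumptions made for illustration only.

```python
def weighted_mean(weights, values):
    """Ratio (Hajek-type) estimate: sum(w * y) / sum(w), where both sums
    run only over cases for which the value y is available."""
    pairs = [(w, y) for w, y in zip(weights, values) if y is not None]
    numerator = sum(w * y for w, y in pairs)
    denominator = sum(w for w, _ in pairs)
    return numerator / denominator


# Hypothetical final student weights and achievement scores;
# the third student has no available score and is skipped in both sums.
G = [10.0, 10.0, 20.0, 20.0]
Y = [500.0, 520.0, None, 480.0]

estimate = weighted_mean(G, Y)
print(estimate)  # (5000 + 5200 + 9600) / 40 = 495.0
```

Because the same cases enter the numerator and the denominator, multiplying all weights by a constant leaves the estimate unchanged.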

Two types of teacher weights: student-and teacher-centered weights
Scholars familiar with the TIMSS data will be aware that teacher weights are already provided in publicly available data files. In this research paper, however, we distinguish two types of teacher weights. The teacher weights that are already available are linked to the students of the responding teachers. These weights are labeled teacher weights (TCHWGT) in the TIMSS 2019 database. To emphasize their relation with the student population, we call these weights student-centered teacher weights (s-tchwgt). If s-tchwgt is used, students are the units of analysis. These weights are derived by dividing the final student estimation weight by the number of teachers related to an individual student. For example, suppose a student has a final weight of 10 and two science teachers. In this case, the student record is duplicated and merged with the data of both teachers, and s-tchwgt for each case in the resulting file has a value of 10/2 = 5. As pointed out in the introduction, this weight is useful to describe average features of target grade students. It allows statements such as: "50 percent of students in country X have science teachers with a postgraduate degree." The second type of teacher weights, which are the subject of this research paper, provide an approach for teacher-centered analysis and will be named teacher-centered teacher weights (t-tchwgt). With reference to the above example, t-tchwgt could be used to estimate the number of science teachers in the targeted teacher population who completed a postgraduate degree. In the following section, we will present the issue in a more formal way.
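The derivation of student-centered teacher weights can be sketched in Python as follows. The records, identifiers, and the `postgraduate` indicator are hypothetical; the sketch only mirrors the rule stated above (student weight divided by the number of linked teachers) and is not the procedure used to build the published files.

```python
from collections import defaultdict

# Hypothetical student records: (student id, final student weight, science teachers)
students = [
    ("s1", 10.0, ["tA", "tB"]),  # two teachers: each link gets 10 / 2 = 5
    ("s2", 10.0, ["tA"]),
    ("s3", 12.0, ["tB"]),
]

# One row per student-teacher link, weighted by student weight / number of teachers.
links = [
    (student, teacher, weight / len(teachers))
    for student, weight, teachers in students
    for teacher in teachers
]

# Sum of link weights per teacher.
teacher_totals = defaultdict(float)
for _, teacher, link_weight in links:
    teacher_totals[teacher] += link_weight

# Hypothetical teacher attribute used for a student-centered statement.
postgraduate = {"tA": True, "tB": False}
share = sum(w for _, t, w in links if postgraduate[t]) / sum(w for _, _, w in links)
print(dict(teacher_totals), share)
```

The link weights add up to the sum of the student weights, which is why the resulting share describes students (here, the weighted share of students taught by a teacher with a postgraduate degree) rather than teachers.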
To describe the current student-centered teacher weights in TIMSS, consider a teacher variable $U$ with value $U_{it}$ for teacher $t$ in target school $i$ for a specific subject (mathematics or science). We begin with the student-centered case. For each student $k$ in class $j$ of school $i$, let $K_{ijk}$ be the number of teachers the student has for the subject under consideration. Let the student-centered population weight $W_{it}$ of teacher $t$ in school $i$ be the sum of the fractions $1/K_{ijk}$ for all students $k$ in a class $j$ who are taught by teacher $t$. The student-centered population mean $\bar{U}_W$ of the teacher variable $U$ is the ratio with numerator equal to the sum of the products $W_{it} U_{it}$ for teachers $t$ in target schools $i$ and denominator equal to the corresponding sum $S$ of the weights $W_{it}$. Recall that the target population has $S$ students. The population mean $\bar{U}_W$ is also the population mean over all students $k$ in classes $j$ in schools $i$ of the average of the $U_{it}$ for the $K_{ijk}$ teachers $t$ who instruct the student. For sampled teacher $t$ of sampled and participating school $i$, let the student-centered sampling weight $W_{its}$ be the sum of $G_{hij3}/K_{ijk}$ over sampled and participating students $k$ from sampled and participating classes $j$ of school $i$ who have teacher $t$. Then the student-centered estimated mean $\bar{U}_{Ws}$ is the ratio with numerator equal to the sum of the products $W_{its} U_{it}$ over sampled teachers $t$ from sampled and participating schools $i$ for whom $U_{it}$ is observed and denominator equal to the sum of the $W_{its}$ over sampled teachers $t$ from sampled and participating schools $i$ for whom $U_{it}$ is observed. The estimates $\bar{U}_{Ws}$ are used in TIMSS.
In the case of teacher-centered weights, let $D_i$ be the number of teachers in school $i$ for a targeted subject, let $D_+$ be the sum of the $D_i$ over all target schools $i$, and let $\Sigma(U)$ be the total of the $U_{it}$ for the $D_+$ teachers $t$ in target schools $i$. The teacher-based mean $\bar{U} = \Sigma(U)/D_+$ of the teacher variable $U$ for teachers $t$ in target schools $i$ is just the arithmetic mean of the $U_{it}$ over teachers $t$ in schools $i$. With current data, $\bar{U}$ cannot be estimated. Nonetheless, it is possible to consider how $\bar{U}$ and $\bar{U}_W$ compare. To aid in comparison, let $V_{it} = D_+ W_{it}/S$ be the adjusted student-centered population weight, so that the average $\bar{V}$ of the $V_{it}$ is 1. Then $\bar{U}$ is the average of the products $\bar{V} U_{it}$, while $\bar{U}_W$ is the average of the products $V_{it} U_{it}$. If either the student-centered population weights $W_{it}$ are constant, so that each $W_{it}$ is the average number $S/D_+$ of students per teacher, or the variables $U_{it}$ are constant, so that each $U_{it}$ is $\bar{U}$, then $\bar{U}_W$ and $\bar{U}$ are equal. Arguments here are most appropriate if no teachers teach the same target subject in the same grade at more than one school. Otherwise, some modifications are required.
To establish an upper bound on the difference $|\bar{U}_W - \bar{U}|$ for the case in which neither the teacher variables $U_{it}$ nor the student-centered population weights $W_{it}$ are constant, let $\sigma(U)$ be the population standard deviation of the teacher variables $U_{it}$ for teachers $t$ in target schools $i$, so that $\sigma(U)$ is the square root of the mean of the squared deviations $(U_{it} - \bar{U})^2$, and let $\sigma(W)$ be the corresponding population standard deviation of the student-centered weights $W_{it}$ for teachers $t$ in schools $i$. By assumption, both $\sigma(U)$ and $\sigma(W)$ are positive. Let the population correlation coefficient of the $U_{it}$ and $W_{it}$ be $\rho(U, W)$. The difference between $\bar{U}_W$ and $\bar{U}$ is the average of the products $(V_{it} - 1) U_{it}$. Because the average of the differences $(V_{it} - 1)$ is 0, the average of the products $(V_{it} - 1)\bar{U}$ is also 0. Thus the difference $\bar{U}_W - \bar{U}$ is the average of the products $(V_{it} - 1)(U_{it} - \bar{U})$, that is, the population covariance of the $V_{it}$ and $U_{it}$. Because each $V_{it}$ is a positive multiple of $W_{it}$, the correlation coefficient of the $V_{it}$ and $U_{it}$ is $\rho(U, W)$, and

$$\bar{U}_W - \bar{U} = \rho(U, W)\,\sigma(V)\,\sigma(U), \qquad \sigma(V) = D_+ \sigma(W)/S,$$

so that $|\bar{U}_W - \bar{U}| \le \sigma(V)\sigma(U)$. Thus a small absolute relative difference $|\bar{U}_W - \bar{U}|/\sigma(U)$ results if either the standard deviation of the adjusted weight variables $V_{it}$ is small or the absolute value of the correlation coefficient of the $V_{it}$ and $U_{it}$ is small. If all classes in the target population have only one teacher for the subject of interest and all teachers teach the same number of students, then this standard deviation is 0.
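The relation between the student-centered and teacher-centered population means can be checked numerically. The following Python sketch uses a small hypothetical population of four teachers (the values of `U` and `W` are invented for illustration) and verifies that the difference of the two means equals the covariance of the adjusted weights and the teacher variable, and that it is bounded by the product of the two standard deviations.

```python
import math

# Hypothetical teacher population: years of experience U and
# student-centered population weights W (here, students taught).
U = [5.0, 12.0, 20.0, 30.0]
W = [18.0, 25.0, 30.0, 27.0]

D_plus = len(U)                      # number of teachers
S = sum(W)                           # number of students
V = [D_plus * w / S for w in W]      # adjusted weights with mean 1

U_bar = sum(U) / D_plus                          # teacher-centered mean
U_W = sum(w * u for w, u in zip(W, U)) / S       # student-centered mean


def pop_sd(x):
    """Population standard deviation (divisor = number of teachers)."""
    mean = sum(x) / len(x)
    return math.sqrt(sum((xi - mean) ** 2 for xi in x) / len(x))


# Population covariance of V and U, and the implied correlation.
cov_VU = sum((v - 1.0) * (u - U_bar) for v, u in zip(V, U)) / D_plus
rho = cov_VU / (pop_sd(V) * pop_sd(U))

difference = U_W - U_bar
print(U_bar, U_W, difference, rho)
```

In this invented population, teachers with more experience tend to teach more students, so the student-centered mean (18.0) exceeds the teacher-centered mean (16.75).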

Teacher-centered inference: methods
A simple change in data collection permits direct study of teachers of students in the target population (Hooper et al., 2022). The key is to record, for each sampled teacher in a particular grade and subject in a participating school, the total number of classes taught by that teacher in the same school, subject, and grade. In this way, two approaches described herein have been proposed to estimate the distribution of teacher variables in the target population of teachers (Hooper et al., 2022). Horvitz-Thompson estimation (Horvitz and Thompson, 1952), which is abbreviated as HT, is a traditional method to obtain unbiased estimates of sums of population variables under sampling without replacement. The other approach, multiplicity-adjusted indirect sampling (MAIS), provides simplified analysis that involves possible multiple counting of the same teacher. Both approaches lead to unbiased estimation of sums of teacher variables in the target population if non-participation adjustments are not required. HT has the advantage of fixed weights but requires simple random sampling of classes within schools. MAIS has the advantage of applicability to sampling of classes by methods not equivalent to simple random sampling. In addition, MAIS is much easier to describe, so that it will be emphasized in applications. Theoretical results are derived for variances and their estimates for both the HT and MAIS approaches; however, due to its wider applicability, the MAIS approach will be used to obtain indications of the potential accuracy of estimated means of teacher variables for individual educational systems.
Because the information required for the analysis is not currently obtained in TIMSS, the analysis considers plausible scenarios for teacher weights rather than direct use of teacher weights. In addition to consideration of variances, this paper also treats problems of teacher non-response via approaches similar to those used in TIMSS for student non-response, class non-response, and school non-response.
In both approaches under consideration, the procedure for sampling classes is the standard one in TIMSS. The two approaches HT and MAIS diverge once classes are sampled. Let $d_{it}$ of the $C_i$ classes be taught for a given subject, mathematics or science, at least in part by teacher $t$, and let $d_{its}$ of the $c_i$ sampled classes be taught by that teacher. Let $\delta_{it}$ be the number of sampled teachers who participate in school $i$. Let the teacher non-participation adjustment $A_{ht}$ in stratum $h$ be $n_h$ divided by the sum over participating schools $i$ in the stratum of the fractions $\delta_{it}/d_{it}$.
As in the development of student-centered weights, let $D_i$ be the number of teachers $t$ in the school, and let $D_+$ be the sum of the $D_i$ over schools in the target population. The challenge is estimating $\bar{U}$ by use of the participating teachers $t$ associated with the $c_i$ classes sampled from each sampled school $i$.
To describe the HT approach to teacher-centered weights, consider computing the probability that a teacher $t$ from school $i$ is in a sampled class given that school $i$ has been sampled. If $c_i$ classes are sampled randomly, so that $C_i - c_i$ classes are not sampled, then the probability $T_{it}$ that teacher $t$ teaches at least one sampled class is

$$T_{it} = \begin{cases} 1, & C_i - c_i < d_{it}, \\ 1 - \prod_{a=0}^{c_i - 1} \dfrac{C_i - d_{it} - a}{C_i - a}, & C_i - c_i \ge d_{it}. \end{cases}$$

The formula for $C_i - c_i < d_{it}$ applies because it is impossible in this case for teacher $t$ not to be sampled. The alternative case holds since the product of $C_i - d_{it} - a$ over non-negative integers $a < c_i$ is the number of ordered samples of classes of size $c_i$ that do not include teacher $t$ and the product of $C_i - a$ over non-negative integers $a < c_i$ is the total number of ordered samples of classes of size $c_i$. In the simplest case, $c_i = 1$ and $T_{it} = d_{it}/C_i$. Then the sampling weight is $W_{itH} = F_{hi1} A_{ht}/T_{it}$ for participating sampled teacher $t$ from school $i$. The teacher-centered sample mean $\bar{U}_H$ based on the HT approach is then the ratio estimate with numerator equal to the sum of the products $W_{itH} U_{it}$ over participating sampled teachers $t$ in participating and sampled schools $i$ for which $U_{it}$ is observed and denominator equal to the sum of the $W_{itH}$ over participating sampled teachers $t$ in participating and sampled schools $i$ for which $U_{it}$ is observed. As expected from Horvitz-Thompson estimation, for a school $i$ with no non-participation of teachers and all $U_{it}$ observed for sampled teachers, the sum of $U_{it}/T_{it}$ over sampled teachers $t$ estimates the sum $U_{i+}$ of $U_{it}$ over all targeted teachers $t$ in the school. The sum of the products $W_{itH} U_{it}$ over sampled and participating teachers $t$ in sampled and participating schools $i$ then estimates the sum of the $U_{it}$ over all teachers $t$ in schools $i$ from the target population.
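The within-school selection probability of a teacher under simple random sampling of classes can be computed directly from the product described above. The Python sketch below is an illustration only; the function names `inclusion_probability` and `ht_weight` and the example values are hypothetical.

```python
from math import comb, isclose


def inclusion_probability(C_i, c_i, d_it):
    """Probability T_it that a teacher who teaches d_it of the C_i eligible
    classes teaches at least one of the c_i classes drawn by simple random
    sampling without replacement."""
    if C_i - c_i < d_it:
        # Too few unsampled classes to hold all of the teacher's classes.
        return 1.0
    prob_not_selected = 1.0
    for a in range(c_i):
        prob_not_selected *= (C_i - d_it - a) / (C_i - a)
    return 1.0 - prob_not_selected


def ht_weight(F_hi1, A_ht, T_it):
    """HT teacher weight W_itH = F_hi1 * A_ht / T_it."""
    return F_hi1 * A_ht / T_it


# Simplest case: one sampled class, so T_it = d_it / C_i.
assert inclusion_probability(C_i=4, c_i=1, d_it=2) == 0.5

# Cross-check against the equivalent count of unordered samples.
T = inclusion_probability(C_i=4, c_i=2, d_it=2)
assert isclose(T, 1 - comb(2, 2) / comb(4, 2))  # 5/6

# Hypothetical school weight 5.0 and no teacher non-participation.
print(T, ht_weight(F_hi1=5.0, A_ht=1.0, T_it=T))
```

The ratio of binomial coefficients used in the cross-check is the same quantity as the product of ordered-sample counts, since the ordering factor cancels between numerator and denominator.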
In the MAIS approach, the sample weight is $W_{itM} = G_{ih2} A_{ht} d_{its}/d_{it}$ if teacher $t$ is sampled and participates in sampled and participating school $i$. The teacher-centered sample mean $\bar{U}_M$ based on the MAIS approach is then the ratio with numerator equal to the sum of the products $W_{itM} U_{it}$ over participating sampled teachers $t$ in participating and sampled schools $i$ for which $U_{it}$ is observed and denominator equal to the sum of the $W_{itM}$ over participating sampled teachers $t$ in participating and sampled schools $i$ for which $U_{it}$ is observed. If $d_{it} > 1$ for a sampled teacher $t$ in school $i$, then the count $d_{its}$ and the sample weight $W_{itM}$ are not constant. Nonetheless, $d_{it} c_i/C_i$ is the expected value of the number $d_{its}$ of times teacher $t$ teaches a sampled class. This expected value is also the product of the probability $T_{it}$ that $d_{its} > 0$ and the expected value of $d_{its}$ given that $d_{its} > 0$. It follows that $d_{its}$ given that $d_{its}$ is positive has expected value $d_{it} c_i/(C_i T_{it})$, so that $W_{itM}$ and $W_{itH}$ have the same expected value given selection of teacher $t$. As a consequence, both the MAIS and HT approaches provide comparable estimates of the teacher-centered mean $\bar{U}$. Although the simpler form of the MAIS estimate is an attraction in a comparison with the HT estimate, a more important consideration is that MAIS can be employed when simple random sampling of classes is not present as long as the expected value of $d_{its}$ is $d_{it} c_i/C_i$. The HT approach must be modified if simple random sampling of classes is not employed within schools.
In a number of cases, the HT and MAIS approaches coincide. If, for all schools $i$, either the number of sampled classes $c_i$ is 1, $c_i = C_i$, or the number $d_{it}$ of classes each teacher $t$ instructs is always 1, then $W_{itH} = W_{itM}$ for all sampled and participating teachers $t$ and $\bar{U}_M = \bar{U}_H$.
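The comparison of the HT and MAIS approaches within one school can be illustrated by enumerating every possible simple random sample of classes. The Python sketch below is a simplified illustration: it ignores the school weight and all non-participation adjustments, so it compares only the within-school factors $1/T_{it}$ (HT) and $(C_i/c_i)\,d_{its}/d_{it}$ (MAIS). The function name `compare_multipliers` and the class indices are hypothetical.

```python
from itertools import combinations
from math import isclose


def compare_multipliers(C_i, c_i, taught):
    """Enumerate all samples of c_i out of C_i classes for a teacher who
    teaches the classes in `taught` (a non-empty set of class indices).

    Returns the expected HT factor, the expected MAIS factor, and whether
    the two factors agree in every possible sample."""
    d_it = len(taught)
    samples = list(combinations(range(C_i), c_i))
    # Number of sampled classes taught by the teacher, per sample.
    d_its = [len(taught.intersection(sample)) for sample in samples]
    # Selection probability obtained by direct counting.
    T_it = sum(1 for d in d_its if d > 0) / len(samples)
    ht = [1.0 / T_it if d > 0 else 0.0 for d in d_its]
    mais = [(C_i / c_i) * d / d_it for d in d_its]
    mean_ht = sum(ht) / len(samples)
    mean_mais = sum(mais) / len(samples)
    identical = all(isclose(h, m) for h, m in zip(ht, mais))
    return mean_ht, mean_mais, identical


# Teacher with 2 of 4 classes, 2 classes sampled: both factors average 1
# over all samples (each teacher counted once in expectation), but they
# differ from sample to sample.
print(compare_multipliers(4, 2, {0, 1}))

# The factors coincide when one class is sampled, when the teacher teaches
# a single class, or when all classes are sampled.
print(compare_multipliers(4, 1, {0, 1})[2])
print(compare_multipliers(4, 2, {0})[2])
print(compare_multipliers(3, 3, {0, 1})[2])
```

An average factor of 1 over all samples corresponds to unbiased counting of the teacher under either approach; the sample-to-sample variation of the MAIS factor is what the text refers to when it notes that the MAIS weight is not constant.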

Teacher-centered inferences in TIMSS: changes needed in data collection
Although the current sampling procedure and data collection in TIMSS do not permit simple inferences about the distribution of characteristics for teachers who participate in instruction of mathematics or science in the fourth or eighth grade, it is possible to add a new school-level form to permit such inferences without changing other aspects of sample design and data collection described in Johansone (2020). For each grade examined (4 or 8), the required new form for a participating school $i$ includes a list of the $C_i$ classes eligible for sampling. The list specifies for each eligible class all teachers of mathematics or science who instruct at least some class students. We acknowledge that this list is more complex than the current class listing form and requires some additional work by the school coordinators. We therefore recommend a field trial to provide a thorough usability test. With the new listing form, it is straightforward to determine the number $d_{it}$ of classes taught, at least in part, by a teacher $t$ in school $i$. It is quite common in the fourth grade to have a single teacher who provides all mathematics and science instruction for a class. In this case, values of $d_{it}$ will typically be small. On the other hand, it is much less common in the eighth grade for only a single teacher to provide all mathematics and science instruction for a class. Thus larger values of $d_{it}$ may be encountered. Given the new form, no other procedures in TIMSS need be changed in order to replace student-centered weights by teacher-centered weights.
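How the class counts per teacher could be derived from such a listing form can be sketched in Python. The layout of the listing (a mapping from class names to teacher identifiers for one school and one subject), the identifiers, and the sampled classes are all hypothetical; the actual form would have to be designed and field-tested.

```python
from collections import Counter

# Hypothetical listing for one school and one subject:
# each eligible class with all teachers who instruct at least some of its students.
class_listing = {
    "4A": ["T01"],
    "4B": ["T01", "T02"],
    "4C": ["T03"],
}
sampled_classes = ["4B", "4C"]

C_i = len(class_listing)      # number of eligible classes
c_i = len(sampled_classes)    # number of sampled classes

# d_it: number of eligible classes taught, at least in part, by each teacher.
d_it = Counter(
    teacher
    for teachers in class_listing.values()
    for teacher in set(teachers)  # count a teacher once per class
)

# d_its: number of sampled classes taught by each teacher.
d_its = Counter(
    teacher
    for class_name in sampled_classes
    for teacher in set(class_listing[class_name])
)

print(C_i, c_i, dict(d_it), dict(d_its))
```

Teachers who appear in the listing but in none of the sampled classes simply have a count of 0 in `d_its`, which is the default value returned by `Counter` for a missing key.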

Adjustment for teachers in multiple schools
If a teacher works in the target grade and subject in more than one school in the target population, then the selection probability is affected. We propose to handle this situation as done in other studies such as ICCS (Zuehlke and Vandenplas, 2011), ICILS (Meinck and Cortes, 2015), and TALIS (OECD, 2014). That is, we propose to add the following question to the teacher questionnaire: "At the moment, in how many other schools do you teach mathematics [/science] to target grade students?". Based on the response, another weight adjustment factor would be included in the computation of the teacher weights, calculated as the inverse of the total number of schools in which a teacher teaches target grade students in the respective subject. For example, the total weight of a science teacher teaching this subject to target grade students in two schools will be halved. Note that this weight adjustment factor is called the "teacher multiplicity factor" or "teacher multiplicity adjustment" in the studies cited above, but it should not be confused with the multiplicity adjustment of the MAIS approach. Both address the issue of multiple selection probabilities of teachers; the difference is that one handles multiple selection probabilities within the sampled school, and the other handles them across different schools (whether sampled or not). For a more formal description of the computation see, e.g., Meinck and Cortes (2015).
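The adjustment described above amounts to a single division. The Python sketch below is illustrative only; the function names and the argument `other_schools` (the answer to the proposed questionnaire item) are hypothetical.

```python
def multiplicity_factor(other_schools: int) -> float:
    """Inverse of the total number of schools in which the teacher teaches
    the target grade and subject (the sampled school plus the others reported)."""
    if other_schools < 0:
        raise ValueError("number of other schools cannot be negative")
    return 1.0 / (1 + other_schools)


def adjust_for_multiple_schools(teacher_weight: float, other_schools: int) -> float:
    """Apply the between-school multiplicity factor to a teacher weight."""
    return teacher_weight * multiplicity_factor(other_schools)


# A science teacher who reports one other school teaches in two schools in total,
# so the weight is halved.
halved = adjust_for_multiple_schools(40.0, other_schools=1)
```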
To gain insight into whether weight adjustments for teachers working at more than one school would be needed in practice, we analyzed the TALIS 2018 database². TALIS is a teacher and school leader survey with 48 participating education systems in the 2018 cycle. The core target population is lower secondary school teachers (ISCED level 2), but countries can also survey primary and upper secondary schools (ISCED levels 1 and 3). For each education system, a sample of about 200 schools and 20 teachers per school was drawn (OECD, 2019b). Table 1 shows the number of TALIS 2018 participating education systems that have a specified weighted percentage of teachers who indicated working at more than one school. The weighted percentage of teachers reporting working at more than one school is below 5% for most of the education systems. However, there are also education systems in all four groups for which the estimated percentage of such teachers exceeds 10%. Note that teachers are likely to teach the TIMSS and PIRLS target grades in multiple schools even more rarely, because ISCED levels cover multiple grades while TIMSS and PIRLS cover just one grade. This finding implies that weight adjustments might be necessary for only a limited number of educational systems, and it supports our decision to ignore this issue in the analysis that follows.

Sample sizes
To explore the use of TIMSS and PIRLS data for the practical implementation of teacher-centered weights, the teacher sample sizes of both studies were investigated using the TIMSS 2019³ and PIRLS 2016⁴ databases. The TIMSS 2019 sample sizes for teachers and schools were calculated separately for each participating country or benchmarking system⁵ and for each of the four defined populations (see Table 13 in the Appendix). Within each population, only unique teacher identifiers (IDs, variable IDTEACH in the TIMSS and PIRLS databases) and unique school IDs (variable IDSCHOOL in the TIMSS and PIRLS databases) were considered. One result of this approach is that a teacher of two sampled classes is counted as only one teacher in the calculation of the respective sample size. The same approach was taken for PIRLS, where only one teacher population would be considered, that is, reading/language teachers of fourth-grade students.

⁴ PIRLS 2016 International Database, https://www.iea.nl/data-tools/repository/pirls (accessed on January 21st, 2022).
⁵ With TIMSS 2003, TIMSS introduced a so-called Benchmarking Program, which also allows sub-entities of countries to participate in the survey (Martin and Mullis, 2004). We will use the term educational system for a participating country or benchmarking system in the following.
In TIMSS, the sample sizes of teachers vary substantially among participating education systems (summary statistics for the four teacher populations can be found in Tables 2 and 3). For example, the samples of fourth-grade mathematics teachers in Pakistan, Northern Ireland, and Hong Kong SAR are rather small (below 160), whereas the United Arab Emirates' sample size is 1073. Overall, the sample size exceeds, with few exceptions, 150 in all teacher populations, and the school sample size is at least 98 (Malta) in all populations. This seems to be a promising finding with regard to future teacher-centered analyses. On average, sample sizes vary between 266 (fourth-grade mathematics teachers) and 382 (eighth-grade science teachers).

A comparison of sample sizes of mathematics versus science teachers in the fourth grade shows that the two sample sizes do not differ much in most of the educational systems. This result is partly due to an overlap of science and mathematics teachers in the fourth grade. In 43 educational systems, more than 50% of the mathematics teachers also teach science, and in 18 educational systems, more than 90% of the mathematics teachers also teach science. Exceptions are educational systems like Bahrain, Kuwait and South Africa. These educational systems have as many mathematics as science teachers and no overlap between these groups. Among the educational systems that participated in TIMSS in both the fourth grade and the eighth grade, most (27 out of 38) have a larger science teacher sample in the eighth grade than in the fourth grade.
The sample sizes of teachers were also analyzed at the school level. Figure 3 displays the percentages of schools with a given number of participating teachers per school in TIMSS 2019; lines connect the values for a given education system. As can be seen from the figure, there is substantial variation between countries regarding the obtained number of teachers per school, affecting the total sample size of teachers. In the majority of sampled schools in all countries, only one or two teachers are obtained. This result can also be seen in Fig. 4, which shows the international mean percentage of schools that have 1, 2, 3 or more than 3 teachers per school. The situation is slightly different for eighth-grade science teachers, where data from four or more teachers are collected per school in a considerable number of countries, because specialist teachers of the different sciences (physics, chemistry, earth science, biology etc.) exist and respond to the questionnaires. Consequently, given the current TIMSS sampling design, the sample size for the four teacher populations of interest can vary between a minimum determined by the minimum school and class sample size (150 schools with one class in TIMSS), multiplied by the school, class and teacher participation rates, and a relatively large number in countries with large school samples, multiple selected classes within schools, or structural conditions that require multiple teachers teaching a class. Very small countries with school censuses (e.g., Malta) may have even smaller samples.
The sample sizes of fourth-grade teachers in PIRLS show a pattern similar to those in TIMSS. Sample sizes of teachers vary between 122 (Macao SAR) and 1119 (Canada). On average, educational systems have a sample size of 271 teachers. In most of the participating schools, one or two teachers participated in the survey. More information about the sample sizes in PIRLS can be found in Figs. 4, 5 and Table 13.

Sample variances for estimates of teacher variables
The efforts described above to obtain teacher-centered teacher weights are only reasonable if the results have an acceptable level of precision. In the following, we investigate the likely levels of sampling variance when estimating teacher population characteristics. Large sampling variance could be due, among other factors, to relatively small samples or relatively large variance of weights. An acceptable level of sampling variance could be determined in various ways. One standard involves the accuracy of student-centered teacher summaries that TIMSS currently reports. Another standard is based on the regular TIMSS requirements for measurement of student achievement, namely that national student samples should provide for a standard error no greater than .035 standard deviation units for the country's mean achievement. Sample estimates of any student-level percentage estimate (e.g., a student background characteristic) should have a confidence interval of ±3.5% (LaRoche et al., 2020). Given the relatively small teacher samples, this precision cannot be reached, even if the design effect of estimates associated with the teacher samples were close to 1 because clustering effects are expected to be negligible (very small cluster sizes; teacher variables have lower intra-class correlation coefficients than student variables (Meinck, 2015b)). However, given the sample sizes presented in Table 13 (see Appendix), many but not all precision levels can be expected to correspond to an effective sample size of at least 150, a value that translates to a standard error of .08 standard deviation units. We claim that teacher population estimates reaching this minimum level of precision can be deemed satisfactory. Moreover, it might be informative to compare the sampling variance of an estimator based on teacher-centered versus student-centered teacher weights (Dumais and Morin, 2019; Schulz, 2020). We use TIMSS 2019 data for the analysis. However, because we are missing one important piece of information needed to compute the teacher-centered teacher weights, namely how many classes a participating teacher teaches, we consider some plausible scenarios to suggest possible results of teacher-centered weights. These scenarios clearly do not obviate the importance of a pilot study to examine teacher-centered weights, but they do provide some indication of how results for teacher-centered weights might differ from those from student-centered weights.
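The precision targets mentioned above can be related to effective sample sizes with simple arithmetic. The Python sketch below rests on assumptions made here for illustration: the standard error of a mean in standard deviation units is taken as 1/sqrt(n) for an effective sample size n, and the ±3.5% requirement is read as a 95% confidence interval for a percentage near 50%. The function names are hypothetical.

```python
import math


def se_in_sd_units(n_eff: float) -> float:
    """Standard error of a mean, in standard deviation units, for effective size n_eff."""
    return 1.0 / math.sqrt(n_eff)


def n_eff_for_se(se_sd_units: float) -> float:
    """Effective sample size needed to reach a given standard error (in SD units)."""
    return 1.0 / se_sd_units ** 2


def n_eff_for_percentage_ci(half_width: float, p: float = 0.5, z: float = 1.96) -> float:
    """Effective sample size so that z * sqrt(p(1-p)/n) equals half_width (assumed reading)."""
    return z ** 2 * p * (1.0 - p) / half_width ** 2


se_150 = se_in_sd_units(150)                 # about 0.082, i.e. roughly .08 SD units
n_student_mean = n_eff_for_se(0.035)         # about 816
n_student_pct = n_eff_for_percentage_ci(0.035)  # about 784
```

Under these assumptions, the student-level requirements correspond to effective sample sizes of roughly 800, well above the teacher sample sizes typically obtained, whereas an effective sample size of 150 yields the standard error of about .08 standard deviation units quoted in the text.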
In this discussion, student-centered weights for teacher characteristics are computed according to current reporting practice in TIMSS 2019. For teacher-centered weights, results are obtained for approximations of the MAIS approach. We consider the following two scenarios.
Scenario 1: Class-centered weights. The teacher-centered MAIS weight $W_{itM}$ for teacher $t$ in school $i$ of stratum $h$ is certainly no greater than the sum $W_{itC}$ of the class weights $G_{hij2}$ for all the sampled classes $j$ associated with teacher $t$. This sum is used for class-centered weights. The class-centered weight $W_{itC}$ is $W_{itM}$ if teacher $t$ teaches all classes, so that $d_{its} = d_{it} = C_i$, or if teacher $t$ only teaches a single class, so that $d_{its} = d_{it}$ if $t$ is sampled.
Scenario 2: School-centered weights. Because the class factor $F_{hij2}$ is always at least 1, the expected value $W_{itH}$ of $W_{itC}$ for a sampled teacher $t$ is always at least as large as the school weight $F_{hi1}$. This school weight is used for school-centered weights. In a few educational systems participating in TIMSS 2019, $F_{hi1} = W_{itM} = W_{itH}$. This situation only applies to Malta and Pakistan for the fourth grade for mathematics and science because all classes and teachers are sampled.
To assess the accuracy of the weighted means under study, jackknife repeated replication (JRR) for schools was employed as in TIMSS 2019, and a parallel analysis (SRS) was employed based on the classical formula for estimation of the variance of a weighted mean under simple random sampling (Cochran, 1977, Chapter 6). As in the JRR results, a finite population correction is not used. JRR has the advantage of consistency with current practice, but it should be emphasized that the resulting estimated standard errors need not be accurate. The use of JRR and the use of SRS are both based on assumptions of random sampling with replacement that clearly do not apply, given that populations of schools are finite, sampling of schools is without replacement, and sampling of schools within strata is systematic with a random start (Kish and Frankel, 1974). The issue of appropriateness of use of JRR in TIMSS also applies to existing student-centered weights. Nonetheless, the estimates may provide some guidance concerning reasonable expectations.
As an added check, unweighted results assuming simple random sampling with replacement of teachers in an educational system were obtained and both JRR and SRS were applied.
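The two kinds of standard error and the resulting design effect can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the operational TIMSS code: the jackknife shown is a generic paired variant with one replicate per zone (one unit of the pair doubled, the other dropped), the SRS formula is one common linearization for a weighted mean under with-replacement sampling, and all function names and the toy data are hypothetical.

```python
import math
from typing import Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def srs_se(values: Sequence[float], weights: Sequence[float]) -> float:
    """Linearized SE of a weighted mean, treating units as a with-replacement sample."""
    n = len(values)
    m = weighted_mean(values, weights)
    ss = sum((w * (v - m)) ** 2 for v, w in zip(values, weights))
    return math.sqrt(n / (n - 1) * ss) / sum(weights)


def jk2_se(values, weights, zones, halves) -> float:
    """Paired jackknife: per zone, double the weights of half 0 and zero those of half 1."""
    full = weighted_mean(values, weights)
    variance = 0.0
    for zone in sorted(set(zones)):
        replicate = []
        for w, z, h in zip(weights, zones, halves):
            if z == zone:
                replicate.append(2.0 * w if h == 0 else 0.0)
            else:
                replicate.append(w)
        variance += (weighted_mean(values, replicate) - full) ** 2
    return math.sqrt(variance)


# Toy data: four teachers in two zones, equal weights.
values = [1.0, 3.0, 2.0, 2.0]
weights = [1.0, 1.0, 1.0, 1.0]
zones = ["A", "A", "B", "B"]
halves = [0, 1, 0, 1]

se_jrr = jk2_se(values, weights, zones, halves)
se_srs = srs_se(values, weights)
design_effect = (se_jrr / se_srs) ** 2  # square of the ratio of JRR to SRS standard errors
```

With real data, the zone and half indicators would come from the sampling design, so that teachers from the same school always fall into the same half; the toy data only serve to make the arithmetic traceable.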
The full table of results is very large. For each of the two grades and two subjects, seven items were considered for this analysis (see Table 4; for further details on variables and scales see Martin et al. (2020)). We considered exclusively items that would provide interesting information on characteristics of the teacher population, such as gender, age, teaching experience, and job satisfaction. We did not consider variables that are related to a specific class and would hence not be suitable for teacher-centered analysis. Occasionally teacher responses were missing or inconsistent. The teacher's responses for science or mathematics were defined as the average of the non-missing responses if more than one teacher questionnaire was available.
For simplicity, this study primarily involves the study of weighted means; however, other summary statistics could easily be examined with the same methodology. For example, cumulative distribution functions can certainly be examined.
For the fourth grade, TIMSS 2019 provides data for 64 educational systems, while for the eighth grade, data for 46 educational systems are available. Thus in all, our analysis results in a table with 1540 rows. Table columns include the code and name of the educational system, the grade, the subject, the number of observations with item responses, the number of observations with omitted responses, the four estimated means, and the four estimated standard errors. Hence the full table is too large for presentation in this paper; however, it is available in supplementary materials as an R data frame and as an Excel spreadsheet.
A simple summary of results for the raw means and three weighted means is provided in Tables 5 and 6. Because variables vary considerably in their ranges, corresponding summaries of weighted standard deviations are provided in Tables 7 and 8. These summaries are averages across participating educational systems for each grade, subject, and item. Thus by themselves they only provide a rough notion of results. Nonetheless, it is worth noting that different weighting approaches do yield relatively similar average results across countries.
In terms of effect sizes, in which the difference of means for an item, country, subject, and grade is divided by the square root of the average of the corresponding variances, the average absolute value of the effect size for student-weighted versus class-weighted means is 0.036, while the corresponding average for student-weighted versus school-weighted means is 0.069. These average effect sizes are relatively small. Averages within grades and subjects vary little. Figure 6 provides an illustration of the similarity of student-centered (x-axis for each panel) and class-centered means (y-axis for each panel) in the case of science in the eighth grade (complementary figures for all other scenarios, that is, school-centered means and mathematics and science in both grades, can be found in the Appendix, see Figs. 8-14). To place all items on the same scale, the minimum value of the item score is subtracted from the mean and the result is divided by the range of the item score. Thus all values are between 0 and 1. For reference, the diagonal line has intercept 0 and slope 1.
Clearly all points are very close to the line. Nonetheless, despite the reported averages, it should be noted that effect sizes can sometimes be large. The most extreme case for comparison of student-centered and class-centered weights occurs in Pakistan for mathematics in the fourth grade for item ATBM10. In this case, the student-centered weighted mean is 2.683 and the class-centered weighted mean is 2.170. The respective weighted standard deviations are 1.671 and 1.479, so the effect size is 0.325. For comparison of student-centered and school-centered weighted means, the most extreme case is in the United States for mathematics in the eighth grade for item BTBM23. The student-centered weighted mean is 3.528, and the school-centered weighted mean is 2.910. The respective weighted standard deviations are 1.132 and 1.157. The corresponding effect size is 0.540. In these two instances, the difference in weighted means can have a substantial effect on interpretations of results.
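The effect size used here, the absolute difference of two means divided by the square root of the average of the two variances, can be checked against the two extreme cases quoted above. The Python function below is a small illustrative sketch; its name is hypothetical.

```python
import math


def effect_size(mean_a: float, mean_b: float, sd_a: float, sd_b: float) -> float:
    """Absolute mean difference divided by the square root of the average variance."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return abs(mean_a - mean_b) / pooled_sd


# Pakistan, grade 4 mathematics, item ATBM10: student-centered vs class-centered.
es_pakistan = effect_size(2.683, 2.170, 1.671, 1.479)

# United States, grade 8 mathematics, item BTBM23: student-centered vs school-centered.
es_usa = effect_size(3.528, 2.910, 1.132, 1.157)
```

Rounded to three decimals, the two values reproduce the reported effect sizes of 0.325 and 0.540.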
Standard errors are usually a significant concern in large-scale assessments because these studies rely on complex samples. These samples are characterized by various features, such as stratification and clustering, which prevent the use of standard formulas (assuming SRS) to estimate standard errors (Lohr, 1999). Looking at standard error estimates using both the SRS and the JRR approach, we investigate whether this may also be a concern for teacher-centered analysis. A summary of design effects is provided in Tables 9 and 10.
These design effects are averages over countries of squares of the ratios of standard errors from JRR and SRS. Average design effects are often close to 1, especially in the unweighted case, but average ratios are much higher in weighted cases for the scales ATBGTJS and BTBGTJS. Thus the design effects indicate a small but non-negligible effect of the complex design on standard errors, with clustering and unequal weights likely being the driving forces (see Meinck and Vandenplas (2021) for more details). The most extreme design effects are quite large. In the case of school-centered means for item ATBG02 in Latvia in fourth-grade mathematics, the design effect is about 28.9; however, there is a fundamental difficulty in this case because only one of 200 sampled teachers of mathematics in the fourth grade reports being male. In this case, instability of estimates of standard errors (and design effects) is not surprising. On the other hand, for class-centered means, the largest design effect, 14.4, arises in Australia for mathematics in the eighth grade for item BTBGTJS, pointing to a substantial clustering effect regarding job satisfaction of eighth-grade mathematics teachers in this country (i.e., teachers within the same school tend to have similar job satisfaction levels), inflated by the high variance in weights. For student-centered means, the most extreme ratio, 11.8, arises in Dubai for item ATBGTJS for mathematics in the fourth grade. Given these results, further analysis will be based on JRR, and a clear recommendation for using standard error estimation methods that account for the complex design is warranted. As evident from Tables 11 and 12, standard errors are a major concern for all of the weighted means under study. We noted above that a standard error of .08 standard deviation units might be deemed acceptable; however, even the average ratio between standard errors and standard deviations⁶ is higher for most variables and weighting scenarios, meaning more than half of the ratios for specific countries are higher than this value. Student-centered and class-centered estimates have similar ratios of standard errors to standard deviations, and results for school-centered weights are a bit worse. The least satisfactory results are associated with the job satisfaction scales ATBGTJS and BTBGTJS.
To check more thoroughly on the issue of standard errors, it is helpful to examine cumulative distribution functions of JRR ratios of standard errors to weighted standard deviation (scaled JRR standard errors). Figure 7 provides an example for school-centered weighted means (complementary figures for other grades, subjects and weighting scenarios can be found in the Appendix, see Figs. 15-21), with the ratio of standard error to standard deviation on the x-axis for each panel and the cumulative distribution function on the y-axis for each panel. Clearly results are rather variable for different educational systems. As evident from the vertical line at 0.08, it is certainly not uncommon for ratios to be less than 0.08; however, occasionally ratios are about 0.3, pointing to very imprecise estimates. A basic issue is the existence of enough responses, depending not only on sample size but also on participation. For example, the value of 0.332 for England involves only 86 responses, due to low participation rates at both school and teacher level. On the other hand, the issue is a bit more complicated. For example, for item BTBGTJS in the United States, 426 responses are present but the ratio is 0.240. Some explanation is provided in terms of the effective sample size measure, equal to the ratio of the square of the sum of the weights to the sum of the squares of the weights (Kish, 1965, p. 259). In the case of the United States, the sample size for science teachers in eighth grade is 468, but the effective sample size for school-centered weights is only 32.7, pointing to a very large design effect of about 14. The effective sample size for the United States is so low because some sampled schools have very low probabilities of being sampled and hence very high weights. These very low probabilities reflect very small school sizes. The effect on the weights could not even be offset by the method applied in TIMSS and PIRLS to minimize fluctuations in sampling weights, namely setting uniform selection probabilities when sampling small schools. For example, for eighth grade one sampled school had only one sampled student and another had only two sampled students. This result reflects a decision not to exclude very small schools from the American sample and a decision in TIMSS not to apply methods to reduce unusually high weights, which may be reconsidered in future cycles of TIMSS. The exclusion of small schools is not unusual in other educational systems participating in TIMSS, and standardizing this approach may be an effective measure to avoid large variance in weights for the student sample as well. At the moment, TIMSS allows exclusion of small schools covering up to 2% of the student population. For example, in Gauteng and Western Cape, schools in the sample for eighth grade must have at least 10 students. Another reasonable approach to consider is the application of exclusion rules for teacher analysis that are not applied for student analysis, given the much smaller number of teachers in an educational system. Overall, according to the considered scenarios, teacher-centered analysis seems to be possible with fairly reasonable precision using the MAIS approach, although some limits exist for specific variables and educational systems. In any case, the results suggest that analysis of teachers in any educational system participating in TIMSS generally cannot effectively examine subgroups given the number of teachers sampled.
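The effective sample size measure cited from Kish (1965) is easy to compute from a vector of weights. The following Python sketch is illustrative; the function names and the toy weights are hypothetical, and only the figures 468 and 32.7 are taken from the text.

```python
from typing import Sequence


def effective_sample_size(weights: Sequence[float]) -> float:
    """Square of the sum of the weights divided by the sum of the squared weights."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def weighting_design_effect(n: float, n_eff: float) -> float:
    """Ratio of the nominal sample size to the effective sample size."""
    return n / n_eff


# Equal weights leave the sample size unchanged.
n_eff_equal = effective_sample_size([2.0] * 10)

# A single very large weight among nine small ones sharply reduces it.
n_eff_skewed = effective_sample_size([1.0] * 9 + [10.0])

# Figures quoted in the text for eighth-grade science teachers in the United States.
deff_usa = weighting_design_effect(468, 32.7)
```

The skewed example shows how a few very high weights, such as those of very small schools, can reduce an effective sample size far below the number of responding teachers; for the United States figures the ratio is about 14.3.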

Summary, conclusions and recommendations
TIMSS and PIRLS expend significant effort and cost to collect and analyze data for an elaborate explanatory model covering student achievement in the areas of mathematics, science, and reading, and the contexts of learning these subjects. The ability to analyze teacher-level characteristics from proper samples drawn from teacher populations is not included in their study designs, as choices had to be made to keep the costs and complexity levels of these studies manageable. Still, a rich array of data related to teacher characteristics is collected, and scholars wish to use these data to investigate characteristics of teachers. This paper builds on the work by Hooper et al. (2022), extending their introduction of two approaches to derive weights for teacher-centered analysis using TIMSS and PIRLS data by looking into aspects of practical implementation of these approaches.
We began by proposing a definition for teacher target populations, tied to the grades and subjects they teach, in line with the focus of the two large-scale assessments. This definition should help to correctly and comprehensively identify all in-scope teachers within schools sampled for TIMSS and PIRLS, which is a prerequisite for accurate estimation of population characteristics. We then formalized the computation of teacher-centered weights and their use in deriving teacher-centered population estimates, and discussed some related issues and limitations. We highlighted the utility of both student-centered and teacher-centered analysis, depending on the research question to be answered, and disentangled the differences between the two types of weights. Next, we suggested a procedure and form for collecting the data about teachers that are needed to derive teacher-centered weights but are currently unavailable. This step is key if teacher-centered weights are to be derived in future cycles of TIMSS and PIRLS. Alternative forms or procedures may work, and optimal solutions may depend on the particular situation in participating countries. However, we recommend here a standardized procedure that can be applied in all countries, a feature that is important in ILSA to support their dense timelines, high quality standards, and production modes. Collecting this additional information demands slightly more work by school coordinators and a small adjustment in operations, which may be well justified given the possible gain in knowledge.
We also tackle the issue of non-response by proposing a non-response adjustment factor in line with existing approaches in ILSA, as well as mentioning the challenge of multiple selection probabilities when teachers teach in multiple schools, where we refer to solutions applied in other ILSA.
The core part of the paper focuses on studying the level of accuracy that can be expected when estimating teacher population characteristics. We look into sample sizes, as they are a fundamental factor related to precision. Then we use TIMSS 2019 data and simulate likely scenarios regarding the variance in weights. Having identified the MAIS method as the most effective method for TIMSS and PIRLS because it can handle within-school stratification, we continue only with this method. First, the results show that the different weighting scenarios (including using no weights) lead to relatively similar estimates, at least on average; however, differences for specific variables and countries are large enough to warrant the recommendation to use teacher-centered weights rather than student-centered weights for analysis of teacher populations. Second, the results provide evidence for using weights and an algorithm to estimate standard errors that accounts for the complex sampling design, as standard error estimates would otherwise be systematically biased. Third, we find that sample sizes and variance in weights significantly limit estimate precision. The large variation in weights in particular induces large design effects. Hence, while characteristics of whole teacher populations can be estimated with sufficient precision in the majority of countries, we discourage estimating subpopulation features (such as, for example, job satisfaction of male teachers), and we strongly recommend that, to avoid unreasonable interpretations, analysts thoroughly check sample sizes and variances in weights of the populations of interest. However, if such research questions are deemed of high interest, national research coordinators should discuss options to adjust the sampling design for their countries. Options that would not jeopardize the core objective of TIMSS and PIRLS (that is, studying students) include increasing the number of schools or classes (and thereby teachers) selected and extending the teacher survey to teachers not sampled by way of student sampling.
The results presented here are of limited reliability as they are based on plausible scenarios rather than real data that permit computation of teacher-centered weights. Therefore, the next step is actual implementation in one or more countries, followed by replicating the analysis presented here with real data, which would allow a critical evaluation of our results.

Fig. 1 TIMSS fourth grade adjusted class listing form

Fig. 2 TIMSS 2023 fourth grade class listing form

Differences in sample sizes can be explained by several factors such as the school and class sample sizes, the number of teachers associated with a class and the non-response rate. Due to the sampling procedures in TIMSS, student sample sizes (which ultimately determine school and class sample sizes) significantly affect the size of the teacher samples, the two being generally positively correlated. For example, England, with 3365 sampled students, has the lowest student sample size in the eighth grade (Martin et al., 2020, Exhibit 9.6) and accordingly a below-average teacher sample size. The opposite is the case for the United Arab Emirates, where the 22,334 participating students constitute by far the largest student sample in the eighth grade (Martin et al., 2020, Exhibit 9.6) and where, with 1036 mathematics and 1180 science teachers, the teacher samples are also the largest.

Fig. 3 Number of participating teachers per school in TIMSS 2019 by education system

Fig. 4 Number of participating teachers per school in TIMSS 2019 (international average)

Fig. 5 Number of participating teachers per school in PIRLS 2016 by education system

Fig. 6 Scaled student-centered means versus class-centered means: eighth grade science

Fig. 7 Scaled JRR standard errors for school-centered means: eighth grade science

Fig. 20 Cumulative distribution function of scaled JRR standard errors for class-centered means: grade 8 mathematics

Table 2 Number of TIMSS 2019 educational systems by teacher sample size (categorized)

Table 3 TIMSS 2019: summary of teacher sample sizes

Table 4 Items used in comparisons of teacher weights

Table 5 Unweighted and weighted means for grade 4 (average across countries)

Table 6 Unweighted and weighted means for grade 8 (average across countries)

Table 7 Unweighted and weighted standard deviations for grade 4 (average across countries)

Table 8 Unweighted and weighted standard deviations for grade 8 (average across countries)

Table 9 Design effects for unweighted and weighted means: grade 4 (average across countries)

Table 10 Design effects for unweighted and weighted means: grade 8 (average across countries)

Table 11 Ratio of SRS standard errors to standard deviation for grade 4 (average across countries)

Table 12 Ratio of JRR standard errors to standard deviation for grade 8 (average across countries)

Table 14 Notations
Number of schools attended by the students in the target population
Number of students sampled who might have participated. (It is possible due to class changes that $n_{ij2}$ and $n_{ij1}$ differ.)
Measurement of student $k$ from class $j$ of school $i$
$K_{ijk}$: Number of teachers for a subject (mathematics or science) of student $k$ in a participating selected class $j$ from participating selected school $i$
$W_{it}$: Student-centered teacher weight for teacher $t$ in target schools $i$
$S_{it}$: The sum of the fractions $1/K_{ijk}$ for all students $k$ in a class $j$ of school $i$ who are taught by teacher $t$
$W_{itT}$: Final teacher sampling weight according to the HT approach
$F_{itT} = 1/T_{it}$: Teacher component of the final sampling weight
E U: Teacher-based average of $U_{it}$ for teachers $t$ in target schools $i$

Table 14 (continued)
Total of the $U_{it}$ for the $D_+$ teachers $t$ in target schools $i$
$U$: Real variable defined for all combinations of schools and teachers of a subject who teach students in the target population
$U_{i+}$: Sum of $U_{it}$ for all teachers in school $i$
Sum of the $W_{itT} U_{it}$ for sampled teachers $t$ in sampled schools $i$
MAIS approach (*-notation refers to the MAIS approach)
$W_{itT}^*$: Final teacher sampling weight according to the MAIS approach
$F_{itT}^*$: Teacher component of the final teacher sampling weight according to the MAIS approach
Sampling both of the distinct schools $i$ and $j$ in stratum $h$
$\zeta_{hij}$: $p_{hi} p_{hj} - p_{hij}$
$v_{hi} = p_{hi}(1 - p_{hi})$: Variance associated with the probability $p_{hi}$
Number of classes teacher $t$ teaches in the class stratum that includes class $j$
$G_{hitTc}$: Sum of the class weights $G_{hij2}$ for all sampled classes $j$ in school $i$ associated with teacher $t$
$\gamma(A, B)$: Covariance of $A$ and $B$