NEPSscaling: plausible value estimation for competence tests administered in the German National Educational Panel Study

Educational large-scale assessments (LSAs) often provide plausible values for the administered competence tests to facilitate the estimation of population effects. This requires the specification of a background model that is appropriate for the specific research question. Because the German National Educational Panel Study (NEPS) is an ongoing longitudinal LSA, the range of potential research questions and, thus, the number of potential background variables for the plausible value estimation grow with each new assessment wave. To facilitate the estimation of plausible values for data users of the NEPS, the R package NEPSscaling allows their estimation following the scaling standards in the NEPS without requiring in-depth psychometric expertise in item response theory. The package requires the user to prepare the data for the background model only. Then, the appropriate item response model including the linking approach adopted for the NEPS is selected automatically, while a nested multiple imputation scheme based on the chained equation approach handles missing values in the background data. For novice users, a graphical interface is provided that only requires minimal knowledge of the R language. Thus, NEPSscaling can be used to estimate cross-sectional and longitudinally linked plausible values for all major competence assessments in the NEPS.


Introduction
Scharl and Zink Large-scale Assessments in Education (2022) 10:28
For decades, educational large-scale assessments (LSAs) have provided insights into educational systems around the globe (e.g., PISA, NAEP, TIMSS, and PIRLS). Usually, these LSAs are cross-sectional and study specific age cohorts (e.g., 15-year-olds in the case of PISA; Weis & Reiss, 2019). Although repeated assessment cycles allow longitudinal comparisons on the country level, LSAs providing access to within-person change trajectories are rare. In contrast, the National Educational Panel Study (NEPS) is a longitudinal, multi-cohort study representative of the German population. The NEPS follows newborns to pensioners through repeated assessments across their life courses (Blossfeld & von Maurice, 2011). Currently, the NEPS includes four child cohorts of newborns (starting cohort 1), kindergartners (starting cohort 2), fifth graders (starting cohort 3), and ninth graders (starting cohort 4) as well as two grown-up cohorts of university students (starting cohort 5) and adults between 30 and 70 years old (starting cohort 6). A major focus of the NEPS is the coherent measurement of domain-specific competencies such as reading, math, or sciences across all cohorts to study prerequisites and outcomes of education in Germany. In LSAs, competencies are typically analyzed as plausible values (PVs) (Wu, 2005). PVs allow for unbiased estimation of population parameters because they take the uncertainty in the estimation of the latent competences into account by providing multiple estimates that represent the likely distribution of the competences (Lechner et al., 2021). It is important to note that PVs are a special case of multiple imputation, and therefore, the assumptions behind the multiple imputation approach need to be met (Rubin, 1987). Precise PV estimation requires the specification of a background model that is appropriate for the research question at hand.
To this end, all variables (and their interactions) used in later analyses need to be included (Bondarenko & Raghunathan, 2016; Meng, 1994). Because data providers of LSAs cannot anticipate how users will analyze their data, typically all available information collected in an LSA is incorporated into the PV estimation to achieve said unbiasedness (e.g., OECD, 2017). A challenge of longitudinal LSAs such as the NEPS is their growing data base. Each new assessment wave needs to be incorporated into earlier PV estimation to accommodate PVs as independent variables in all possible statistical models (cf. congenial models; Meng, 1994). Paired with the need for completely observed background data, the estimation of ready-to-use PVs in scientific use files (SUFs) quickly becomes impractical. Therefore, we introduce the R package NEPSscaling that offers versatile functionalities to estimate PVs for cross-sectional and longitudinally linked competence assessments in each cohort of the NEPS. Although plausible values can also be generated with other available R packages such as TAM (Robitzsch et al., 2021), mirt (Chalmers, 2012), or brms (Bürkner, 2017) as well as with standalone software such as Mplus (Muthén & Muthén, 1998), what sets NEPSscaling apart from these software packages is its scope and simplicity. It is designed to specifically suit the needs of the NEPS and its data users. Therefore, the package is also aimed at researchers with little expertise in psychometric modeling and novice users of R. It only requires the preparation of custom background data that fits the research question, which can be done with any statistical software as long as the data is exported in an R readable way (e.g., CSV, SPSS, Stata, SAS formats). Then, the package automatically handles missing values in the background data using classification and regression trees (CART) in a nested imputation scheme (Burgette & Reiter, 2010; Weirich et al., 2014).
It estimates the appropriate item response models following the scaling standards in the NEPS (Pohl & Carstensen, 2013) to generate PVs suiting the intended analyses. Finally, the generated data can be exported in different formats for various statistical software such as SPSS, Stata, or Mplus. Furthermore, a graphical user interface is provided for novice R users.
In the following, we will briefly outline the statistical background of PVs and then describe the basic functionality of NEPSscaling. The use of the package is demonstrated in two examples that show how to estimate PVs using either R syntax or the graphical user interface.

Plausible values
Following the NEPS scaling standards (Pohl & Carstensen, 2013), most competence tests are scaled using the partial credit model (PCM; Masters, 1982) for polytomous items, which models the probability of observing response $Y_{ij}$ for person $i$ on item $j$ as
$$P(Y_{ij} = k \mid \theta_i, \boldsymbol{\delta}_j) = \frac{\exp\left(\sum_{l=0}^{k} (\theta_i - \delta_{jl})\right)}{\sum_{h=0}^{K_j} \exp\left(\sum_{l=0}^{h} (\theta_i - \delta_{jl})\right)} \tag{1}$$
where $\theta_i$ denotes the latent ability of person $i$ and $\delta_{jk}$ the threshold for endorsing category $k \in \{0, \dots, K_j\}$ of item $j$, with the term for $l = 0$ fixed to zero by convention. It simplifies to the Rasch model (Rasch, 1960) in case of binary items. For rotated test designs that administered a given test at different positions for the sample, a multi-facet model (Linacre, 1989) based on the PCM or Rasch model is used to correct for the test rotation. Moreover, longitudinal assessments are linked across measurement waves using mean/mean linking (Fischer et al., 2019), which shifts the latent scale to be anchored at the first measurement's mean location. The PV technique extends IRT models by a latent regression of the person parameters on background variables. This allows the population-level latent distribution of person abilities to be approximated more accurately. The latent regression of $\theta_i$ can be seen as prior information on the person parameters and leads to the formulation of the posterior ability distribution of person $i$ as
$$p(\theta_i \mid \mathbf{y}_i, \mathbf{x}_i) \propto p(\mathbf{y}_i \mid \theta_i)\, p(\theta_i \mid \mathbf{x}_i) \tag{2}$$
where $p(\mathbf{y}_i \mid \theta_i)$ denotes the likelihood of the data, given by the IRT model, and $p(\theta_i \mid \mathbf{x}_i)$ denotes the prior distribution of the latent ability, given by the latent regression on a set of background variables $\mathbf{x}_i$ for person $i$
$$\theta_i = \beta_0 + \mathbf{x}_i^{T} \boldsymbol{\beta}_L + \varepsilon_i \tag{3}$$
with $\boldsymbol{\beta}_L = (\beta_1, \dots, \beta_L)^{T}$ denoting the regression weights for $L$ covariates, the intercept $\beta_0$, and $\varepsilon_i$ representing the normally distributed residual. The latent regression should contain all relevant variables and variable configurations such as interactions or nonlinear terms that are part of the planned analyses (Bondarenko & Raghunathan, 2016; Meng, 1994).
Similarly, it may be sensible to add further variables to improve the imputation of the background data (Collins et al., 2001), especially if it is not predictable which variable relations are of interest in later analyses.
This also highlights that PVs are a special case of multiple imputation of completely missing variables. Therefore, analyses with PVs have to be conducted separately for each single PV and then combined following Rubin's rules (Lechner et al., 2021; Rubin, 1987). Researchers need to check whether the assumption of normally distributed parameter estimates holds before pooling the datasets using Rubin's rules.
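The interplay of the item response model and the latent regression described above can be illustrated with a small base R sketch. It computes PCM category probabilities for one person and approximates the posterior ability distribution on a grid, from which a few PVs are drawn. All numbers (thresholds, regression weights, residual standard deviation) are made up for illustration, and the grid approximation is only a didactic device; it is not the estimation routine used by NEPSscaling.

```r
# Category probabilities of the partial credit model for one item.
# theta: ability of one person; delta: thresholds for categories 1..K
# (the term for category 0 is fixed to zero by convention).
pcm_probs <- function(theta, delta) {
  numer <- c(0, cumsum(theta - delta))   # cumulative sums for categories 0..K
  numer <- exp(numer - max(numer))       # subtract the maximum for stability
  numer / sum(numer)
}

# Binary item: the PCM reduces to the Rasch model
p_binary <- pcm_probs(theta = 0.5, delta = 0.2)
p_rasch <- plogis(0.5 - 0.2)

# Polytomous item with three thresholds (four categories)
p_poly <- pcm_probs(theta = 0.5, delta = c(-1.0, 0.0, 1.2))

# Posterior of theta for one person: likelihood x prior from the latent regression
set.seed(1)
responses <- c(1, 2)                      # observed categories on two items
deltas <- list(0.2, c(-1.0, 0.0, 1.2))    # made-up item thresholds
x_i <- c(1, 0.3)                          # made-up background variables
beta0 <- -0.1; beta <- c(0.4, 0.8); sigma <- 0.9  # made-up regression parameters

grid <- seq(-6, 6, by = 0.01)
lik <- vapply(grid, function(t) {
  prod(mapply(function(y, d) pcm_probs(t, d)[y + 1], responses, deltas))
}, numeric(1))
prior <- dnorm(grid, mean = beta0 + sum(x_i * beta), sd = sigma)
post <- lik * prior / sum(lik * prior)    # normalized over the grid

# Plausible values are random draws from this posterior
pvs <- sample(grid, size = 5, replace = TRUE, prob = post)
print(round(p_poly, 3))
print(pvs)
```

Because each PV is a random draw rather than a point estimate, repeated draws for the same person differ, which is what carries the measurement uncertainty into later analyses.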

Classification and regression trees
The missing data strategy in LSAs for plausible value estimation typically encompasses re-defining missing values due to item nonresponse as an additional dummy variable during the recoding of the background data. This cannot be seen as effectively handling missing data (Lüdtke et al., 2017; Schafer & Graham, 2002), which is why we adopted nested multiple imputation (Weirich et al., 2014). This strategy first repeatedly imputes the background data and then estimates the desired number of PVs for each imputed data set. Additionally, it can consider dependencies between the ability and the background variables if an ability indicator like the weighted likelihood estimate (Warm, 1989) is used in the imputation model. Please note that this strategy does not apply to unit nonresponse, that is, if a case is completely unobserved in an assessment wave. These cases are handled by listwise deletion.
The background model is imputed using a CART algorithm within the multiple imputation via chained equations (MICE) framework (Burgette & Reiter, 2010; Doove et al., 2014). The algorithm predicts a missing value on one variable by a set of predictor variables. Starting with all non-missing values of the outcome variable, the algorithm recursively splits the nodes into binary partitions until a purity criterion is met, that is, the values left in the leaf nodes of the tree are homogeneous enough (Burgette & Reiter, 2010). If the outcome variable is metric, a regression tree is constructed. It differs from a classification tree for categorical outcomes in its purity criterion and in the way a final value is chosen from the respective leaf nodes. A notable advantage of the non-parametric CART as compared to other parametric imputation approaches, like predictive mean matching (Little, 1988) or fully conditional specification (Raghunathan et al., 2003; van Buuren, 2006; van Buuren et al., 2012), is that the splitting of child nodes automatically implies non-linear relationships as well as interactions in the data without having to explicitly model them (Burgette & Reiter, 2010). Further, monotone transformations of the independent variables do not affect the trees produced by CART (Breiman et al., 2017).
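The core idea of CART-based imputation can be sketched for a single metric variable as follows. The sketch assumes that the rpart package is installed (it ships with standard R distributions) and uses simulated data with hypothetical variable names. It performs only one imputation step for one variable; a chained equation scheme would cycle through all incomplete variables and repeat the procedure several times, and the sketch is not the code used inside NEPSscaling.

```r
library(rpart)

set.seed(42)
n <- 300
dat <- data.frame(
  age = round(runif(n, 30, 70)),
  female = factor(rbinom(n, 1, 0.5), labels = c("no", "yes"))
)
# Outcome with a non-linear age effect that only holds for one group
dat$income <- 2000 + 15 * (dat$age - 50)^2 * (dat$female == "yes") +
  rnorm(n, sd = 200)
dat$income[sample(n, 60)] <- NA           # 20 percent missing values

impute_cart <- function(data, target, minbucket = 5, cp = 1e-4) {
  miss <- is.na(data[[target]])
  obs <- data[!miss, , drop = FALSE]
  # Grow a regression tree on the observed cases
  fit <- rpart(reformulate(setdiff(names(data), target), response = target),
               data = obs, method = "anova",
               control = rpart.control(minbucket = minbucket, cp = cp))
  # Leaf node number of every observed case
  node_ids <- as.numeric(row.names(fit$frame))
  leaf_obs <- node_ids[fit$where]
  # Let predict() return leaf node numbers instead of leaf means
  fit$frame$yval <- node_ids
  leaf_mis <- predict(fit, newdata = data[miss, , drop = FALSE])
  # Draw one observed donor value from the leaf of each incomplete case
  y_obs <- obs[[target]]
  data[[target]][miss] <- vapply(leaf_mis, function(leaf) {
    donors <- y_obs[leaf_obs == leaf]
    donors[sample.int(length(donors), 1)]
  }, numeric(1))
  data
}

imputed <- impute_cart(dat, "income")
summary(imputed$income)
```

Note that neither the quadratic age effect nor the interaction with the grouping variable had to be specified; the tree picks them up through its splits, and imputed values are always values that were actually observed in the same leaf.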

About NEPSscaling
NEPSscaling is an R package containing functions to facilitate the estimation of PVs for competence domains measured in the NEPS while handling missing values in the background model. Other functions allow the inspection of the specific CARTs used for imputation, accessing parts of the resulting NEPSscaling R object, and exporting PVs for different statistical software. The user can also call information about the implemented competence tests and assessment waves, as well as differences between the package and the pre-calculated competence measures published in the NEPS scientific-use files (SUFs). The SUFs include the prepared survey and test data in a factually anonymized form. To download the SUFs for research purposes, only a data use agreement needs to be signed.

Availability
The package NEPSscaling is not available from CRAN, but is provided by the NEPS research data center at https://www.neps-data.de/Data-Center/Overview-and-Assistance/Plausible-Values. The package is free to download; previous package versions and example code showing how to estimate PVs in different cohorts are also available there. To install the package, the following command can be used:
    install.packages("NEPSscaling", repos = c("http://nocrypt.neps-data.de/r", "https://cran.r-project.org"))

Basic functions
In the following the most important functions are described in the order in which they would occur during a typical use of the package. The functions currently_implemented() and deviations_of_package_from_suf() require no arguments and give an overview of the current state of NEPSscaling. The former shows which competence tests are available for which starting cohort by domain and wave, while the latter reports known deviations in comparison to the point estimates of the latent competences (WLEs) provided in the NEPS SUFs.
The main function for generating PVs is plausible_values() which loads the raw data from the scientific use files, imputes missing values in the background data, creates the appropriate scaling model for the chosen competence test, and estimates either cross-sectional or longitudinally linked PVs.
The function expects several arguments specifying the PV estimation; most of them are optional:
• SC (required): The starting cohort is given by specifying its integer equivalent (e.g., the adult cohort is listed as starting cohort 6).
• domain (required): The chosen competence domain is indicated by the two or three letter abbreviations summarized in Fuß et al. (2021). Because not all competence domains have been assessed in each cohort, users have to specify the correct combination of SC and domain as indicated by currently_implemented().
• wave (required): The assessment wave is given by an integer value as summarized in Fuß et al. (2021). For example, the tests of the starting cohort 6 (adults) took place in the waves 3, 5, and 9.
• path (required): Because the function automatically loads the relevant data from the scientific use files, the path to the data on the hard drive needs to be specified as a string (e.g., "C:/Users/name/NEPS_data/" on a Windows machine).
• bgdata (optional): The background data needs to be provided as a data.frame containing the person identifier ID_t. If no background data is provided, PVs without a background model are estimated. Note that the package automatically includes the number of not-reached missing values as a proxy for processing times and, if the assessment took place in the school context, the mean competence per school as a proxy for the multi-level sampling design; this default setting can be changed using the arguments include_nr and approximate_school_context.
• exclude (optional): Variables of the background data that are to be excluded from the background model. In the cross-sectional case, the argument is a character vector (e.g., c("var1", "var3")), while it must be a named list (e.g., exclude = list(w1 = "var1", w3 = c("var1", "var3"))) in the longitudinal case specifying the excluded variables for each wave.
• seed (optional): For reproducibility, a specific seed can be set for the random number generators.
After the estimation of PVs, the functions print(x) and summary(object) give a quick overview of the specified model and estimated model parameters. The only required argument is the R object resulting from using plausible_values(). To facilitate the exploration of the resulting R object, the package also contains a number of extraction functions such as get_domain(pv_obj), get_info_criteria(pv_obj), get_pv_list(pv_obj), or get_pv_index(pv_obj, index). Moreover, the CART imputation can be visualized with display_imputation_tree(pv_obj, imputation, variable), which generates a plot displaying the specific tree constructed to impute a single variable. If the graphical representation becomes too complex, a character representation of the tree can be inspected using get_imputation_tree(pv_obj, imputation, variable).
The package also provides means to easily export the estimated PVs together with their imputed background data in case analyses with the PVs are to be conducted using different software. The write_pv() function takes the arguments pv_obj (the resulting R object), path (the location where the data is to be stored), and ext (a string indicating the storage format, i.e., SPSS, Stata, or Mplus).
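A call that combines the arguments listed above might look like the following sketch. The folder "./SC6/", the file "bgdata.rds", and the variable names "var1" and "var3" are placeholders taken from the examples in this article or made up for illustration; the call is only executed if the package, the SUF folder, and the background data file are actually available.

```r
sc_path <- "./SC6/"       # placeholder: folder holding the SUF competence data
bg_file <- "bgdata.rds"   # placeholder: prepared background data containing ID_t

if (requireNamespace("NEPSscaling", quietly = TRUE) &&
    dir.exists(sc_path) && file.exists(bg_file)) {
  bgdata <- readRDS(bg_file)
  pv_obj <- NEPSscaling::plausible_values(
    SC = 6,                       # adult starting cohort
    domain = "RE",                # reading competence
    wave = 3,
    path = sc_path,
    bgdata = bgdata,
    exclude = c("var1", "var3"),  # cross-sectional case: character vector
    seed = 1234                   # arbitrary seed for reproducibility
  )
  print(pv_obj)
}

# Longitudinal case: exclusions are given per wave as a named list
exclude_long <- list(w1 = "var1", w3 = c("var1", "var3"))
str(exclude_long)
```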

Typical workflow
Estimating PVs for competence tests in the NEPS typically follows several consecutive steps that depend on two data sources. First, the SUFs including the raw competence test data need to be obtained from https://neps-data.de. Data access requires a free, noncommercial data use agreement with the NEPS research data center. This data is necessary to estimate the IRT part of the plausible values model and needs to be stored in a way that it is accessible to the current R session. Ideally, all raw data files can be found in the same folder. Second, the carefully selected background variables for the PV estimation need to be prepared by the user to ensure congeniality with the intended analyses. This data preparation can be done using any statistical software and the user is advised to check the data for plausibility before starting any further analyses. The only requirement for using NEPSscaling is that the resulting background data is stored in tabular format either in SPSS's sav format, Stata's dta format, or R's rds format (when using the graphical user interface) or that it is imported into the current R session as a data.frame (when invoking the estimation using the R script). Special attention needs to be paid to missing data in the background data as they must be coded as R's NA values, and categorical variables have to be converted to R's factors. Further, the selected background variables should either be assessed at the same time point for which the PVs will be estimated or include time constant information to avoid inconsistencies. After the preparation of the background data, PVs can be estimated via an R script and the functions outlined above or via the graphical user interface provided by the NEPSshiny app.
Last, each NEPSscaling version depends on a specific version of the SUFs because the package's functions address the competence variables in the SUFs by name. Therefore, if variable names are changed in the SUFs, they are changed accordingly in the newest version of the package. As a consequence, newer SUF versions are not compatible with older package versions and vice versa. Thus, it is recommended to always use the latest versions of both SUF and package to ensure a match. Further, it is advised to state package versions explicitly to ensure reproducibility.
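The two formal requirements on the background data (missing values coded as NA, categorical variables stored as factors) can be met with a few lines of base R. The following sketch uses a made-up data set; the variable names other than ID_t, the value labels, and the assumption that missing values are marked by negative codes are illustrative and need to be checked against the documentation of the actual data.

```r
# Hypothetical raw export; negative codes are assumed to mark missing values
raw <- data.frame(
  ID_t = c(8000001, 8000002, 8000003, 8000004),
  age = c(34, -97, 51, 62),
  gender = c(1, 2, -98, 2),
  books = c(3, 5, 2, -54)
)

bgdata <- raw
vars <- setdiff(names(bgdata), "ID_t")    # never recode the person identifier

# Recode all negative codes to NA
bgdata[vars] <- lapply(bgdata[vars], function(x) {
  replace(x, !is.na(x) & x < 0, NA)
})

# Convert categorical variables to factors (labels are made up)
bgdata$gender <- factor(bgdata$gender, levels = c(1, 2),
                        labels = c("male", "female"))
bgdata$books <- factor(bgdata$books, ordered = TRUE)

str(bgdata)
# saveRDS(bgdata, "bgdata.rds")  # e.g., for import into the graphical interface
```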
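One simple way to document the software versions used for an analysis is sketched below. The helper report_versions() is a hypothetical convenience function, not part of NEPSscaling; packages that are not installed are reported as NA. The version of the SUF itself has to be noted separately, for example from the name of the downloaded files.

```r
# Hypothetical helper: version string of each package, NA if not installed
report_versions <- function(pkgs) {
  vapply(pkgs, function(p) {
    if (requireNamespace(p, quietly = TRUE)) {
      as.character(utils::packageVersion(p))
    } else {
      NA_character_
    }
  }, character(1))
}

versions <- c(R = as.character(getRversion()),
              report_versions(c("NEPSscaling", "TAM")))
print(versions)
```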

Surrounding NEPSscaling
The estimation of PVs can and should be conducted with several things in mind: First, the imputation model may unduly impact the results and thus, conducting sensitivity analyses for different imputation models or evaluating the efficiency gains obtained through using PVs is advisable. Second, there are particularities in working with plausible values. For example, it is necessary to conduct any further analyses (e.g., regression analysis) which use PVs separately for each set of PVs. The results of these analyses then need to be pooled using Rubin's rules or any other appropriate pooling procedure (Raghunathan et al., 2003). For further information on how to correctly work with PVs, see von Davier et al. (2009); for further information on the estimation of PVs as well as an example of pooling the results, see Scharl et al. (2020).
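The "analyze separately, then pool" logic can be sketched in base R as follows. The data are simulated stand-ins for plausible values (true ability plus noise) and do not come from NEPSscaling; the pooling step implements Rubin's rules for a single regression coefficient.

```r
set.seed(123)
n <- 500   # persons
m <- 10    # number of plausible values per person

x <- rnorm(n)                                   # a background variable
theta <- 0.4 * x + rnorm(n, sd = 0.9)           # latent competence (unknown in practice)
pvs <- replicate(m, theta + rnorm(n, sd = 0.5)) # simulated stand-ins for PVs

# 1) Run the same analysis separately for each plausible value
fits <- lapply(seq_len(m), function(k) {
  summary(lm(pvs[, k] ~ x))$coefficients["x", ]
})
est <- vapply(fits, function(f) f[["Estimate"]], numeric(1))
se <- vapply(fits, function(f) f[["Std. Error"]], numeric(1))

# 2) Pool the results with Rubin's rules
q_bar <- mean(est)                  # pooled point estimate
u_bar <- mean(se^2)                 # within-imputation variance
b <- var(est)                       # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b    # total variance
df <- (m - 1) * (1 + u_bar / ((1 + 1 / m) * b))^2
ci <- q_bar + c(-1, 1) * qt(0.975, df) * sqrt(t_var)

round(c(estimate = q_bar, se = sqrt(t_var), lower = ci[1], upper = ci[2]), 3)
```

Averaging the PVs per person first and running a single analysis would ignore the between-imputation variance and therefore understate the standard error.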

Applications
In the following, two example applications are presented that use simulated data sets included in the package. The data was modeled to closely resemble the adult starting cohort (SC 6) and the 5th grader starting cohort (SC 3). The first example will be presented using a classic R script, whereas the second example uses the NEPSscaling Shiny app. The aim of the presented applications is to demonstrate basic analyses. Real and more complex user examples as well as simulated examples for background data can be found at the download site of NEPSscaling.
The input data in both example applications is dictated by the NEPS SUF format. The SUFs are available as SPSS or Stata tables. NEPSscaling uses the competence data as it was downloaded. The background data, on the other hand, needs to be prepared by the user as described in the section "Typical workflow" and should contain the set of analysis variables as well as optional further variables that would improve the imputation of missing background values or the estimation of PVs. NEPSscaling internally selects only those subjects in the background data set who have contributed at least the minimum number of valid responses in the competence test of interest.
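The selection of subjects by their number of valid responses can be pictured with the following base R sketch. The data, the variable names other than ID_t, and the threshold of three valid responses are made up for illustration; the package performs this step internally, so users do not have to filter the background data themselves.

```r
# Hypothetical scored responses (rows = persons); NA = no valid response
resp <- data.frame(
  ID_t = 1:5,
  item1 = c(1, NA, 0, NA, 1),
  item2 = c(0, NA, 1, 1, NA),
  item3 = c(1, 1, NA, 0, NA),
  item4 = c(NA, NA, 1, 1, NA)
)
bgdata <- data.frame(ID_t = 1:6, age = c(34, 51, 47, 62, 39, 58))

min_valid <- 3                                  # hypothetical threshold
n_valid <- rowSums(!is.na(resp[, -1]))          # valid responses per person
keep_ids <- resp$ID_t[n_valid >= min_valid]

# Only persons with enough valid responses enter the PV estimation
bg_used <- bgdata[bgdata$ID_t %in% keep_ids, ]
print(bg_used)
```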

Application 1: Cross-sectional reading competence in the adult starting cohort
Estimating plausible values using an R script is straightforward. After preparing the background data in any statistical program and storing it in one of the supported file formats, three steps remain until PVs are ready for further processing. The first step consists of installing NEPSscaling, loading the package, and setting the working directory for importing the prepared background data into R. Then, PVs can be estimated. It is important to specify the correct path to the competence data. Here, the competence data is stored in a folder called SC6. However, the name of the folder can be chosen freely as long as the path is specified correctly.
    library(NEPSscaling)
    setwd()  # insert the path to the folder that contains the prepared background data
    bgdata <- readRDS("bgdata.rds")
    pv_obj <- plausible_values(SC = 6, domain = "RE", wave = 3, path = "./SC6/", bgdata = bgdata)
    summary(pv_obj)
The resulting summary contains the basic parameters of the estimated model, mean, variance and reliability estimates of the PVs, the fixed item difficulties, and the estimated latent regression weights. The latter give insights into the influence of the chosen background variables on the estimation of the PVs, but they cannot be used to answer the intended research questions and are not meant to be used in further analyses. In a final step, the plausible values and the imputed background data can be exported for further analysis (here: SPSS file format).
    write_pv(pv_obj, path = "/SC6", ext = "SPSS")

Application 2: Longitudinal math competence in the 5th grader starting cohort
Using the Shiny app is less concise, but also more intuitive if there is little to no prior experience with R. The functions corresponding to application 1 are illustrated below; additional functionality is shown in the online supplemental material accompanying this paper. After the package has been installed, NEPSshiny can be launched by invoking the following R code in an R session started, for example, by RStudio.

    NEPSshiny(launch.browser = TRUE)
The start screen of the app can be seen in Fig. 1. It allows the import and export of the underlying background data and previously estimated plausible values objects. It can be reached at any time by pressing the NEPSscaling logo in the upper left corner. To estimate a new set of PVs, the first step is to import the background data. Tabular data in R, SPSS and Stata file formats of up to 30 MB size can be imported. The data selection works by browsing the file system. The button "Remove background data" removes the currently available object from the Shiny app's working environment. The inspection of background data is covered in the supplemental material (Additional file 1: Figs. S1-S3). After uploading the background data, the scale level of the data needs to be set (see Fig. 2) because categorical data is processed differently than metric variables in the imputation and estimation steps of NEPSscaling. The differentiation of ordinal and nominal variables becomes important for the aggregation of the imputed background data.
Next, we enter the "Estimate Plausible Values" tab (see Fig. 3). The application example is concerned with estimating PVs for the 5th grader cohort, SC 3. The goal is to obtain mathematics PVs for longitudinal analyses. Figure 3 shows how the SC, competence domain and assessment wave have already been set. Please note that the assessment wave can be any of the waves for the SC and domain combination in longitudinal estimation. Furthermore, the path to the competence data, set to the current working directory by default, has also been changed to the current location of the SC 3 SUFs.
In this configuration, ten cross-sectional plausible values for wave 1 are estimated. To switch to longitudinal estimation, the button at the top of the expanded "Customize model parameters" field as seen in Fig. 4, subfigure 1, needs to be checked. This leads to the further expansion of the field seen in subfigure 2 of Fig. 4. It is now also possible to exclude variables of the background data from the estimation of plausible values for specific assessment waves.
If all parameters are set to the intended model, the "Start estimation" button (see Fig. 3) can be pressed and the PVs are estimated. A summary of the current plausible value object can be inspected in the "Manage" tab (corresponding to the print() statement; see Fig. 5) and in the "Tables" tab, where the item parameters (subfigure 1 of Fig. 6) and the estimated regression weights (subfigure 2 of Fig. 6) are displayed. Further visual inspection of the object is possible and shown in the supplemental material.
Fig. 6 Summary tables of (1) item and (2) regression parameters

Summary
As can be seen in the application examples above, the main benefits of NEPSscaling lie in its simplicity. With this package, NEPS data users can use PVs for their population level analyses in only a few steps and without worrying whether the unknown background model of the PVs available in scientific use files actually fits their own analyses. Nevertheless, some caveats regarding the package need to be noted. Before starting any analysis, users are required to have substantial knowledge of the data they use. The package NEPSscaling does not release the user from their duty of knowing and understanding the data. The use of custom background data means that this data has to be prepared additionally by the users. However, data has to be prepared for the analyses in any case and the analysis data is identical to the background data of the PVs in most cases. The added amount of time and effort, thus, reduces to considering additional variables for the imputation of missing values and the estimation model. Similarly, the measurement models are restricted to tested scaling models. If a more flexible IRT model is desired, for example the three-parameter logistic IRT model, users will have to resort to other software solutions such as the R package TAM, on which NEPSscaling is based, or mirt or Mplus. Further information on the original scalings of the tests in the NEPS is available in technical reports on the NEPS website. It is important to mention that competence data between different starting cohorts cannot be linked as the estimation of plausible values is only possible within a specific starting cohort. Furthermore, although the package is not available via CRAN, it can be downloaded from the NEPS RDC's website without any further requirements or restrictions.
The package will be updated after each new release of competence data in the SUFs so that the users can use PVs for NEPS competence assessments as soon as possible after the SUF release. NEPSscaling was specifically designed to conduct analyses with NEPS data; therefore, users need access to NEPS data. Researchers are required to sign a data use agreement with the NEPS Data Center for data access.
In conclusion, NEPSscaling provides PVs for all scalable competence measurements in the NEPS with the additional benefit of automatically implementing an imputation scheme for the background data. Because of the non-parametric nature of the CART algorithm, it only requires the selection of the correct variables for the imputation model, but not its full specification. Non-linear relationships in the data are implicitly considered in the imputation with CART. Furthermore, NEPSscaling makes estimating PVs easier than non-study-specific packages like mirt or TAM since it does not require the specification and testing of a scaling model by the user. The quality of the estimation is checked and tested by the maintainers specifically for each model. The graphical user interface also allows easy use by researchers not proficient in the statistical programming language R.