NEPSscaling is an R package containing functions to facilitate the estimation of PVs for competence domains measured in the NEPS while handling missing values in the background model. Other functions allow inspecting the specific CARTs used for imputation, accessing parts of the resulting NEPSscaling R object, and exporting PVs to different statistical software. The user can also retrieve information about the implemented competence tests and assessment waves, as well as about differences between the package and the pre-calculated competence measures published in the NEPS scientific use files (SUFs). The SUFs contain the prepared survey and test data in a factually anonymized form; to download them for research purposes, only a data use agreement needs to be signed.
Availability
The package NEPSscaling is not available from CRAN but is provided by the NEPS research data center at https://www.neps-data.de/Data-Center/Overview-and-Assistance/Plausible-Values. The package is free to download; previous package versions and example code showing how to use the package to estimate PVs in different cohorts are also available there. To install the package, R's standard mechanism for installing local package archives can be used.
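A minimal sketch of such an installation from a downloaded archive (the file name and version number are illustrative assumptions):

```r
# Install NEPSscaling from a local archive (file name/version illustrative);
# repos = NULL prevents R from looking the package up on CRAN.
install.packages("NEPSscaling_3.1.0.tar.gz", repos = NULL, type = "source")
```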
Basic functions
In the following, the most important functions are described in the order in which they would occur during a typical use of the package.
The functions currently_implemented() and deviations_of_package_from_suf() require no arguments and give an overview of the current state of NEPSscaling. The former shows which competence tests are available for which starting cohort by domain and wave, while the latter reports known deviations in comparison to the point estimates of the latent competences (WLEs) provided in the NEPS SUFs.
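Both overview functions can simply be called without arguments:

```r
library(NEPSscaling)

currently_implemented()            # available cohort x domain x wave combinations
deviations_of_package_from_suf()   # known deviations from the WLEs in the SUFs
```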
The main function for generating PVs is plausible_values() which loads the raw data from the scientific use files, imputes missing values in the background data, creates the appropriate scaling model for the chosen competence test, and estimates either cross-sectional or longitudinally linked PVs.
The function expects several arguments specifying the PV estimation; most of them are optional:
- SC (required): The starting cohort is given by its integer equivalent (e.g., the adult cohort is starting cohort 6).
- domain (required): The chosen competence domain is indicated by the two- or three-letter abbreviations summarized in Fuß et al. (2021). Because not all competence domains have been assessed in each cohort, users have to specify a valid combination of SC and domain as indicated by currently_implemented().
- wave (required): The assessment wave is given by an integer value as summarized in Fuß et al. (2021). For example, the tests of starting cohort 6 (adults) took place in waves 3, 5, and 9.
- path (required): Because the function automatically loads the relevant data from the scientific use files, the path to the data on the hard drive needs to be specified as a string (e.g., "C:/Users/name/NEPS_data/" on a Windows machine).
- bgdata (optional): The background data needs to be provided as a data.frame containing the person identifier ID_t. If no background data is provided, PVs without a background model are estimated. Note that the package automatically includes the number of not-reached missing values as a proxy for processing times and, if the assessment took place in the school context, the mean competence per school as a proxy for the multi-level sampling design; these defaults can be changed using the arguments include_nr and adjust_school_context explained below.
- npv (optional): The number of randomly drawn PVs can be set explicitly and defaults to 10. Importantly, only npv PVs are returned even if more sets are estimated.
- nmi (optional): The number of randomly drawn imputed background data sets can be set explicitly and defaults to 10.
- min_valid (optional): PVs are only estimated for respondents who provided a minimum number of valid (i.e., non-missing) responses (default: 3).
- longitudinal (optional): The logical argument indicates whether cross-sectional PVs for the specified wave or longitudinally linked PVs for the specified cohort should be estimated. In the longitudinal case, all available waves of the specified cohort are included in the estimation.
- rotation (optional): The logical argument indicates whether the test rotation design should be taken into account in the cross-sectional case by estimating a multi-faceted model.
- include_nr (optional): The logical argument specifies whether the number of not-reached items (i.e., missing values) should be included in the background model as a proxy for test-taking effort.
- adjust_school_context (optional): The logical argument controls whether the average competence point estimate (WLE) of each school should be calculated and included in the background model to approximate the nested sampling scheme in school assessments.
- exclude (optional): Variables included in the supplied background data can be excluded from the estimation model of the PVs and used only for imputing missing values. In the cross-sectional case, the argument is a character vector (e.g., c("var1", "var3")); in the longitudinal case, it must be a named list specifying the excluded variables for each wave (e.g., exclude = list(w1 = "var1", w3 = c("var1", "var3"))).
- seed (optional): For reproducibility, a specific seed can be set for the random number generators.
- control (optional): The list can contain logicals indicating whether point estimates in the form of WLEs and expected a posteriori estimates (EAPs) should be returned: for EAP = TRUE, the EAPs are returned as well; for WLE = TRUE, the WLEs. Additional arguments are passed on to the estimation algorithm in TAM's tam.mml() and tam.pv() functions. Further control options for the CART-based imputation are collected in the list ML: minbucket defines the minimum number of observations in any terminal CART node (default: 5), and cp determines the minimum decrease of the overall lack of fit required by each CART split (default: 0.0001).
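Taken together, a cross-sectional call for the adult cohort might look as follows (the path, the prepared background data, the seed, and the domain abbreviation "RE" for reading are illustrative assumptions; see Fuß et al., 2021, for the actual abbreviations):

```r
library(NEPSscaling)

# illustrative background data prepared by the user; must contain ID_t
bgdata <- readRDS("C:/Users/name/NEPS_data/bgdata.rds")

pv_obj <- plausible_values(
  SC           = 6,       # starting cohort 6: adults
  domain       = "RE",    # reading (abbreviation assumed here)
  wave         = 3,       # one of the SC6 assessment waves 3, 5, 9
  path         = "C:/Users/name/NEPS_data/",
  bgdata       = bgdata,
  npv          = 10,      # number of returned PV sets (default)
  min_valid    = 3,       # minimum number of valid responses (default)
  longitudinal = FALSE,   # cross-sectional PVs for the given wave
  include_nr   = TRUE,    # not-reached items as proxy for test-taking effort
  seed         = 1234,    # for reproducibility
  control      = list(WLE = TRUE)  # also return WLE point estimates
)
```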
After the estimation of PVs, the functions print(x) and summary(object) give a quick overview of the specified model and estimated model parameters. The only required argument is the R object resulting from using plausible_values(). To facilitate the exploration of the resulting R object, the package also contains a number of extraction functions such as get_domain(pv_obj), get_info_criteria(pv_obj), get_pv_list(pv_obj), or get_pv_index(pv_obj, index). Moreover, the CART imputation can be visualized with display_imputation_tree(pv_obj, imputation, variable) that generates a plot displaying the specific tree constructed to impute a single variable. If the graphical representation becomes too complex, a character representation of the tree can be inspected using get_imputation_tree(pv_obj, imputation, variable).
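Assuming pv_obj is the object returned by plausible_values(), a typical inspection might proceed as follows (the variable name "gender" is a hypothetical placeholder for an imputed background variable):

```r
print(pv_obj)              # quick overview of the specified model
summary(pv_obj)            # estimated model parameters

get_domain(pv_obj)         # competence domain of the estimation
get_info_criteria(pv_obj)  # model fit information
get_pv_list(pv_obj)        # all PV sets
get_pv_index(pv_obj, 1)    # the first PV set

# CART constructed in the first imputation for "gender" (hypothetical name)
display_imputation_tree(pv_obj, imputation = 1, variable = "gender")
get_imputation_tree(pv_obj, imputation = 1, variable = "gender")
```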
The package also provides means to easily export the estimated PVs together with their imputed background data in case analyses with the PVs are to be conducted in different software. The write_pv() function takes the arguments pv_obj (the resulting R object), path (where the data is to be stored), and ext (a string indicating the storage format, i.e., SPSS, Stata, or Mplus).
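For example, assuming pv_obj holds the result of plausible_values(), an export for Stata might be sketched as (the path is illustrative):

```r
# Export PVs plus imputed background data for analysis in Stata
# (path illustrative; ext accepts SPSS, Stata, or Mplus)
write_pv(pv_obj, path = "C:/Users/name/NEPS_results/", ext = "Stata")
```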
Typical workflow
Estimating PVs for competence tests in the NEPS typically follows several consecutive steps that depend on two data sources. First, the SUFs including the raw competence test data need to be obtained from https://neps-data.de. Data access requires a free, non-commercial data use agreement with the NEPS research data center. These data are necessary to estimate the IRT part of the plausible values model and need to be stored so that they are accessible to the current R session; ideally, all raw data files are located in the same folder. Second, the carefully selected background variables for the PV estimation need to be prepared by the user to ensure congeniality with the intended analyses. This data preparation can be done with any statistical software, and users are advised to check the data for plausibility before starting any further analyses. The only requirement for using NEPSscaling is that the resulting background data is stored in tabular format, either in SPSS's sav format, Stata's dta format, or R's rds format (when using the graphical user interface), or that it is imported into the current R session as a data.frame (when invoking the estimation from an R script). Special attention needs to be paid to missing data in the background data: missing values must be coded as R's NA, and categorical variables have to be converted to R's factors. Further, the selected background variables should either be assessed at the same time point for which the PVs will be estimated or contain time-constant information to avoid inconsistencies. After the preparation of the background data, PVs can be estimated via an R script and the functions outlined above or via the graphical user interface provided by the NEPSshiny app.
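The preparation step can be sketched in R as follows (the file paths, the variable names gender and hisei, and the assumption that missing values are coded as negative numbers are all illustrative; haven is one option for reading SPSS files):

```r
# Read background data prepared in SPSS (path illustrative)
library(haven)
bgdata <- as.data.frame(read_sav("C:/Users/name/NEPS_data/background.sav"))

# keep the person identifier ID_t plus the selected covariates
bgdata <- bgdata[, c("ID_t", "gender", "hisei")]

# recode survey missing codes (assumed negative here) to R's NA
bgdata[bgdata < 0] <- NA

# categorical variables must be R factors
bgdata$gender <- as.factor(bgdata$gender)

# store as rds, e.g., for use in the graphical user interface
saveRDS(bgdata, "C:/Users/name/NEPS_data/bgdata.rds")
```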
Last, NEPSscaling versions always depend on specific versions of the SUFs because the package's functions address the competence variables in the SUFs by name. Therefore, if variable names change in the SUFs, they are changed accordingly in the newest version of the package. As a consequence, newer SUF versions may no longer be compatible with older package versions, and vice versa. It is thus recommended to always use the latest versions of both the SUF and the package to ensure a match. Further, it is advised to state package versions explicitly to ensure reproducibility.
Surrounding NEPSscaling
The estimation of PVs can and should be conducted with several things in mind. First, the imputation model may unduly impact the results; conducting sensitivity analyses with different imputation models or evaluating the efficiency gains obtained through using PVs is therefore advisable. Second, there are particularities in working with plausible values. For example, any further analyses (e.g., regression analyses) that use PVs must be conducted separately for each set of PVs. The results of these analyses then need to be pooled using Rubin's rules or another appropriate pooling procedure (Raghunathan et al., 2003). For further information on how to work correctly with PVs, see von Davier et al. (2009); for further information on the estimation of PVs as well as an example of pooling the results, see Scharl et al. (2020).
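These two steps, analyzing each PV set separately and then pooling, can be sketched in base R as follows (pv_obj is the result of plausible_values(); the outcome column PV and the covariate hisei are hypothetical names, and Rubin's rules are written out by hand for transparency):

```r
# one analysis per PV data set
pvs  <- get_pv_list(pv_obj)

# hypothetical analysis model: regress the PV on a numeric covariate
fits <- lapply(pvs, function(d) lm(PV ~ hisei, data = d))

est <- sapply(fits, function(m) coef(m)["hisei"])          # point estimates
u   <- sapply(fits, function(m) vcov(m)["hisei", "hisei"]) # sampling variances

# Rubin's rules: total variance = within + (1 + 1/m) * between
m          <- length(fits)
pooled_est <- mean(est)
pooled_se  <- sqrt(mean(u) + (1 + 1 / m) * var(est))
```

Functions such as mice::pool() or the mitools package implement the same pooling logic and can be used instead of the manual computation.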