Bayesian historical borrowing with longitudinal large-scale assessments

The purpose of this paper is to extend and evaluate methods of Bayesian historical borrowing applied to longitudinal data with a focus on parameter recovery and predictive performance. Bayesian historical borrowing allows researchers to utilize information from previous data sources and to adjust the extent of borrowing based on the similarity of current data to historical data. We examine the utility of three static historical borrowing methods including complete pooling, Bayesian synthesis with aggregated data-dependent priors, traditional power priors, and two dynamic borrowing methods including Bayesian dynamic borrowing and commensurate priors. Using data from two administrations of the United States Early Childhood Longitudinal Study, we evaluate these methods in terms of in-sample simulation statistics, as well as pseudo out-of-sample measures of predictive performance. A case study examining growth in reading competency over time revealed that for one historical cycle, most methods of historical borrowing perform similarly with respect to predictive performance and parameter recovery. Pooling and power priors performed relatively poorly across the conditions in this study, particularly when the current data and the historical data were heterogeneous. Results from a comprehensive simulation study revealed that the advantages of different historical borrowing methods vary across different evaluation criteria. Overall, Bayesian dynamic borrowing and commensurate priors are no worse, and in some cases better, than other methods in terms of parameter recovery and predictive performance, and considering a previous paper by Kaplan et al. (Psychometrika, 10.1007/s11336-022-09869-3, 2022) found clear benefits of Bayesian dynamic borrowing over other methods of historical borrowing in the multilevel context using data from the Program for International Student Assessment (PISA) with five historical cycles, this paper argues that Bayesian dynamic borrowing or commensurate priors is a prudent choice for borrowing information from previous cycles of longitudinal data.


Introduction
With the increased availability of large-scale longitudinal surveys and assessments, researchers can address critical questions of growth and development. Such endeavors as the German National Educational Panel Study (NEPS) (Blossfeld & Roßbach, 2019) or the US Early Childhood Longitudinal Studies Program (ECLS; NCES, 2018) provide extensive data on the academic, behavioral, and social development of children throughout the school years and beyond. These data can be used to estimate rates of change in relevant academic and non-academic outcomes over the waves of the study as well as to model variation in rates of change over children as a function of a very large number of covariates collected on children, their families, and their schools.
In addition to providing estimated rates of growth in academic and non-academic outcomes, longitudinal studies can be leveraged for the purposes of forecasting growth beyond the extant waves of the study. Although such uses are rare in education (see however Kaplan & Huang, 2021), it still remains an important use of longitudinal data. This paper is concerned with the question of whether it is possible and desirable to borrow information from an analysis of the waves of a previous cycle of longitudinal data to inform the analysis of the waves of a current longitudinal study, particularly with respect to the estimation of growth and predictive performance. With a focus on the ECLS Kindergarten cohorts (ECLS-K), the question concerns specifically whether estimation of the growth rates in reading performance from the 2010 to 2011 cohort (referred to as ECLS-K:2010-11) can be improved by systematically including information about the growth rates from the 1998 to 1999 cohort (referred to as ECLS-K:1998-99).
The methods we examine in this paper are generally referred to as Bayesian historical borrowing, a class of procedures that have long been applied in the clinical trials field (e.g. Pocock, 1976;Hobbs et al., 2011;Hobbs et al., 2012;Schmidli et al., 2014;Viele et al., 2014), and recently developed and applied to large-scale and cross-sectional educational data (see Kaplan et al., 2022). Two general approaches to historical borrowing can be identified in the literature. The first are static borrowing methods where prior strength does not automatically vary based on the similarity between the historical data and the current data. For example, with static borrowing, fixed prior strength might be based on a researcher's judgement regarding the similarity between the historical data and current data, but this prior strength would not automatically be adjusted based on the heterogeneity between historical data and current data to supplement the researcher's judgement. Methods under static borrowing that we will study in this paper include pooling, also known as integrative data analysis (Bainter & Curran, 2015;Curran & Hussong, 2009), Bayesian synthesis using augmented or aggregated data-dependent priors (Marcoulides, 2017), and power priors Chen et al., 2000;Chen et al., 2015).
In contrast to static borrowing, dynamic borrowing methods allow for a joint prior distribution to be specified over both the historical and current data to encode the researcher's judgement regarding the similarity between the historical data and the current data. The similarity is controlled by the variance of the joint prior distribution. Methods under dynamic borrowing that will be examined in this paper include Bayesian dynamic borrowing (see Viele et al., 2014;Kaplan et al., 2022) and commensurate priors (Hobbs et al., 2011(Hobbs et al., , 2012. For a review of the static and dynamic borrowing methods examined in this paper, one may refer to Kaplan et al. (2022).
The organization of this paper is as follows. In the next section, we specify growth curve modeling as a Bayesian hierarchical model. Although growth curve modeling is not the only method that can be applied to longitudinal data, we focus on this method because of its extensive use and its flexibility in addressing specific issues of growth. In the following two sections, we extend static and dynamic borrowing methods to growth curve models. Following this, we then discuss the methodology for how to evaluate such models -particularly using statistics that assess in-sample and pseudo out-of-sample predictive performance. This is then followed by the design and results of our case study and simulation study. Future research directions are outlined in the Conclusions section.

Bayesian growth curve modeling
To fix terminology and notation, we use the term cycle to refer to the different survey administrations (i.e. ECLS-K:1998-99 andECLS-K:2010-11), and use the term wave to refer to the repeated measurements within the cycle. Let superscript 0 represent the current cycle, ECLS-K:2010-11, and h ( h = 1 . . . H ) represent the historical cycle(s), where for our paper H = 1 with ECLS-K:1998-99 as the only historical cycle. The level-1 (within-student) model can be written as follows. Let where y 0 ig be a T × 1 vector of T waves of measurement for student i ( i = 1, . . . , n g ) in school g ( g = 1, . . . G) ; 0 be a T × Q matrix of fixed constants that, for notational simplicity, are assumed to be the same for all participants across all schools. These fixed constants serve to parameterize the growth model as a structural equation model (see Willett & Sayer, 1994). Further, let η 0 ig be a Q × 1 vector of random growth parameters for student i in school g. In our study, Q = 3 , representing the intercept, linear growth component, and quadratic growth component for each student in each school; and ǫ 0 y ig is a T × 1 vector of residuals, with diagonal covariance matrix 0 y . Here, we assume constant variances across students and schools, and so The level-2 (between-student) model allows the random growth parameters to be related to a set of time-invariant predictors. The level-2 model can be written as where Ŵ 0 g is a Q × P matrix of regression coefficients associated with the time-invariant predictors which vary over schools, x 0 ig is a P × 1 vector of time-invariant predictors whose values vary over students and schools, and ǫ 0 η ig is a Q × 1 (i.e., intercept, slope and quadratic growth component in our study with Q = 3 ) vector of residuals with symmetric covariance matrix 0 η . We assume constant variances across students and schools, and so (1) Page 4 of 30 Kaplan et al. Large-scale Assessments in Education (2023) 11:2 which allows for the random growth parameters (i.e., intercept, slope, and quadratic components) to be correlated conditional on the predictors as shown in Eq. (4). The assumption of constant 0 η could, in principle, be relaxed. The level-3 (between-school) model can be written as where vec(·) is the vectorization operator that turns a matrix into a long column vector, 0 is a QP × M matrix of between-school regression coefficients, w g is a M × 1 vector of school-level predictors, and ǫ 0 Ŵ g is a QP × 1 vector of residuals with covariance matrix 0 Ŵ . We assume constant variances across schools, and so

Bayesian hierarchical growth curve model
A Bayesian hierarchical specification of the growth curve model in Eqs.
(1-6) can be written as where 0 and 0 are fixed and known parameters. Prior distributions on the residual covariance matrices are assumed to be inverse-Wishart (IW). For the current data, following the notation in Eq.
(2), we assume all the diagonal elements of 0 y are equal to (σ 0 y ) 2 , where σ 0 y indicates the standard deviation of level-1 residual. The prior distributions for the variation parameters are specified as follows: Page 5 of 30 Kaplan et al. Large-scale Assessments in Education (2023) 11:2 Static borrowing for growth curve models Extensions of growth curve models to pooling and to aggregated data-dependent priors are relatively straightforward. In particular, assuming the conditions for pooling of longitudinal data are met (see Hofer & Piccinin, 2009), pooling of data would be relatively straightforward. Similarly, the aggregated data-dependent prior approach of Marcoulides (2017) would require obtaining averages of the parameters of interest from the historical cycles and implementing them as the hyperparameters of informative prior distributions for the current cycle. This section concentrates instead on expanding the power prior to longitudinal data.

The power prior for growth curve models
Consider again the Bayesian hierarchical specification of the GCM models in Eqs. (7a-7d). Among the entire set of model parameters, the top level parameters in the growth curve models are the common parameters across cycles that will borrow from historical data through power priors. The lower level parameters are cycle specific and thus there is no direct borrowing. The probability distribution of the historical data given both the common and the unique parameters across cycles can be written as From here, the power prior can be expressed as where a controls the strength of borrowing from historical data on p( | y 1 , . . . , y h , a) . Notice that when a = 0 , the prior does not depend on the historical data, and when a = 1 , the prior is the posterior distribution from the previous study.

Dynamic borrowing for growth curve models
As noted earlier, static borrowing methods do not incorporate information about the current cycle into the prior specification. In contrast, dynamic borrowing methods do incorporate the current cycle into the prior specification of the model parameters. In this section, we extend methods of dynamic borrowing to growth curve models, concentrating on Bayesian dynamic borrowing and commensurate priors.

Extensions of Bayesian dynamic borrowing to growth curve models
We adapt the cross-sectional multilevel modeling notation of Kaplan et al. (2022) to the case of growth curve models. We begin by borrowing from historical cycles to estimate the growth parameters. This requires defining a joint distribution of the growth parameters over the historical cycles (denoted as cycles 1 to H) and the current cycle (denoted as cycle 0), which is assumed to be a multivariate Guassian distribution with Page 6 of 30 Kaplan et al. Large-scale Assessments in Education (2023) 11:2 the Q(H + 1) × 1 mean vector µ η ig and Q(H + 1) × Q(H + 1) block-diagonal covariance matrix T η .
where following Eq. (7b), H) represents a Q × P vector of the time-invariant regression coefficients for the h th historical cycle and Ŵ 0 g represents a Q × P vector of the current time-invariant regression coefficients.
The covariance matrix of the random growth parameters can be written as where each element of Eq. (12) is a symmetric matrix as in Eq. (4). Next, the joint distribution of Ŵ 0 g , Ŵ 1 g , . . . , Ŵ H g are assumed to be multivariate Gaussian with the QP(H + 1) × 1 mean vector µ Ŵ g and QP(H + 1) × QP(H + 1) block-diagonal covariance matrix T Ŵ -viz.
where following Eq. (7c), µ Ŵ g = vec � 0 w 0 g , � 1 w 1 g , . . . , � H w H g , h (h = 1, . . . H) represents a QP × M matrix of the school-level regression coefficients for the h th historical cycle, and 0 represents a QP × M matrix of the school-level regression coefficient for the current cycle.
The covariance matrix of the time-invariant regression coefficients over the current and historical cycles, T Ŵ , is specified as being block diagonal, where the elements of T Ŵ contain the variances and covariances of the regression coefficients within each historical or current data set. We assume that elements outside the block diagonal of T Ŵ are null matrices.
Finally, the joint distribution of 0 , 1 , . . . , H is also assumed to be multivariate Gaussian with a QPM × 1 mean vector µ and QPM × QPM covariance matrix T -namely The covariance matrix T can be diagonal with elements τ 2 , which controls the degree of borrowing across cycles. Note that 0 , 1 , . . . , H follows the same mean vector µ and covariance matrix T as shown in Eq. (15) and thus the elements of µ and T are not cycle specific, indicating that borrowing across cycles takes place at the top level of the hierarchy. Hobbs et al. (2011) proposed dynamic versions of power priors, referred to as commensurate power priors, where the coefficient used to downweight the historical data is viewed as random and estimated based on a measure of the agreement between the current and historical data. Hobbs et al. (2011) also proposed general commensurate priors where the prior mean for the current parameters of interest is conditioned on the historical population mean and the prior precision τ , referred to as the commensurability parameter, which reflects the commensurability between the current and historical parameters. 1 Hobbs et al. (2011) evaluated commensurate priors in a scenario of borrowing one historical trial to analyze a single-arm trial, that is, assuming there is only one historical study and one parameter of interest, β . The location parameter or mean for β is µ 0 β for the current data and µ H β for the historical data. Then the commensurate prior for µ 0 β can be specified as µ 0 β ∼ N(µ H β ,1/τ). As discussed in Hobbs et al. (2012), the commensurate prior in Hobbs et al. (2011) suffers from the fact that diffuse priors could actually become undesirably informative and that the historical likelihood is considered as a component of the prior rather than data. Therefore, Hobbs et al. (2012) proposed a modified commensurate prior that incorporates historical data as part of the likelihood for the current parameter estimation and employs empirical and fully Bayesian modifications for estimating the commensurate parameter τ (e.g., as illustrated in Eq. 1 of their paper). They also extended the method to general and generalized linear mixed regression models in the context of two successive clinical trials.

Commensurate priors
The modified commensurate priors approach in Hobbs et al. (2012) was compared to several meta-analytic models where priors for the historical parameters and current parameters were jointly modeled, but historical data were not incorporated in the likelihood of the current parameter estimation and thus the priors were not commensurate or dynamic. Commensurate priors were shown to provide more bias reduction compared to the meta-analytic approaches they evaluated. The bias reduction was larger when there was only one historical study compared to when there were two or three historical studies.
Although Kaplan et al. (2022) extended Bayesian dynamic borrowing to cross-sectional single-level and multilevel models with covariates, they did not examine commensurate priors. For this paper, we consider the modified commensurate prior in Hobbs et al. (2012) and implement it in the multilevel setting with multiple historical studies in the following way. For regression coefficients, we assume: where β 1 , . . . , β H are regression coefficients of interest for each historical cycle. Although regression coefficients are likely to be different within a historical cycle, the common regression coefficients (e.g., intercepts) are assumed to be equal across historical cycles and denoted as β Hist , where β Hist can be given a vague Gaussian prior.
The parameters for the current cycle (denoted as cycle 0) follow a prior distribution with the historical regression coefficients as the prior mean as follows: where � β can be specified as a diagonal matrix, diag (σ 2 1 , . . . , σ 2 P , ) , for P regression coefficients, and each element of the diagonal matrix can be provided its own prior distribution, such as inverse-gamma (IG), half-Cauchy (see e.g. Gelman, 2006), spike-and-slab (Mitchell & Beauchamp, 1988), etc.
Considering a two-level setting such as students nested in schools, the priors for the school-level covariance matrices for historical cycles can be specified as follows: where 1 , . . . , H are covariance matrices for each historical cycle. The common elements of the covariance matrices (e.g., variances) are assumed to be equal across historical cycles and denoted as Hist , and is a correlation matrix following the Lewandowski et al. 2009 (LKJ) prior. 2 For the current cycle (denoted as cycle 0), the inverse of covariance matrix can follow a prior distribution conditioning on the historical covariance matrix as For multilevel settings with three levels or more, the higher level covariance matrices may follow the similar prior specifications as the above.
Note that Bayesian dynamic borrowing differs from commensurate priors insofar as the joint prior distribution for the former contains the current cycle, while commensurate priors place a prior on the common historical parameters of interest first and then the current parameter has a prior distribution with the historical regression coefficients as the mean as shown in Eq. (17).

Extensions of commensurate priors to growth curve models
Our extension of commensurate priors to growth curve models closely follows the notation for commensurate priors in multilevel settings. We consider a multilevel growth curve model with multiple time points within an individual and individuals nested in groups such as students nested in schools. To simplify the notation, we stack regression coefficients of the growth curve model, including growth parameters and regression coefficients of individual-level time-invariant predictors and group-level predictors, together to be β . We let I and G denote the corresponding individual-level and grouplevel covariance matrices. The prior specification for historical regression coefficients β 1 , . . . , β H can follow those in Eq. (16) and the prior specification for current regression coefficients β 0 can follow those in Eq. (17). Similarly, the prior specification for historical covariance matrices 1 I , . . . , H I and 1 G , . . . , H G can follow those in Eq. (18) and the prior specification for current covariance matrices 0 I and 0 G can follow those in Eq. (19). We introduce a modification to the estimation of the commensurate prior for this study. Instead of using the spike-and-slab prior (Mitchell & Beauchamp, 1988) used by Hobbs et al. (2012) for the commensurability parameter, for computational simplicity and numerical stability, we utilize an extension of the horseshoe prior (Carvalho et al., 2010) developed by Piironen and Vehtari (2017) to account for commensurability. The horseshoe prior is a global-local shrinkage prior that combines together two priors: a global prior for all of the coefficients in the current cycle, which has the effect of shrinking all coefficients toward historical coefficients, and a local prior for each of the predictors in the current cycle, which has the effect of relaxing the shrinkage due to the global prior for coefficients that are away from historical coefficients.
Following the notation in Betancourt (2018), the horseshoe prior for the p th element of β 0 can be specified as follows: where τ 0 is a hyperparameter that controls the behavior of the global shrinkage prior τ (Carvalho et al., 2010). 3 A limitation of the conventional horseshoe prior relates to the regularization of the large coefficients. Specifically, it is still the case that large coefficients can transcend the global scale set by τ 0 with the impact being that the posteriors of these large coefficients can become quite diffused, particularly in the case of weakly-identified coefficients (Betancourt, 2018). To remedy this issue, Piironen and Vehtari (2017) proposed a regularized version of the horseshoe prior (also known as the Finnish horseshoe prior) that has the following form: where s 2 is the variance for each of the p predictor variables, assumed to be constant, and c is the slab width. The hyperparameters of the inverse-gamma distribution in Eq. (21d) induce a Student-t ν (0, s 2 ) distribution for the slab (see Piironen & Vehtari, 2017, for more detail). For our paper, two changes were implemented to the regularized horseshoe. First, we set the mean of β 0 p to β hist p rather than to zero in order for shrinkage to be toward the historical mean, i.e., β 0 p ∼ N (β hist p , τ˜ p ) . Second, we set s 2 = 1 due to standardization of the data.

Evaluating predictions under historical borrowing
For this paper, we evaluate historical borrowing for longitudinal data using two distinct approaches. The first is based on classic approaches developed in the econometrics literature based on so-called in-sample simulations wherein the growth record predicted by the model under various methods of borrowing is compared to the actual growth record. This approach was used by Kaplan and George (1998) as a general framework for the evaluation of frequentist growth curve models without relying on conventional goodness-of-fit tests. The second approach is based on the use of scoring rules to evaluate the overall predictive accuracy of probabilistic forecasts. We refer to these methods as pseudo out-of-sample performance measures because they will be used to compare the predicted distribution of the outcome from the last wave of the cycle to the known distribution of the outcome.

In-sample simulations
Following closely the discussion in Kaplan and George (1998) in the context of latent variable growth models, an alternative form of model assessment that does not rely solely on conventional goodness-of-fit statistics in structural equation models (see e.g. Kaplan, 2009) concerns how well the model can reproduce the known growth trajectory. In the context of economic forecasting, this type of model evaluation is referred to as insample simulation (Pindyck & Rubinfeld, 1991). Given the estimated parameters of the model and estimated average values of any exogenous predictors, in-sample simulations can be used to predict the known growth record. These in-sample simulation statistics, to be described below, can then be applied to assess how well the model-based predictions fit the known growth record. The result of such an exercise is a form of model evaluation that goes beyond simply assessing overall goodness-of-fit and considers the utility Page 11 of 30 Kaplan et al. Large-scale Assessments in Education (2023) 11:2 of the model for some other purpose. In this case, one might be concerned with simulation adequacy when considering the use of these models for subsequent forecasting. Based on the early work of Theil (1966, see also;Pindyck & Rubinfeld, 1991;Kaplan & George 1998), we use three in-sample simulation measures to assess how well the fitted growth trajectory compares to the actual growth trajectory. While other measures are available, these three are considered classic measures in the econometric literature.

Theil's inequality coefficient
The first measure we consider is Theil's 1966 inequality coefficient, defined as where T is the number of time periods, y s is the simulated (i.e., predicted) value at time t and y a is the actual value at time t. In the context of this paper, y s is the model based predicted value of y at time t under different scenarios of dynamic borrowing, and y a is the actual mean (over individuals) of y at time t for the current cycle ECLS-K:2010-11. From Eq. (22), it can be seen that a value of U = 0 indicates a perfect fit of the simulated growth record to the actual growth record. On the other hand, if U = 1 , the simulation adequacy is as poor as possible.

Theil's bias proportion
As shown by Theil (1966) and also discussed in Pindyck and Rubinfeld (1991), the inequality coefficient can be decomposed into measures that provide different perspectives on the quality of simulation performance. The first component of Theil's U is the bias proportion, defined as where ȳ s and ȳ a are the means of the simulated and actual growth record, respectively, calculated across the T time periods. The bias proportion provides a measure of systematic error because it considers deviations of average actual values from average simulated values. The ideal would be a value of U B = 0. A suggested rule of thumb is that values over 0.1 or 0.2 are indicative of systematic bias (Pindyck & Rubinfeld, 1991, p. 341).

Theil's variance proportion
Another component of Theil's U is the variance proportion defined as where σ s and σ a are the standard deviations of the simulated and actual growth records, respectively, calculated across the T time periods. The variance proportion provides a measure of the extent to which the model tracks the variability in the growth record. As noted by Pindyck and Rubinfeld (1991), if U V is large, it suggests that the actual (or simulated) growth record varies a great deal while the simulated (or actual) growth record does not deviate by a comparable amount. 4

Pseudo out-of-sample scoring rules for probabilistic forecasts
A central characteristic of statistics is to develop accurate predictive models (Dawid, 1984). Indeed, as pointed out by Bernardo and Smith (2000), all other things being equal, a given model is to be preferred over other competing models if it provides better predictions of what actually occurred. Thus, a critical component in the development of accurate predictive models is to decide on rules for gauging predictive accuracy-often termed scoring rules. Scoring rules provide a measure of the accuracy of probabilistic forecasts, and a prediction can be said to be "well-calibrated" if the assigned probability of the outcome matches the actual proportion of times that the outcome occurred (Dawid, 1982). A large number of scoring rules have been reviewed in the literature (see e.g., Winkler, 1996;Bernardo & Smith, 2000;Jose et al., 2008;Merkle & Steyvers, 2013;Gneiting & Raftery, 2007). Here, however, we focus on two strictly proper scoring rules that are commonly used to evaluate predictive distributions: the Kullback-Leibler divergence score and expected log point-wise predictive density implemented through leave-one-out cross-validation.

Kullback-Leibler Divergence score
For this paper, we evaluate the quality of predictions using the Kullback-Leibler Divergence (KLD) score (Kullback & Leibler, 1951;Kullback, 1959;1987). Consider two distributions, p(y) and g(y|θ) , where p(y) could be the distribution of observed reading literacy scores, and g(y|θ) could be the prediction of these reading scores based on a model. The KLD between these two distributions can be written as where KLD(f, g) is the information lost when g(y|θ) is used to approximate p(y). For this paper, the actual reading outcome scores will be compared to the predicted outcome using different methods of borrowing from historical data, including not borrowing historical data at all. The model with the lowest KLD measure is deemed best in the sense that the information lost is lowest when approximating the actual reading outcome distribution with the distribution predicted on the basis of the model.

Leave-one-out cross validation information criterion (LOOIC)
In addition to KLD, we also examine the predictive performance of Bayesian historical borrowing methods using the leave-one-out cross validation information criterion (LOOIC). Leave-one-out-cross-validation (LOOCV) is a special case of k-fold cross-validation (k-fold CV) when k = n , with n indicating the number of observations. In k-fold CV, a sample is split into k groups (folds) and each fold is taken to be the validation set with the remaining k − 1 folds serving as the training set. For LOOCV, each observation serves as the validation set with the remaining n − 1 observations serving as the training set. For each observation, the expected log point-wise predictive density (ELPD) is calculated and serves as a score of the predictive accuracy for n data points taken one at a time (see Vehtari et al., 2017) 5 An information criterion referred to as the LOOIC can then be obtained as function of the estimated ELPD. Among a set of competing models, the one with the smallest LOOIC is considered the best from an out-of-sample pointwise predictive point of view. For this paper, we evaluated two different LOOIC indices with students being left out one at a time and with schools being left out one at a time.
We obtain the Bayesian LOOIC provided by the loo software program , available in R (R Core Team, 2022).

Data source: the Early Childhood Longitudinal Study
This paper will utilize data from the two extant cycles of the Early Childhood Longitudinal Study (ECLS). Specifically, we will focus on the ECLS kindergarten cohort of 2011 (ECLS-K:2010-11) and utilize the ECLS kindergarten cohort of 1998-99 as prior information to inform our simulation studies as well as using these data sources for our case study. The ECLS program was sponsored by the National Center for Education Statistics (NCES) and is a component of the NCES Longitudinal Studies Program. The main purpose of ECLS is to provide policymakers, researchers, and the interested community at large with a rich description of children's early experiences in school.

ECLS-K:1998-99
The database used in this paper to provide priors for the analysis of ECLS-K  (2015), and the spring of fifth grade (2016), 9 time points in total with 6 time points being the same grade levels as those collected in the ECLS-K:1998-99 study (i.e., fall and spring of kindergarten, fall and spring of first grade, the spring of third grade and the spring of fifth grade). Note that although the study refers to later rounds of data collection by the grade the majority of children were expected to be in (that is, the modal grade for children who were in Kindergarten in the 2010-11 school year), children were included in subsequent data collections regardless of their grade level.

Design of the case study
We conducted a comprehensive case study to evaluate the predictive performance of various Bayesian historical borrowing methods via a growth curve model with longitudinal assessments. For the historical cycle ECLS-K:1998-99, we used the data from six time points that are in common with ECLS-K:2010-11, including the fall and spring of kindergarten, the fall and spring of first grade, the spring of third grade and the spring of fifth grade. The spring of eighth grade is not used considering that this time point was not collected for ECLS-K:2010-11.
We evaluated the effects of socio-economic status (SES) at the student level and the percentage of free lunch eligible students (denoted as FLunch) at the school level on the growth trajectory of students' reading scores from fall kindergarten to spring of third grade and then used the model to predict the reading score of fifth grade for the current cycle of ECLS-K:2010-11. Missing data was addressed by performing multilevel multiple imputation for each cycle separately using the Blimp software program (Enders et al., 2018;Keller & Enders, 2019). For simplicity, we used the first imputed data set for our analyses. 6 Summary statistics for SES and free lunch eligible student percentage as well as for reading scores at different time points for ECLS-K:1998-99 and ECLS-K:2010-11 are shown in Table 1. The time points are coded based on the number of semesters from the fall of kindergarten with the fall kindergarten denoted as time point 0. For example, the spring of kindergarten is one semester away from the fall of kindergarten and thus its time point is 1. For ECLS-K:2010-11, we used the first eight time points (Time 0 to 5, 7 and 9) to estimate growth curve parameters via different historical borrowing methods and then evaluated their performance on predicting students' reading scores at the last time point (Time 11), which is the spring of fifth grade.

Model specification
A Bayesian multilevel growth curve model is fit with reading scores at different time points as level-1, student as level-2, and school as level-3, which is consistent with the design of both cycles. The reading score is the outcome, SES is the student-level predictor, and percentage of free lunch eligible students is the school-level predictor. The intercept, linear and quadratic terms of time were included in the model to evaluate students' starting points, growth rates and acceleration rates of reading achievement. The interaction between linear and quadratic terms with SES were also included. The student-level intercept and slope (starting point and growth rate) and the school-level intercept were allowed to be random (Bollen & Curran, 2006;Kaplan, 2009). As the scales of variables included in the models vary greatly, all the variables were standardized first and their z-scores were used in the estimation. Then, all the parameters were converted back to their original scales after the estimation.

Sample size
We evaluated the performance of different priors using the sample of female students only ( N = 1861 ). The results for the male students were virtually identical across all conditions of the case study and are available in the supplementary material. To evaluate the impact of sample size on the performance of different Bayesian historical borrowing methods, in addition to the full female sample, a small subsample of female students from high poverty schools was obtained (defined as 75% or above of students in a school who are free lunch eligible). The female subsample has N = 620 students.

Choice of priors
We evaluated the performance of dynamic priors, which incorporate the potential heterogeneity between historical data and current data through a joint prior distribution, and compared it to regular priors with predetermined prior values and strength. Specifically, for dynamic priors, we varied the IG prior for τ 2 at IG (1, 1) and IG (1, 0.001) to allow for different degrees of borrowing for coefficients. Moreover, the precision matrix of the random intercept and random slope has a Wishart distribution prior 7 , W (ν, νS −1 ) where ν takes on the values 2 (weak borrowing) or 20 (strong borrowing) and S = ′ S S is the baseline covariance matrix where S is a diagonal matrix whose diagonal elements are distributed as half-Cauchy(0,1) and ∼ LKJCorr(1) (Lewandowski et al., 2009). For commensurate priors, we also used a Wishart prior for the precision matrix of the random intercept and random slope, W (ν, νS −1 ) , where ν takes on the value of 2. For power priors, we varied the a parameter using values of .25, .50 and .75. For aggregated data-dependent priors, the estimated coefficients from historical data were used as the prior mean and the prior variances of historical coefficients were used as the prior variances. For comparison purposes, two extreme kinds of borrowing, complete pooling and no borrowing of the historical data sets were also examined. We specified a weakly informative half-Cauchy (0,1) prior for the standard deviation σ of the individual-level error term, and a non-informative N (0, 10 2 ) prior for the school-level coefficients across all cycles ( Ŵ 0 ) in BDB and the mean school-level coefficients in the current cycle ( µ ) in the non-informative prior conditions. All analyses were conducted within the R Page 17 of 30 Kaplan et al. Large-scale Assessments in Education (2023) 11:2 programming environment (R Core Team, 2022) using rstan (Stan Development Team, 2021). All code for the case study are available in the supplementary material.

Results of the case study
As mentioned earlier, there are two different scenarios evaluated in the case study, namely, female students in high poverty schools and the full sample of female students. For these scenarios, results are presented in two tables with the first for regression coefficient and variation estimates and the second for predictive performance. These results are presented in Tables 2 and 3 for female students in high poverty schools, and Tables 4  and 5 for all female students, respectively. Across different borrowing methods, Bayesian dynamic borrowing and commensurate priors provided similar coefficient and variation estimates to those with no borrowing, indicating that the historical data and current data are heterogeneous. Pooling and power priors with greater amounts of borrowing (i.e., a = 0.5 and a = 0.75 ), on the other hand, provided similar coefficient and variation estimates and showed differences from no borrowing and dynamic borrowing priors, particularly on the coefficient and variation estimates of student-level and school-level intercepts.
In terms of predictive performance, overall, different borrowing methods performed similarly for full samples and high poverty school samples. Theil's equality coefficients were nearly identical, with pooling and power priors having slightly smaller variance proportions. Pooling, aggregated data-dependent priors, and power priors had relatively smaller RMSE between predicted and observed reading scores at the spring of 5th grade and smaller KLD. No borrowing provided larger RMSE and KLD, but the smallest LOO-ICs. Bayesian dynamic borrowing under IG(1, 0.001) had a smaller LOOIC compared to pooling and power priors. Across all the prediction evaluation criteria, aggregated datadependent priors performed well overall (i.e., smaller RMSE and KLD compared to no borrowing and smaller LOOICs compared to pooling, power priors and dynamic borrowing methods). Due to the heterogeneity between the historical data and the current data as reflected in Table 1, Bayesian dynamic borrowing methods did not outperform other borrowing methods, but would still be a reasonable choice in terms of providing smaller LOOICs compared to pooling and smaller RMSEs compared to no borrowing.

Design of the simulation study
The results of the case study indicate that the cycles of ECLS-K:1998-99 and ECLS-K:2010-11 are relatively heterogeneous in terms of the effects we evaluated such that Bayesian dynamic borrowing and commensurate priors borrow less due to data heterogeneity and provide estimates similar to Bayesian multilevel growth curve model with non-informative priors (i.e., no borrowing). In order to study the performance of different Bayesian historical borrowing methods under different levels of data heterogeneity as well as varying levels of sample size, a comprehensive simulation study was further conducted.

Model specification and estimation
For the simulation study, a Bayesian multilevel linear growth curve model was used (details to follow). For the Markov chain Monte Carlo simulations, 2000 iterations Page 18 of 30 Kaplan et al. Large-scale Assessments in Education (2023)

54.25
Page 19 of 30 Kaplan et al. Large-scale Assessments in Education (2023) 11:2 (where the first 1000 were used as warm-up iterations and discarded) were run for each of the four chains. We ran 500 replications, out of which the model converged well for at least 489 replications. Only the replications with converged models were used. The replications with split-chain potential scale reduction factors R ≥ 1.05 were discarded (see e.g. Gelman et al., 2014).

Data generation and simulation conditions
For our simulation study, we evaluated the impact of the number of schools, school size, heterogeneity of historical information, and prior choice. We used the historical cycle ECLS-K:1998-99 as the base to generate the data of the current cycle. Let G denote the number of schools and let n denote the number of students per school. We examined four different sample sizes: (1) G = 10, n = 20; (2) G = 10, n = 40, (3) G = 20, n = 20, Leave-one-out cross validation information criterion with each student left out one at a time; LOOIC (School): Leave-one-out cross validation information criterion with each school left out one at a time; BLR: Bayesian Linear Regression; AGDP: aggregated data-dependent prior; PP: power prior; BDB: Bayesian dynamic borrowing; IG: inverse-gamma prior for time-level variance of the joint prior distribution, which determines the degree of time-level borrowing; W2: Wishart prior with weak borrowing for student-level (the former) or school-level (the latter) precision matrix (results were converted back the covariance matrix); W20: Wishart prior with strong borrowing for student-level (the former) or school-level (the latter) precision matrix (results were converted back the covariance matrix); CP: commensurate prior Page 20 of 30 Kaplan et al. Large-scale Assessments in Education (2023) 11:2 Table 4 Growth curve model coefficient and variation estimates for the full sample of female students Intcp:intercept; t: time; Stu-Level: student-level; Sch-Level: school-level; Var(R): variance of time-level residual; BLR: Bayesian Linear Regression; AGDP: aggregated data-dependent prior; PP: power prior; BDB: Bayesian dynamic borrowing; IG: inverse-gamma prior for time-level variance of the joint prior distribution, which determines the degree of time-level borrowing; W2: Wishart prior with weak borrowing for student-level (the former) or school-level (the latter) precision matrix (results were converted back the covariance matrix); W20: Wishart prior with strong borrowing for student-level (the former) or school-level (the latter) precision matrix (results were converted back the covariance matrix); CP: commensurate prior Page 21 of 30 Kaplan et al. Large-scale Assessments in Education (2023) 11:2 and (4) G = 20, n = 40. For the historical data ECLS-K:1998-99, a random sample stratified by schools was selected with one of the sample size scenarios mentioned above. The selected within-student variables included the reading score and a linear (t) and quadratic ( t 2 ) component of time. Same as the case study, the between-student variable in the simulation study is SES, and the school-level variable is percentage of students who are eligible for free lunch in each school. Data for the current cycle with each of the above four sample sizes were generated with different degrees of heterogeneity compared to the historical data. A Bayesian multilevel linear growth curve model was fit on the ECLS-K:1998-99 data with the reading score as the outcome, t and t 2 as within-student predictors, SES as the student-level predictor, and the percentage of free lunch eligible students in each school as the schoollevel predictor. The interaction terms between the linear and quadratic components of Table 5 Prediction performance of different borrowing methods for the full sample of female students Theil Inequ Coef: Theil's inequaility coefficient; Theil Bias Prop: Theil's bias proportion; Theil Var Prop: Theil's variance proportion; RMSE: root mean squared error of the predicted reading score vs. observed reading score at the spring of 5 th grade; KLD: Kullback-Leibler Divergence Score; LOOIC (Student): Leave-one-out cross validation information criterion with each student left out one at a time; LOOIC (School): Leave-one-out cross validation information criterion with each school left out one at a time; BLR: Bayesian Linear Regression; AGDP: aggregated data-dependent prior; PP: power prior; BDB: Bayesian dynamic borrowing; IG: inverse-gamma prior for time-level variance of the joint prior distribution, which determines the degree of time-level borrowing; W2: Wishart prior with weak borrowing for student-level (the former) or school-level (the latter) precision matrix (results were converted back the covariance matrix); W20: Wishart prior with strong borrowing for student-level (the former) or school-level (the latter) precision matrix (results were converted back the covariance matrix); CP: commensurate prior  Kaplan et al. Large-scale Assessments in Education (2023) 11:2 time and student-level SES as well as school-level free lunch eligible percentage were also included. The student-level intercept and slope, which are students' starting points and growth rates, were allowed to be random. The school-level intercept was treated as random. Fixed effect and random effect estimates from the historical cycle were obtained and used to generate the current cycle's data. That is, for the current cycle, predictor values were sampled from the ECLS-K:2010-11 data with a certain sample size, while the outcome (reading score) was generated with different generating coefficients. For the homogeneous condition, the regression coefficient estimates obtained based on the historical cycle were used as the generating coefficients to generate data for the current cycle. For the heterogeneous condition, the historical regression coefficient estimates with adjustments ranging from − 10% to + 150% were used to generate the data of the current cycles so that the regression coefficients in the current cycle would be heterogeneous compared to the historical regression coefficients (specific adjustments are included in the supplemental material). The growth trajectories of one simulated data set are illustrated in Fig. 1 for historical cycle (upper), homogeneous condition of the current cycle (middle) and heterogeneous condition of the current cycle (bottom), one plot per sample size. Data for the current cycle were generated with the same proportion of schools in each of the four poverty categories differentiated by percentage of students who are eligible for free lunch.

Stu-level
Regarding prediction, we used the same method as discussed in the case study. That is, for the current cycle, we used the first eight time points (Time 0 to 5, 7 and 9) to estimate growth curve parameters via different historical borrowing methods and then  Kaplan et al. Large-scale Assessments in Education (2023) 11:2 evaluated their performance on predicting students' reading scores at the last time point (Time 11), which is the spring of fifth grade. With regard to prior choice, similar to the case study, we assessed the performance of dynamic priors and compared it to the no borrowing case, aggregated data-dependent priors, complete pooling, and power priors. Specifically, for Bayesian dynamic borrowing, we varied the inverse-Gamma prior for τ 2 at IG(1, 1) and IG(1, 0.001). The precision matrices of the student-level random intercepts and random slopes have Wishart distribution W (ν, νS −1 ) where ν takes 2 (weak borrowing) or 20 (strong borrowing) and S = ′ S S is the baseline precision where S is a diagonal matrix whose diagonal elements are distributed as half-Cauchy(0, 1) and ∼ LKJCorr(1) . For commensurate priors, Wishart prior with ν = 2 was used for both student-level and school-level random effects. For power priors, again, we varied a h at 0.25, 0.50 and 0.75. All code for the simulation study are available in the supplementary material.

Results of the simulation study
Two sets of results from the simulation study are presented, one being the evaluation of growth curve model parameter recovery as illustrated in Fig. 2 (for regression coefficient estimates) and Fig. 3 (for variation estimates), and the other being the quality of prediction on students' reading score at the spring of 5 th grade across different borrowing Page 24 of 30 Kaplan et al. Large-scale Assessments in Education (2023) 11:2 methods as illustrated in Fig. 4 (for the homogeneous condition) and Fig. 5 (for the heterogeneous condition). Figures 2 and 3 display three columns of plots indicating no borrowing, static borrowing (pooling, aggregated data-dependent prior, and power prior), and dynamic borrowing (BDB and commensurate prior). The x-axis indicates two different heterogeneity (hm-homogeneous; ht-heterogeneous) by four different sample size conditions (Gnumber of schools; n-number of students per school), in total eight different conditions. The rows indicate different regression coefficients (in Fig. 2) or random variations (in Fig. 3), including student-level variance and covariance of random intercept and slopes, school-level variance of random intercept, and variance of residuals.
Root mean squared errors (RMSEs) between the generating/true parameters and estimated parameters are used to evaluate the accuracy of parameter estimates by different borrowing methods. Figure 2 shows that overall, RMSEs for regression coefficient estimates were smaller when sample sizes were larger and when the current data was more homogeneous to the historical data. Under the same sample size and heterogeneity condition, dynamic borrowing methods were better or similar to no borrowing across different regression coefficients. For example, BDB under the IG (1, 0.001) hyperprior and commensurate prior provided smaller RMSEs for the coefficient of FLunch than no borrowing. The static borrowing methods provided similar RMSEs to those with no  borrowing and dynamic borrowing methods under the homogeneous condition, but the performance varied under the heterogeneous condition. Specifically, under the heterogeneous condition, the aggregated data-dependent priors provided the smallest RMSE for the quadratic term of time (acceleration rate), but the largest RMSEs for the intercept (starting point) and the linear term of time (growth rate). The power prior methods provided the same or better RMSEs for the regression coefficient estimates overall.
For the variation estimates shown in Fig. 3, no borrowing, aggregated data-dependent priors, and dynamic borrowing provided similar RMSEs for student-level random variation estimates. Nevertheless, BDB under IG(1, 0.001) and Wishart(20) hyperpriors provided the smallest RMSE for school-level random intercept variation. No borrowing and AGDP provided the smallest RMSEs for the variance of residuals. Power priors, in contrast, provided the largest RMSEs for the variance of residuals. Under the heterogeneous condition, power priors yielded the largest RMSEs for all the variation estimates.
In terms of predictive performance, the Theil inequality coefficients, bias and variance proportions, RMSE between the observed and predicted reading score at the spring of 5th grade, KLD, LOOIC with one student left out at a time and LOOIC with one school left out at a time were adopted to evaluate the prediction quality of different borrowing methods under different sample size and heterogeneity conditions. As Fig. 4 shows, when the current data and the historical data were relatively homogeneous, we found that no borrowing, AGDP, and dynamic borrowing performed similarly in terms of prediction. There were minor differences in Theil's inequality coefficients, bias proportion and variance proportion, where no borrowing and AGDP performed slightly better. Based on RMSE for prediction and KLD, power priors performed worse than other borrowing methods. Pooling was slightly better than power priors, but also did not outperform no borrowing, AGDP, or dynamic borrowing methods. Across different sample sizes, the predictive performance based on the LOOIC across different borrowing methods was similar.
When the current data and the historical data were heterogeneous, as Fig. 5 shows, pooling performed worse on prediction compared to its performance under the homogeneous condition and had the largest RMSE of prediction and KLD. Power priors also had large RMSE of prediction and KLD compared to no borrowing, AGDP and dynamic borrowing. BDB and commensurate priors performed similarly to no borrowing and AGDP in terms of prediction. Similar to the homogeneous condition, the predictive performance based on the LOOIC across different sample size conditions was similar.

Conclusion
As we noted in the introduction, longitudinal data are becoming increasingly available and are powerful data collection strategies that allow for the study of important outcomes in education and the social and behavioral sciences. That said, longitudinal data are also extremely expensive to collect, and except in rare cases, few longitudinal studies exist that have gone beyond two separate cohorts. Thus, it becomes even more critical that new methodologies be developed to leverage information across longitudinal studies while accounting for the heterogeneity that can be induced by cohort effects as well as other changes in the data collection strategies across cycles.
The purpose of this study was to build on recent work by Kaplan et al. (2022), and examine a variety of methods under the umbrella of Bayesian historical borrowing with a specific focus on models for growth in longitudinal studies. Outcomes of interest in our study were the parameter recovery and predictive performance of growth models under a variety of historical borrowing methods. We utilized both a case study and a simulation study in the context of one historical cycle insofar as this is a relatively realistic situation. We note that this is in contrast to Kaplan et al. (2022), who examined Bayesian historical borrowing for cross-sectional multilevel models based on the structure of PISA and with five historical cycles.
Our findings show that in the case of one historical cycle, most methods of historical borrowing perform similarly with respect to predictive performance and parameter recovery. We note that in the present paper, and consistent with Kaplan et al. (2022), pooling and power priors performed relatively poorly across the conditions in this study, particularly when the current data and the historical data were heterogeneous. The findings regarding power priors are not surprising insofar as previous studies have shown relatively poor performance of power priors in a variety of settings (see Du et al., 2020, and references therein), though power priors were not examined in the case of longitudinal studies with a focus on prediction. The findings regarding pooling under the homogeneous condition are a bit surprising insofar as the homogeneous condition of our simulation study mimicked the desirable conditions for combining data from longitudinal studies as described in Hofer and Piccinin (2009), e.g. common time points and identical measurements. However, the current data and the historical data might still not have been homogeneous enough given that the generating model is a complex multilevel growth curve model. Overall though, the findings show that using aggregated data-dependent priors or simply using the current cycle of data with non-informative priors performed well. We speculate that this is because we only examined the realistic condition of one historical study. Note, however, that Kaplan et al. (2022) found clear benefits of Bayesian dynamic borrowing over other methods of historical borrowing in the cross-sectional multilevel case study using PISA (OECD, 2002; with five historical cycles, and so, in line with Kaplan et al. (2022), we argue that Bayesian dynamic borrowing or commensurate priors is a prudent choice for borrowing information from previous cycles of data and that sensitivity analyses examining a variety of borrowing procedures to gauge the extent of homogeneity or heterogeneity across data sets would be advisable. Also, future research should expand the current study to consider more cycles of longitudinal data. 8 It is important to point out some limitations of this study as they pertain specifically to the application of these methods to longitudinal data. First, we did not account for the sampling weights, which are critical in large-scale assessments. The problem of sample weighting in Bayesian models generally has been discussed in Gelman (2007), who declared at the time that "Survey weighting is a mess" (pg. 153). Since then, however, there have been important developments in the implementation of sampling weights for Bayesian analyses with applications to problems of survey data (see e.g. Goldstein, 2011;