In educational and psychological research, it is common to use latent factors to represent constructs. Latent factors are often established using the common factor model, which includes both exploratory and confirmatory factor models. After factor models are run and tested against empirical data, there is usually a need for further analysis that involves the effects of other covariates. For example, researchers may be interested in knowing whether the same factor structure would work for a normative sample vs. a referral sample (e.g., Parkin & Wang, 2021) or whether student sex and grade would be significant predictors of classroom engagement (e.g., Wang et al., 2014b). Within the framework of structural equation modeling (SEM), two methods are typically used to examine covariate effects on latent factors. The first is the multiple-indicator multiple-cause (MIMIC) approach. With this approach, the covariates are included in the model as predictors of the latent factors; the direct effects of covariates on the latent factors are interpreted in the same way as regression coefficients. Statistical significance and effect sizes can also be obtained. The second approach, particularly useful when the covariates are categorical variables, invokes multiple group analysis, where data are divided according to the values of the categorical variables and the equality of model parameters (e.g., factor loadings, indicator intercepts, latent factor means) across groups can be tested.
These two approaches have been widely used. With the MIMIC approach, it is easy to accommodate many covariates and both continuous and categorical covariates can be used. However, the MIMIC model assumes that latent factors are measured in the same way for different values of the covariates. Further, only linear (and variants of linear) relationships between the covariates and latent factors are allowed.
In contrast, for multiple group analysis, model parameters are allowed to vary and invariance between groups can be tested. It is also advised that measurement invariance testing precede comparisons of the groups on the latent factors (Meredith, 1993; Millsap, 1997; Rensvold & Cheung, 1998). Compared to MIMIC, multiple group analysis is typically limited to a small number of groups, although Bayesian methods have been proposed for handling many groups (Muthén & Asparouhov, 2014).
Recently, structural equation model trees (SEM Trees; Brandmaier et al., 2013) have been proposed. SEM Trees are a generalization of decision trees that build a tree structure to separate data into subsets. The same SEM model is fit to the data in each subset, but model parameters are estimated separately for each subset. The splitting of the data into subsets is based on covariates and is done recursively according to some splitting criteria and stopping rules. SEM Trees have advantages in examining covariate effects because they can handle many different types of covariates, and the relationships between the covariates and latent factors can be nonlinear. Further, it is not necessary to pre-specify these relationships, allowing data-driven exploration.
In this paper, we compare and contrast the three methods to examine the effects of covariates on the latent factor of mathematics self-concept using the U.S. eighth-grade data from the Trends in International Mathematics and Science Study (TIMSS) 2019 database (Fishbein et al., 2021).
Confirmatory factor analysis
Confirmatory Factor Analysis (CFA) is a popular measurement model used by researchers in educational, psychological, and social science fields. Under CFA, it is hypothesized that a latent factor is measured by multiple indicator variables. The latent factors typically represent unobserved constructs (e.g., motivation, engagement, attitudes) that are manifested by the observed indicator variables. One of the main advantages of CFA is that the latent factors are free from measurement error: all measurement error is assumed to be part of the observed indicator variables, and the latent factors represent the pure, shared variance among the indicators. As a result, the estimated effects of covariates on the latent factors are not attenuated by measurement error.
CFA is a type of common factor model (Brown, 2006; Thurstone, 1947). The common factor model postulates that each measured variable is a linear function of one or more common factors and a unique variable; once the common factor(s) are removed, the observed variables are uncorrelated with each other. The unique variable is a combination of measurement error and specific error that is due to the selection of the measured variable. Suppose there are data from N participants on p observed variables and the score of the ith person on the jth variable is denoted \(Y_{ij}\). The linear factor model can be written as
$$Y_{ij} = v_{j} + \lambda_{j1} \eta_{1i} + \lambda_{j2} \eta_{2i} + \cdots + \lambda_{jm} \eta_{mi} + \varepsilon_{ij}$$
In matrix form, the response vector of participant i can be written as
$$\mathbf{y}_{i} = \mathbf{v} + \boldsymbol{\Lambda} \boldsymbol{\eta}_{i} + \boldsymbol{\varepsilon}_{i}$$
(1)
where \(\mathbf{y}_{i}\) is a p × 1 vector of the p observed variables, \(\mathbf{v}\) is a p × 1 vector of item intercepts, \(\boldsymbol{\Lambda}\) is a p × m matrix of factor loadings, \(\boldsymbol{\eta}_{i} \sim N(\boldsymbol{\kappa}, \boldsymbol{\Phi})\) is an m × 1 vector of common factors, with \(\boldsymbol{\kappa}\) the m × 1 vector of factor means and \(\boldsymbol{\Phi}\) the m × m factor covariance matrix, and \(\boldsymbol{\varepsilon}_{i} \sim N(\mathbf{0}, \boldsymbol{\Theta})\) is a p × 1 vector of unique factors, with \(\boldsymbol{\Theta}\) the p × p matrix of unique variances and covariances.
Further, it is assumed that
\(E(\boldsymbol{\eta}_{i}) = \boldsymbol{\kappa}\), \(E(\boldsymbol{\varepsilon}_{i}) = \mathbf{0}\), and \(Cov(\boldsymbol{\eta}_{i}, \boldsymbol{\varepsilon}_{i}) = \mathbf{0}\).
Under these assumptions, the population mean vector \(\boldsymbol{\mu}\) and the population covariance matrix \(\boldsymbol{\Sigma}\) of the p observed variables can be written, respectively, as
$$\boldsymbol{\mu} = \mathbf{v} + \boldsymbol{\Lambda} \boldsymbol{\kappa}$$
(2)
$$\boldsymbol{\Sigma} = Cov\left( \mathbf{y}_{i} \right) = \boldsymbol{\Lambda}\, Cov(\boldsymbol{\eta}_{i})\, \boldsymbol{\Lambda}^{\prime} + Cov(\boldsymbol{\varepsilon}_{i}) = \boldsymbol{\Lambda} \boldsymbol{\Phi} \boldsymbol{\Lambda}^{\prime} + \boldsymbol{\Theta}$$
(3)
where \(\boldsymbol{\Sigma}\) is the p × p population covariance matrix of the observed variables, \(\boldsymbol{\Phi}\) is the m × m factor covariance matrix, and \(\boldsymbol{\Theta}\) is the p × p matrix of unique variances and covariances.
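To make Eqs. (2) and (3) concrete, the short base-R sketch below computes the model-implied mean vector and covariance matrix for a hypothetical one-factor model with four indicators; all parameter values are illustrative and are not taken from the TIMSS data analyzed later.

```r
# Hypothetical one-factor model with p = 4 indicators and m = 1 factor
Lambda <- matrix(c(1.0, 0.8, 1.2, 0.9), nrow = 4)   # factor loadings (p x m)
v      <- c(2.5, 3.0, 2.8, 3.1)                     # item intercepts (p x 1)
kappa  <- 0.3                                       # factor mean (m x 1)
Phi    <- matrix(1.0, 1, 1)                         # factor (co)variance (m x m)
Theta  <- diag(c(0.4, 0.5, 0.3, 0.6))               # unique variances (p x p)

mu    <- v + Lambda %*% kappa                  # Eq. (2): model-implied means
Sigma <- Lambda %*% Phi %*% t(Lambda) + Theta  # Eq. (3): model-implied covariances
mu
Sigma
```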
Because latent factors are unobserved, it is necessary to set a location and metric for each latent factor. Two methods are commonly used: (a) putting the latent factors on a scale with a mean of zero and a standard deviation of 1; and (b) choosing a marker indicator and setting its loading to 1 and its intercept to 0. A third method, the effects coding method, which imposes linear constraints on the unstandardized pattern coefficients to identify the model, can also be used (Little et al., 2006).
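As an illustration, the sketch below fits the same one-factor model in the “lavaan” package under the first two scaling choices, using lavaan's built-in HolzingerSwineford1939 data purely for demonstration (the factor and indicator names are unrelated to the TIMSS analysis in this paper).

```r
library(lavaan)

# One-factor model for three visual-ability indicators in the built-in data
model <- ' visual =~ x1 + x2 + x3 '

# (a) Standardized latent variable: factor variance fixed to 1
fit_std <- cfa(model, data = HolzingerSwineford1939, std.lv = TRUE)

# (b) Marker-indicator method (lavaan default): loading of x1 fixed to 1
fit_marker <- cfa(model, data = HolzingerSwineford1939)

summary(fit_std, standardized = TRUE)
```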
Covariate effects on latent factors
Whereas CFA, as a measurement technique, is often used for scale development and validation (typically together with exploratory factor analysis, another common factor model; e.g., Pratscher et al., 2019), it is also widely used in examining covariate effects. These covariates could represent demographic differences among individuals, or they could be attitudinal, psychological, situational, or trait variables. For example, the researcher may be interested in whether age is related to the latent factor means (e.g., Frisby & Wang, 2016). Covariates could be observed variables, or they could themselves be unobserved and constructed from measurement models such as CFA. For this study, only observed covariates are considered.
MIMIC approach to covariate effects
For (observed) covariate effects on latent factors, based on Eq. (1), we further have
$$\boldsymbol{\eta}_{i} = \boldsymbol{\alpha} + \boldsymbol{\Gamma} \mathbf{x}_{i} + \boldsymbol{\zeta}_{i}$$
(4)
where \(\mathbf{x}_{i}\) is a q × 1 vector of observed covariates, \(\boldsymbol{\Gamma}\) is an m × q matrix of regression coefficients representing the covariate effects on the latent factors, \(\boldsymbol{\zeta}_{i} \sim N(\mathbf{0}, \boldsymbol{\Psi})\) is an m × 1 vector of disturbances, and \(\boldsymbol{\alpha}\) is an m × 1 vector of intercepts of the latent factors that are typically set to zero.
The model parameter vector then is \(\boldsymbol{\theta} = (\mathbf{v}, \boldsymbol{\Lambda}, \boldsymbol{\kappa}, \boldsymbol{\Phi}, \boldsymbol{\Theta}, \boldsymbol{\Gamma}, \boldsymbol{\Psi}, \boldsymbol{\alpha})\). For model identification, it is often the case that \(\mathrm{diag}(\boldsymbol{\Phi}) = \mathbf{I}\), \(\boldsymbol{\kappa} = \mathbf{0}\), \(\mathbf{v} = \mathbf{0}\), and \(\boldsymbol{\alpha} = \mathbf{0}\) (see Wu & Estabrook, 2016).
The MIMIC model is a single-group analysis and a special type of the full SEM model. In a MIMIC model, covariates directly affect the latent factor(s) and the path coefficients from the covariates to the latent factor(s) represent their effects. With a categorical covariate with more than two categories, some coding scheme (e.g., dummy coding) is used to create dummy variables. The effects of dummy variables on the latent factor(s) represent group differences, controlling for the other covariate(s).
The MIMIC approach is a direct extension of the linear regression model. The regular assumptions for regression models (independence of observations, linearity, and no correlations between covariates and the disturbance) also apply to the MIMIC model. A practical difference between the MIMIC model and the regression model is that the coefficients from dummy variables to the latent factor(s) in the MIMIC model should be standardized with respect to the latent factors because the scale of the latent factors is arbitrary, whereas in regression the unstandardized coefficients reflect group comparisons on the dependent variable.
The MIMIC approach is a standard method in SEM software packages such as Mplus (Muthén & Muthén, 1998–2017) and the “lavaan” R package (Rosseel, 2012).
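A minimal lavaan sketch of a MIMIC model is given below, again using the built-in HolzingerSwineford1939 data for illustration only; the visual factor is regressed on sex and age in years, standing in for the kinds of covariates discussed above.

```r
library(lavaan)

mimic_model <- '
  visual =~ x1 + x2 + x3     # measurement part
  visual ~ sex + ageyr       # structural part: covariate effects on the factor
'

# std.lv = TRUE identifies the factor by fixing its (residual) variance to 1
fit_mimic <- sem(mimic_model, data = HolzingerSwineford1939, std.lv = TRUE)

# Coefficients for sex and ageyr are interpreted as regression coefficients;
# the standardized solution puts them on the metric of the latent factor
summary(fit_mimic, standardized = TRUE)
```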
Multiple group confirmatory factor analysis
When the covariates are categorical variables with a relatively small number of categories, their effects can be, and often are, examined using multiple group CFA (MG-CFA). The advantage of using MG-CFA is that cross-group equality of different types of parameters (e.g., factor means, factor variances, and factor covariances) can be tested. In addition to structural level parameters, which involve the latent factors and the relationships between them, measurement level parameters, which represent relationships between the latent factors and the observed indicator variables, are often investigated as well. There is a large body of literature on measurement invariance under the CFA framework, both in methodological work (e.g., Liu et al., 2017; Meredith, 1993; Millsap, 2011) and in empirical applications (e.g., Chan et al., 2019).
MG-CFA for covariate effects can be thought of as an extension of the analysis of variance (ANOVA) for group differences on observed means. Population parameters for different groups are specified and tested, typically through null hypothesis significance testing (NHST). For ANOVA, the population parameters for NHST are the means, and the testing assumes that the groups have the same variance on the outcome variable in the population. When MG-CFA is used for covariate effects, the population parameters to be tested under NHST usually include mean differences on the latent factors (for identification purposes, the latent factor means for a reference group are usually constrained to zero), factor variances and covariances; however, other parameters can also be tested.
For MG-CFA, group sizes should be large enough to run CFA using data from individual groups. In addition, when there are many groups, even small differences between model parameters would be statistically significant, although Bayesian methods could be used for testing measurement invariance among many groups (Muthén & Asparouhov, 2014). When the covariate is continuous, some categorization is necessary before conducting MG-CFA.
When a covariate x represents group membership, instead of explicitly modeling the effect of x on the latent factors as in Eq. (4), the covariate is used to subset the data in MG-CFA. When there are multiple covariates, the researcher can either run multiple MG-CFA models, each with a single covariate, or construct groups based on combinations of the covariates before conducting MG-CFA. The latter approach may suffer from small group sizes when the data are divided into many cells. With G groups, Eqs. (5) and (6) show the population mean vector and the population covariance matrix of the p observed indicator variables, respectively, for a specific group g.
$$\boldsymbol{\mu}_{g} = \mathbf{v}_{g} + \boldsymbol{\Lambda}_{g} \boldsymbol{\kappa}_{g}$$
(5)
$$\boldsymbol{\Sigma}_{g} = \boldsymbol{\Lambda}_{g} \boldsymbol{\Phi}_{g} \boldsymbol{\Lambda}_{g}^{\prime} + \boldsymbol{\Theta}_{g}$$
(6)
The parameter vector \({{\varvec{\uptheta}}}\) is expanded to include parameters for multiple groups. For model identification, it is necessary to constrain parameters for each group (Millsap, 2011). When there are no equality constraints across groups, identification constraints for each group are similar to those for single group CFA (e.g., identifying the scale of latent factors). With equality constraints across groups (e.g., equal factor loadings, equal item intercepts), identification constraints are typically different for one group (e.g., the first group) compared to the other groups.
The biggest advantage of using MG-CFA is testing the equality of different types of parameters across groups (i.e., invariance testing). In fact, invariance testing has been increasingly used in the development and validation of scales that involve CFA (e.g., Wang et al., 2014b). Like the MIMIC approach, MG-CFA is a standard method in SEM software packages such as Mplus (Muthén & Muthén, 1998–2017) and the “lavaan” R package (Rosseel, 2012).
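The sketch below fits the usual sequence of increasingly constrained MG-CFA models (configural, metric, scalar) in lavaan and compares them with likelihood ratio tests, again using the built-in HolzingerSwineford1939 data with school as the grouping variable purely for illustration.

```r
library(lavaan)

model <- ' visual =~ x1 + x2 + x3 '

# Configural: same factor structure, all parameters free across groups
fit_config <- cfa(model, data = HolzingerSwineford1939, group = "school")

# Metric: factor loadings constrained equal across groups
fit_metric <- cfa(model, data = HolzingerSwineford1939, group = "school",
                  group.equal = "loadings")

# Scalar: loadings and intercepts equal; latent means can then be compared
fit_scalar <- cfa(model, data = HolzingerSwineford1939, group = "school",
                  group.equal = c("loadings", "intercepts"))

# Chi-square difference tests between the nested models
lavTestLRT(fit_config, fit_metric, fit_scalar)
```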
Decision trees
Decision trees, also called trees, classification and regression trees (CART; Breiman et al., 2017; Loh, 2011), or recursive partitioning, are methods to split (i.e., partition) the space of covariates into subsets. Response values are similar within each subset but different between subsets. The partitioning is repeated recursively until no further split can be made according to some stopping criteria. When the outcome is a categorical variable, classification trees are built; when the outcome is numeric, regression trees are built.
Decision trees have been extended to incorporate parametric models (model-based recursive partitioning, MOB; Zeileis et al., 2008). With MOB, a stochastic model (e.g., a regression model), called the template model, is assumed, and the sample is split into groups with different values of the model parameters. For example, if the template model is a regression model, the intercept and slopes may vary between subgroups according to some covariates. When regressing achievement on motivation, for instance, the intercept and slope may differ for students with different socioeconomic status (SES); a tree could then use SES to divide participants based on differences in the regression model parameters, as illustrated in the sketch below.
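A minimal MOB sketch with the “partykit” package is shown below; the achievement, motivation, and SES variables are simulated here solely to mirror the hypothetical example in the text.

```r
library(partykit)
set.seed(123)

# Simulated illustration: the motivation-achievement slope differs by SES group
n   <- 600
ses <- factor(sample(c("low", "high"), n, replace = TRUE))
motivation  <- rnorm(n)
achievement <- ifelse(ses == "high", 0.8, 0.3) * motivation + rnorm(n)
dat <- data.frame(achievement, motivation, ses)

# Template model: achievement regressed on motivation;
# variables after "|" are candidate partitioning covariates
mob_fit <- lmtree(achievement ~ motivation | ses, data = dat)

# The terminal nodes contain separate regression coefficients
print(mob_fit)
plot(mob_fit)
```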
MOB has been used to combine different stochastic models (e.g., Item Response Theory models) with decision trees (Brandmaier et al., 2013; Jeon & De Boeck, 2019; Merkle et al., 2014; Spratto et al., 2021; Wang et al., 2014a). SEM Trees (Brandmaier et al., 2013) combine recursive partitioning and SEM. SEM Trees use the likelihood ratio test or score-based tests to split observations based on covariates.
SEM Trees
For each covariate, data are split along all possible points of that covariate to create homogeneous groups according to some criteria (typically the likelihood but score-based tests are also available). These splits are binary splits, meaning that when a split happens, data are split into two groups (i.e., two nodes). For each candidate split, the log-likelihood values before and after the potential split are obtained. Because the model before the split is nested within the model after the split, a likelihood ratio test can be used to compare models. The partition of the covariate that leads to the greatest improvement in the model is retained. The process continues until a stopping criterion is reached. Stopping criteria could be a maximum tree depth, a minimum number of observations in a node, the p-value for the likelihood ratio test, etc.
There are a few different packages that can be used to implement SEM Trees. The “semtree” R package (Brandmaier et al., 2013) implements a tree algorithm designed specifically for SEM. The package is built on the “OpenMx” package (Boker et al., 2011), a flexible R package that allows estimation of a wide variety of advanced multivariate statistical models, including SEM. The “semtree” package can also be used together with “lavaan” (Rosseel, 2012), one of the most popular R packages for SEM. Another R package, “partykit” (Zeileis et al., 2008), provides a general framework for MOB. To implement SEM Trees with the “partykit” package, some preliminary work is necessary to set up the SEM model, which can be specified with “lavaan”.
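A minimal “semtree” sketch is given below, assuming a lavaan template model fitted to the built-in HolzingerSwineford1939 data; by default, columns of the data frame not used as model variables serve as candidate splitting covariates, and splitting and stopping options can be adjusted through semtree.control() (details should be checked against the package documentation).

```r
library(lavaan)
library(semtree)

# Keep the indicators plus a few candidate covariates (illustration only)
dat <- HolzingerSwineford1939[, c("x1", "x2", "x3", "sex", "ageyr", "school")]

# Template CFA model fitted with lavaan
model <- ' visual =~ x1 + x2 + x3 '
fit_template <- cfa(model, data = dat, std.lv = TRUE)

# Grow the SEM Tree: sex, ageyr, and school are candidate splitting covariates
tree <- semtree(model = fit_template, data = dat)
plot(tree)
```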
Both “semtree” and “partykit” are based solely on the R language. Another package, “MplusTrees” (Serang et al., 2021), is based on Mplus (Muthén & Muthén, 1998–2017) and the “MplusAutomation” R package (Hallquist & Wiley, 2018), which serves as an interface between Mplus and R. Mplus Trees, taking advantage of the comprehensive Mplus software, allows users to specify complex SEM models using regular Mplus syntax. The splitting procedure for growing the tree is governed by the complexity parameter (cp) because of the package’s reliance on the “rpart” package (Therneau & Atkinson, 1997). cp reflects the relative improvement in model fit required for a split to be retained: if a candidate split improves the -2logL of the root node by a factor of at least cp, the split is made. The smaller the cp, the more complex the final tree is likely to be. Other stopping criteria, such as the minimum number of observations within a node needed to attempt a split, the minimum number of observations within a terminal node, the maximum depth of the tree, and the p-value for likelihood ratio tests, can also be specified.
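For completeness, a rough sketch of an Mplus Trees analysis is shown below; it assumes Mplus is installed, uses MplusAutomation's mplusObject() to hold the Mplus model syntax, and borrows the HolzingerSwineford1939 data again for illustration. The MplusTrees() argument names (script, rPartFormula, group, control) reflect our reading of the package documentation and should be verified before use.

```r
library(lavaan)            # only for the built-in example data
library(MplusAutomation)
library(MplusTrees)
library(rpart)

dat <- lavaan::HolzingerSwineford1939

# Mplus measurement model held in an mplusObject (Mplus must be installed)
script <- mplusObject(
  TITLE = "SEM Tree template model;",
  MODEL = "visual BY x1 x2 x3;",
  usevariables = c("x1", "x2", "x3"),
  rdata = dat
)

# Grow the tree: sex, ageyr, and school are candidate splitting covariates;
# cp sets how much improvement in fit a split must yield to be retained
fit_tree <- MplusTrees(script, dat, group = ~id,
                       rPartFormula = ~ sex + ageyr + school,
                       control = rpart.control(cp = 0.01))
fit_tree
```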