Using country-specific Q-matrices for cognitive diagnostic assessments with international large-scale data

Abstract

In cognitive diagnosis assessment (CDA), the impact of misspecified item-attribute relations (or “Q-matrix”) designed by subject-matter experts has been a great challenge to real-world applications. This study examined parameter estimation of the CDA with the expert-designed Q-matrix and two refined Q-matrices for international large-scale data. Specifically, the G-DINA model was used to analyze TIMSS data for Grade 8 for five selected countries separately; and the need of a refined Q-matrix specific to the country was investigated. The results suggested that the two refined Q-matrices fitted the data better than the expert-designed Q-matrix, and the stepwise validation method performed better than the nonparametric classification method, resulting in a substantively different classification of students in attribute mastery patterns and different item parameter estimates. The results confirmed that the use of country-specific Q-matrices based on the G-DINA model led to a better fit compared to a universal expert-designed Q-matrix.

Introduction

International comparative assessments such as PISA (Programme for International Student Assessment) or TIMSS (Trends in International Mathematics and Science Study) have the power to influence educational policy and practice to a large extent (Sedat & Arican, 2015). Item response theory (IRT; Baker, 2001) has traditionally been used to analyze such large-scale assessments and to provide information about students’ abilities. This method summarizes students’ overall ability in a particular subject (e.g., mathematics, reading, or science) by means of a single ability score (Chen, 2017; Nájera et al., 2019). Student achievement is then compared across countries and an international benchmark is set. Unfortunately, the general ability score provides neither teachers nor policymakers with the fine-grained diagnostic information necessary to determine if students have mastered a particular domain. This makes it difficult to implement a targeted educational strategy based on international large-scale assessments. Cognitive diagnosis assessment (CDA) makes it possible to interpret students’ assessment outcomes in terms of fine-grained attributes (see Footnote 1) that are directly related to students’ success in a given subject domain, so that statistical data analysis can provide richer information about which attributes students have mastered (Jurich & Bradshaw, 2014).

The Generalized Deterministic Inputs, Noisy “And” Gate model (G-DINA; de la Torre, 2011), one of the widely used psychometric models for CDA, can be used for just this purpose. The G-DINA aims to measure to what extent students master a set of cognitive attributes (e.g., fractions, proportions, and decimals as a fine-grained mathematics attribute) in order to improve educational policy and practice. Some studies have already employed the G-DINA to analyze data from international comparative tests, such as PISA (Jia et al., 2021; Wu et al., 2020). However, before the G-DINA can be used on international assessments to perform the CDA, a content analysis of the test must occur (von Davier & Lee, 2019). Domain experts conduct this analysis to identify a set of related attributes or skills that measure a few broad domains and to define each item by the subset of attributes it requires (Nájera et al., 2019). Researchers refer to this internal structure, in which the item-attribute relations are specified, as a “Q-matrix” (Tatsuoka, 1984). This two-dimensional matrix, with items defining the rows and attributes defining the columns, contains only ones (the attribute is required to solve the item) and zeros (the attribute is not required to solve the item).

Currently, using the Q-matrix for international tests has two major drawbacks. First, the Q-matrix is designed on the basis of expert judgement. In other words, content experts specify the cognitive attributes and their relations with the items. The expert-designed Q-matrix is not always perfect and may contain misspecifications. Because the content analysis underlying the Q-matrix relies on the fallible judgements of subject-matter experts (Chen, 2017; Nájera et al., 2019; Terzi & de la Torre, 2018), misspecifications of the Q-matrix could have serious consequences for the estimation of students’ attribute patterns and, consequently, for the interpretation of the data (de la Torre & Chiu, 2016; Köhn & Chiu, 2018; Nájera et al., 2019). Additionally, experts from different countries may disagree on the relationship between items and attributes because of their different educational backgrounds, country-specific curricula, and teaching situations. Those disagreements may produce uncertainty about the Q-matrix, which leaves room for further improvement. For these reasons, researchers have proposed various refinement methods (Chiu, 2013). This study addresses the potential impact of a misspecified Q-matrix on the CDA in TIMSS and explores the performance of refined Q-matrices.

Second, the same Q-matrix is specified for every participating country. Despite prior work on this issue, little evidence supports the use of a common Q-matrix for different population groups. That is, we cannot simply assume that the expert-designed universal Q-matrix from the TIMSS will perform similarly when used to analyze data from different countries. Further, we seek to determine if a particular refinement approach performs better across various countries.

In the following section, we describe the conceptual background of the G-DINA and two different Q-matrix refinement methods. Next, we apply these techniques to study whether different Q-matrix refinement approaches come to different solutions within and between countries, and what solution provides the best model fit. Finally, we investigate the usefulness of country-specific Q-matrices.

G-DINA

The G-DINA belongs to the family of Cognitive Diagnostic Models (CDMs; Rupp et al., 2010), which is considered as a special case of Latent Class Models (LCMs; Hagenaars & McCutcheon, 2002) where the attribute patterns are modelled to categorize students by means of latent class variables. To be specific, the attribute patterns of the students are conceptually unobservable and therefore have to be measured by their observed responses to a set of items in a test (Chen, 2017). CDMs are confirmatory models in nature because the relationships between the categorical latent variables (attributes) and the test items are defined a priori in a Q-matrix (Ravand & Robitzsch, 2015). There are a number of different modelling approaches within CDMs that have been in use, depending on how relationships between attributes and item responses are modelled and how attributes themselves are combined (Ravand & Robitzsch, 2015; Rupp et al., 2010).

Many CDMs assume certain relationships between items and attributes, such as the Deterministic, Inputs, Noisy, “And” Gate model (DINA; de la Torre, 2009; Junker & Sijtsma, 2001) which assumes that attributes are conjunctive, the Deterministic, Inputs, Noisy, “Or” Gate model (DINO; Templin & Henson, 2006) which assumes that attributes are disjunctive, and so forth. However, sometimes, the attribute relationship is unclear before the model application. In that respect, the G-DINA seems to be a reasonable choice to fit real-life data because it does not have constraints for the relationship of attributes (i.e., conjunctive, disjunctive, and additive assumption for attributes; de la Torre, 2011).

A saturated G-DINA for dichotomous responses with identity link is expressed as follows (de la Torre, 2011; von Davier & Lee, 2019). The model estimates the probability of success for item j under different attribute patterns. k refers to a required attribute based on the Q-matrix and \(K_{j}^{*}\) is the total number of required attributes for item j. \(\boldsymbol{\alpha}_{lj}^{*}\) denotes the reduced attribute vector for item j, \(l = 1, \dots, 2^{K_{j}^{*}}\), which keeps only required attributes. \(P\left(X_{j}=1 \mid \boldsymbol{\alpha}_{lj}^{*}\right)\) denotes the probability of the correct answer to item j conditional on attribute pattern \(\boldsymbol{\alpha}_{lj}^{*}\). For instance, in a test designed for measuring four attributes, answering item j correctly needs the 2nd, 3rd, and 4th attribute (i.e., \(K_{j}^{*}=3\)), and the reduced attribute pattern is denoted by \(\boldsymbol{\alpha}_{lj}^{*}=\left(\alpha_{l2},\alpha_{l3},\alpha_{l4}\right)^{\prime}\).

$$P\left(X_{j}=1 \mid \boldsymbol{\alpha}_{lj}^{*}\right)=\delta_{j0}+\sum_{k=1}^{K_{j}^{*}}\delta_{jk}\alpha_{lk}+\sum_{k^{\prime}=k+1}^{K_{j}^{*}}\sum_{k=1}^{K_{j}^{*}-1}\delta_{jkk^{\prime}}\alpha_{lk}\alpha_{lk^{\prime}}+\dots+\delta_{j12\dots K_{j}^{*}}\prod_{k=1}^{K_{j}^{*}}\alpha_{lk} \tag{1}$$

In addition, \(\delta_{j0}\) is the intercept indicating the baseline probability (i.e., the success probability without any required attribute being mastered); \(\delta_{jk}\) is the main effect, namely the change in probability when students master a single attribute \(\alpha_{k}\); \(\delta_{jkk^{\prime}}\) is the first-order interaction effect due to \(\alpha_{k}\) and \(\alpha_{k^{\prime}}\), indicating the change in probability with the mastery of two attributes beyond their main effects; \(\delta_{j12\dots K_{j}^{*}}\) is the highest-order interaction effect due to the mastery of all required attributes. The intermediate higher-order interaction effects are part of the saturated model but are omitted from the formula for brevity (denoted by three dots). Parameters of the G-DINA are estimated by an expectation–maximization implementation of marginalized maximum likelihood estimation (de la Torre, 2011).
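To make Eq. (1) concrete, the following minimal Python sketch evaluates the identity-link success probability for every reduced attribute pattern of a single item. The function name `gdina_prob`, the dictionary layout of the effects, and all parameter values are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
from itertools import combinations, product

def gdina_prob(alpha, delta):
    """Success probability under the saturated G-DINA with identity link.

    alpha: tuple of 0/1 mastery indicators for the item's required attributes.
    delta: dict mapping a tuple of attribute positions to its effect;
           () is the intercept, (0,) a main effect, (0, 1) an interaction.
    """
    prob = 0.0
    for size in range(len(alpha) + 1):
        for subset in combinations(range(len(alpha)), size):
            # A term contributes only when every attribute in it is mastered,
            # which mirrors the products of alpha values in Eq. (1).
            if all(alpha[k] == 1 for k in subset):
                prob += delta.get(subset, 0.0)
    return prob

# Hypothetical effects for an item that requires two attributes.
delta = {(): 0.2, (0,): 0.1, (1,): 0.2, (0, 1): 0.4}
probs = {a: gdina_prob(a, delta) for a in product((0, 1), repeat=2)}
```

With these illustrative values, a student who has mastered neither attribute succeeds with the baseline probability, and the interaction term applies only to students who have mastered both attributes.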

Q-matrix refinement methods

To handle uncertainty about some of the item-attribute relations in the Q-matrix, several studies proposed parametric (de la Torre & Chiu, 2016) and nonparametric (Chiu, 2013; Desmarais & Naceur, 2013) approaches to validate and refine the initial design. It is important to note that these methods are all confirmatory in nature in the sense that they refine the Q-matrix initially designed by experts (Nájera et al., 2019). Furthermore, these refinement methods only handle misspecifications of how items are linked to the attributes (misspecification of rows in the Q-matrix or q-vectors) and not misspecification of the set of underlying attributes (misspecification of the columns of the Q-matrix) (Chiu, 2013). Lastly, these refinement methods are data-driven: they use students’ answers to the test to construct a refined Q-matrix. A Q-matrix refined by the same method but using a different dataset (e.g., the TIMSS 2011 8th grade mathematics scores from a different country) is therefore likely to be different.

Stepwise validation method

The stepwise validation method for Q-matrix refinement was proposed by Ma and de la Torre (2020a, b). It combines the G-DINA Discrimination Index (GDI; de la Torre & Chiu, 2016) and the Wald statistic based on the G-DINA given an expert-defined Q-matrix, and it can be regarded as an extended version of the GDI method. The method was originally designed for graded response data in tandem with the sequential G-DINA. For dichotomous items, it can still be applied based on the G-DINA. Compared to other Q-matrix refinement methods, the stepwise validation method does not require assumptions about the processing function, and because it includes the Wald statistic it can take the estimation error of the item parameters into account (Ma & de la Torre, 2020a, b). The mechanism of this method can be summarized in two steps. The first step is to select required attributes by using the GDI, and the second step is to compare candidate attributes based on the Wald statistic. The criterion used with the GDI for selecting the recommended q-vector is the Proportion of Variance Accounted For (PVAF). de la Torre and Chiu (2016) recommended 0.95 as the rule of thumb for PVAF. Among the q-vectors whose PVAF is larger than 0.95, the one with the lowest number of attributes is recommended.

The procedure is explained in detail here, and a flowchart can be consulted in Appendix 1 (Fig. 3). For a given item, two sets of attributes are defined: a set A of all required attributes and a set B of all target (or candidate) attributes that need to be tested. A q-vector search bank C with all single-attribute competing q-vectors is defined as well. Initially, A is an empty set and B is a set with all attributes. The first step is to replace the provisional q-vector in the expert-designed Q-matrix with the competing q-vector in C for each target attribute in B, and to calculate the corresponding PVAF (Ma & de la Torre, 2020a, b). The target attribute with the highest PVAF is defined as a required attribute and is moved from B to A. The second step is to examine whether the q-vector with the required attributes from A is recommended by the GDI (i.e., PVAF > 0.95; Ma & de la Torre, 2020a, b). When the PVAF of the q-vector with the required attributes from A is higher than 0.95, the validation process stops, which means that the required attributes in A are validated. If not, the search bank C is updated so that each competing q-vector contains all required attributes in A and one target attribute in B. This yields several competing q-vectors that share the required attributes in A but differ in the target attribute from B. Then, tests based on the Wald statistic are performed to examine whether the target attributes are necessary for the competing q-vectors. If the Wald statistic suggests that none of the target attributes is necessary, the validation procedure stops. Otherwise, at least one target attribute can be recommended. Among them, the one in the competing q-vector with the highest PVAF is regarded as a required attribute and is moved from B to A. After that, the required attributes in A are examined by the Wald statistic, and any unnecessary one is moved from A back to B. The procedure is iterated until no attributes can be added or removed between set A and set B (Ma & de la Torre, 2020a, b).
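The PVAF screening used in the GDI step can be sketched in a few lines of Python. The sketch assumes the usual definition of the GDI as the weighted variance of an item's success probabilities across the latent groups induced by a candidate q-vector, with PVAF as the ratio of that variance to the one obtained when all attributes are required. It covers only this screening step, not the Wald tests or the full stepwise search, and the names `gdi` and `pvaf` as well as all numbers are hypothetical.

```python
from itertools import product

def gdi(q, patterns, weights, probs):
    """Weighted variance of success probabilities across the latent groups
    induced by candidate q-vector q. Weights are assumed to sum to 1."""
    groups = {}
    for pattern, w, p in zip(patterns, weights, probs):
        # Collapse each full pattern onto the attributes that q requires.
        key = tuple(a for a, needed in zip(pattern, q) if needed)
        total_w, total_wp = groups.get(key, (0.0, 0.0))
        groups[key] = (total_w + w, total_wp + w * p)
    mean_p = sum(wp for _, wp in groups.values())
    return sum(w * (wp / w - mean_p) ** 2 for w, wp in groups.values() if w > 0)

def pvaf(q, patterns, weights, probs):
    """Share of the maximum GDI (all attributes required) captured by q."""
    full = tuple(1 for _ in q)
    return gdi(q, patterns, weights, probs) / gdi(full, patterns, weights, probs)

# Hypothetical item in a two-attribute test whose success depends
# on the first attribute only.
patterns = list(product((0, 1), repeat=2))   # (0,0), (0,1), (1,0), (1,1)
weights = [0.25, 0.25, 0.25, 0.25]           # assumed pattern proportions
probs = [0.2, 0.2, 0.8, 0.8]                 # assumed success probabilities

pvaf_first = pvaf((1, 0), patterns, weights, probs)
pvaf_second = pvaf((0, 1), patterns, weights, probs)
```

In this toy case the single-attribute q-vector (1, 0) already accounts for all of the variance, so it would pass the 0.95 cut-off, whereas (0, 1) accounts for none of it.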

Some limitations of this method have been identified. In the proposed algorithm, the cut-off value for PVAF in the GDI step is fixed at 0.95, which has been criticized. Liu (2015) and Wang et al. (2018) argued that the cut-off value should be adjusted based on sample size. Another limitation concerns the Wald statistic. As Ma and de la Torre (2020a, b) noted, the Wald test is an important component of the stepwise validation method, and its performance can be further improved by using a better-estimated variance–covariance matrix (Liu et al., 2019).

Chiu’s nonparametric classification method

Chiu (2013) also proposed a method to identify and correct misspecified q-vectors in a Q-matrix. The method is based on the nonparametric classification method and comparisons of the residual sum of squares (RSS) between the observed and predicted responses among all the possible Q-matrices given an expert-based Q-matrix. The algorithm consists of various steps. A flowchart is added in Appendix 1 (Fig. 4) to clarify the different steps of the algorithm (Chiu, 2013). The algorithm begins by selecting the item with the highest RSS, which is most likely to be misspecified, and whose q-vector should therefore be updated. Then, the algorithm searches over all possible q-vectors and replaces the q-vector under consideration with the one with the lowest RSS. The algorithm is iterative and stops when all items have been visited and the RSS of each item hardly changes anymore (Chiu, 2013). An advantage of this method, compared to model-based methods, is that it does not rely on the model parameters of CDMs when optimizing the algorithm. Furthermore, the approach guarantees good student classification even when the true CDM underlying the observed item responses is unknown. Performance, effectiveness, efficiency, and applicability were demonstrated through simulation studies by Chiu (2013).
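The RSS comparison for a single item can be illustrated with a short Python sketch. It assumes the conjunctive ("AND") ideal response and treats the students' attribute patterns as already given, whereas the actual method re-estimates them by nonparametric classification in each iteration. The function names and the toy data are hypothetical.

```python
from itertools import product

def ideal_response(alpha, q):
    """Conjunctive ideal response: 1 only if every required attribute is mastered."""
    return int(all(a >= needed for a, needed in zip(alpha, q)))

def item_rss(responses, alphas, q):
    """Residual sum of squares between observed and ideal responses for one item."""
    return sum((x - ideal_response(a, q)) ** 2 for x, a in zip(responses, alphas))

def best_q_vector(responses, alphas, n_attributes):
    """Search all non-zero q-vectors and return the one with the lowest RSS."""
    candidates = [q for q in product((0, 1), repeat=n_attributes) if any(q)]
    return min(candidates, key=lambda q: item_rss(responses, alphas, q))

# Hypothetical data: four students, two attributes, one item.
alphas = [(0, 0), (1, 0), (0, 1), (1, 1)]   # assumed attribute patterns
responses = [0, 0, 1, 1]                    # observed answers to the item

chosen = best_q_vector(responses, alphas, 2)
```

Here only students who master the second attribute answer correctly, so the search settles on the q-vector (0, 1), which reproduces the observed responses without residuals.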

One major limitation of this method is that it is unable to handle missing data in the dataset. Because of the booklet design of many large-scale assessments (including TIMSS), the missingness by design is a common feature of these studies. Thus, we need to impute all the missing data before Chiu’s method can be performed. Second, Chiu’s method is a nonparametric method while parametric models should provide more powerful results when the distributional assumptions are not violated, especially for large samples (Terzi & de la Torre, 2018).

The overarching goal of this study is to explore the adequateness of (1) the selected Q-matrix refinement techniques and (2) the country specificity of Q-matrices in the case of the TIMSS 2011 8th grade mathematics test. This leads to the following research questions:

  • Do the country-specific refined Q-matrices offer a better model fit than the original Q-matrix designed by domain experts? If so, is there a particular refinement method that performs better (or worse) than other methods?

  • Does using the country-specific Q-matrices with the best model fit alter the interpretation of diagnostic assessments for the TIMSS 2011 eighth-grade mathematics assessment? If so, does this impact differ across countries?

Materials and methods

TIMSS 2011 8th grade mathematics

For this study, we used data from the TIMSS 2011 eighth-grade mathematics assessment. Specifically, we used student responses to 89 items that were released in the TIMSS database. They comprised 48 multiple choice items, 32 open-ended questions, and 9 constructed response questions. A score of 1 was given to a completely correct answer and a score of 0 was given to a partly correct or wrong answer. Omitted items were scored as incorrect (0) and items missing by design were scored as non-available (NA). Two criteria were used to select countries in the database: (1) the TIMSS 2011 results of the country were reliably measured (according to Mullis et al., 2012); and (2) the selected countries are located on different continents. Consequently, we chose Finland (Europe), the USA (North America), Singapore (Asia), Australia (Oceania), and Tunisia (Africa) for the study. Table 1 provides the sample sizes for the five countries.

Table 1 Number of (Selected) Students of TIMSS 2011 8th Grade Mathematics Test per Country

Q-matrix for 8th grade TIMSS 2011 mathematics

In order to analyze data with the G-DINA, subject-matter experts must prepare a Q-matrix for the test items. Details about constructing Q-matrices with regard to TIMSS 2007 and 2011 are presented in previous studies (Johnson et al., 2013). In essence, the procedure begins with four content domains specified in the original TIMSS framework (i.e., number, algebra, geometry, and data & chance) where the domains are further described by multiple topic areas and the accompanying 55 objectives that are a part of math curricula from a majority of countries. For the TIMSS 2011 8th grade mathematics assessment, the experts combined closely related objectives and defined a total of nine attributes measured by the 89 items. The list of attributes and the number of items involved in the Q-matrix can be found in Table 2. Descriptions of those attributes are available in Johnson et al. (2013) (see the excerpt in Appendix 2). The Q-matrix for those released item sets in the TIMSS 2011 8th grade mathematics assessment is available in Park et al. (2017) (see Footnote 2). Figure 1 gives an example of a multiple-choice question among the released items. According to the Q-matrix, this item requires mastery of two attributes, “expressions, equations and functions” and “measurement”, to be answered correctly.

Table 2 Attributes and Frequency of attributes in the Original Q-matrix

Analyses

Q-matrix refinements and G-DINA

The research questions were investigated empirically by fitting the G-DINA using the original and refined Q-matrices. First, the expert-designed Q-matrix (for the 89 selected items) was validated and refined based on the five selected countries’ data as well as on the combined data from the five countries by two selected refinement methods: (1) the stepwise validation method (further referred to as the stepwise method) and (2) Chiu’s nonparametric classification method (further referred to as Chiu’s method). Next, the Q-matrices of each of the five countries and of the combined data were compared with the expert-designed Q-matrix. Thereafter, the analysis of the G-DINA with the different Q-matrices (i.e., one expert-designed Q-matrix and two refined Q-matrices based on two refinement methods) was conducted for each country. To avoid using the same data for refining the Q-matrix and for estimating model-fit indices, and to make the evaluations stable and reliable, the data of each country were divided into two parts: a random subset of 50% of the data was used for Q-matrix refinement and the other 50% for the G-DINA estimation. This operation was repeated ten times and the average value for each model-fit index was used as the evidence for conclusions. We used R 4.1.2 (R Core Team, 2021) with the G-DINA package (version 2.8.8; Ma & de la Torre, 2020a, b) and the NPCD package (version 1.0–11; Zheng et al., 2019) to perform the Q-matrix refinements and G-DINA analysis.
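The repeated split-half design can be sketched as follows. The analyses themselves were run in R; this Python sketch only illustrates the index bookkeeping, and the function name `repeated_half_splits`, the seed, and the toy sample size are assumptions made for the example.

```python
import random

def repeated_half_splits(n_students, n_repeats=10, seed=0):
    """Return n_repeats random (refinement, estimation) index pairs,
    each pair splitting the students into two halves."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_students))
        rng.shuffle(idx)
        half = n_students // 2
        # First half refines the Q-matrix, second half fits the model.
        splits.append((sorted(idx[:half]), sorted(idx[half:])))
    return splits

splits = repeated_half_splits(n_students=8)
```

Each model-fit index would then be computed on the estimation half of every split and averaged over the ten repetitions.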

Handling missing data

Since Chiu’s method is unable to handle missing data, we needed to replace the missing data with substitute values. To this end, the common imputation method of Predictive Mean Matching (PMM; Little, 1988; Rubin, 1986) was used for each column of the data. The basic idea of PMM is that, for each missing value, the method forms a small set of candidate observed values and draws one of them to fill the corresponding missing cell. By using PMM, all missing entries in the different datasets could be replaced. It is worth noting that most of the missing values in the datasets were in fact produced by the test design (i.e., booklet format), and such planned missing values are considered missing completely at random, so the analysis based on imputed data is not expected to cause biased results (Little & Rubin, 2002). To make the estimated results reliable and the comparison of methods fair, the same imputed datasets were used for all subsequent Q-matrix comparisons (i.e., the expert-designed Q-matrix and the Q-matrices refined by the stepwise method and Chiu’s method).

Model evaluation criteria

A series of analyses were conducted within and across countries to provide answers to two research questions. Note that due to overlap between the two questions, some of the results provide answers to both. To deal with the first research question, model fit was investigated within and across countries. Specifically, Akaike’s Information Criterion (AIC; Akaike, 1974) and Bayesian Information Criterion (BIC; Schwarz, 1978) were used as relative fit indices; the limited-information version of Root Mean Square Error of Approximation (RMSEA2; Maydeu-Olivares & Joe, 2014; Ma & de la Torre, 2020a, b) and Standardized Root Mean Square Residual (SRMSR; Liu et al., 2016; Maydeu-Olivares, 2013) were used as absolute fit indices. As for the rule of thumb for these model-fit indices, lower AIC and BIC values indicate a better fit. Because clear guidelines for evaluating the G-DINA in terms of the RMSEA2 and SRMSR are still lacking in the current literature (von Davier & Lee, 2019), we chose to use “Below 0.05” for SRMSR and “Below 0.045” for RMSEA2 to indicate good model fit. These cut-offs are currently used in relevant research based on IRT models and loglinear cognitive diagnosis models (LCDM) respectively (Liu et al., 2016, 2017; Maydeu-Olivares et al., 2011). First, the best-fitting Q-matrix per country was determined by comparing the relative and absolute model fit criteria between the three different kinds of Q-matrices (i.e., original, stepwise, and Chiu’s Q-matrix) within each country. Then, mean rankings of model fit indices were calculated as a result of incorporating an expert-designed Q-matrix and the two refined Q-matrices in the G-DINA across five selected countries. The overall rank was calculated by (1) sorting the three Q-matrix types per country (e.g., 1 = best to 3 = worst for Finland), and then (2) averaging the given numbers over the five countries. 
That is, the Q-matrix with the lowest mean rank is the one that provided the best model fit for the G-DINA analysis overall.
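The mean-rank calculation can be illustrated with a small Python sketch. The fit values below are invented for the example and are not taken from the study's tables; the function name `mean_ranks` is hypothetical, and ties are not handled.

```python
def mean_ranks(fit_by_country):
    """fit_by_country: {country: {q_matrix_label: fit index}}, lower is better.
    Returns the rank of each Q-matrix averaged over countries."""
    totals = {}
    for fits in fit_by_country.values():
        ordered = sorted(fits, key=fits.get)        # best (lowest) value first
        for rank, label in enumerate(ordered, start=1):
            totals[label] = totals.get(label, 0) + rank
    return {label: total / len(fit_by_country) for label, total in totals.items()}

# Hypothetical index values for two unnamed countries.
toy = {
    "Country A": {"original": 103.0, "stepwise": 100.0, "chiu": 101.0},
    "Country B": {"original": 205.0, "stepwise": 202.0, "chiu": 201.0},
}
ranks = mean_ranks(toy)
```

In this toy example the original Q-matrix ranks last in both countries, while the two refined Q-matrices each rank first once and therefore share the same mean rank.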

Given that different refinement methods were evaluated and the best-fitting solutions were chosen for each country, the second research question focused heavily on looking at their impact on the interpretation of TIMSS. Student attribute mastery percentages were calculated and compared within and across countries to verify if using different Q-matrices alters the estimation of diagnostic information from the TIMSS assessment in general and for each country separately. The calculation of percentages was based on an attribute mastery status of each student estimated by the G-DINA. Differences in student attribute mastery between the original and the refined Q-matrices larger than 10% were retained as a remarkable difference.

Next, we calculated degrees of (dis-)agreement between the original and the two refined Q-matrices for all attributes. When the estimated student mastery matrices based on two comparable Q-matrices classified a student in the same category (master (1) or non-master (0)) we considered this as an agreement. When they classified a student in a different category (classifying a non-master as a master or classifying a master as a non-master), we considered this as a disagreement. The percentages of mastery/non-mastery agreement were calculated for each pair of Q-matrices for each country separately. As an example, the original Q-matrix was compared to the Q-matrix refined by the stepwise method in terms of the mastery agreement rate for one attribute in a dataset with 100 students. The estimated student mastery matrix from the original Q-matrix can be compared to the one from the refined Q-matrix. To that end, the number of the same mastery entries can be counted (for instance giving a value of 80 if both approaches agreed that 80 students mastered this attribute). The number of the different mastery entries could be observed as well (for instance leading to a value of 20 if it was found that based on the original matrix 20 students mastered this attribute but the result based on the refined Q-matrix suggested they did not master this attribute). Then, the mastery agreement of this attribute between two Q-matrices in this dataset would be 80% (i.e., 80/(80 + 20)). This kind of calculation would be applied to each country and each attribute in the following analysis. The relevant results from five countries were combined to give an overall explanation of each attribute. The average results based on nine attributes were applied to present a general overview of differences in student classification between different Q-matrices across countries. 
Third, differences between Q-matrices concerning the interpretation of item parameter estimates, e.g., intercept parameter that measures baseline effect (i.e., the success probability without any required attribute), were scrutinized.
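The agreement calculation for one attribute can be reproduced with the 100-student example described above. This is a minimal Python sketch; the function name `agreement_rate` is hypothetical, and it counts any identical classification (master or non-master) as an agreement.

```python
def agreement_rate(mastery_a, mastery_b):
    """Proportion of students classified identically (0/1) by two Q-matrices."""
    same = sum(a == b for a, b in zip(mastery_a, mastery_b))
    return same / len(mastery_a)

# 100 students: both Q-matrices classify 80 as masters; the remaining 20 are
# masters under the original Q-matrix but non-masters under the refined one.
original = [1] * 100
refined = [1] * 80 + [0] * 20
rate = agreement_rate(original, refined)
```

The resulting rate corresponds to the 80/(80 + 20) computation in the text.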

Results

In practice, Chiu’s method needs to set the condensation rule (i.e., “AND” or “OR”) due to the requirement of relevant R packages. Two condensation rules were both tried in the data analysis. Analyses based on “OR” rule did not converge, so the following parts only present results based on “AND” rule.

First, the expert-designed Q-matrix was compared to different refined Q-matrices. The results suggest that the Q-matrices refined based on the combined five-countries data are dissimilar to the expert-designed Q-matrix. The percentage of different entries for the stepwise method is 26.59% and the one for Chiu’s method is 15.61%. Also, the Q-matrices refined based on each country’s data differ from the expert-designed Q-matrix, with the differential proportion ranging from 2.5% to 22.97%. Additionally, the refined Q-matrix based on the combined five-countries data is dissimilar from the Q-matrix based on each country’s data for both the stepwise method and Chiu’s method, and the percentage of different entries ranges from 12.11% to 28.21%. Overall, it can be confirmed that the expert-designed Q-matrix is different from the refined Q-matrices, and those differences are worth exploring further.

Goodness of fit

Goodness of fit within countries

Because Chiu’s method was applied under the conjunctive assumption, the interaction parameters of the G-DINA and the comparison between the G-DINA and the DINA were scrutinized to make the comparison of methods fairer. We find that most of the interaction parameters in the G-DINA are neither zero nor close to zero when the expert-designed Q-matrix and the refined Q-matrix from the stepwise method or Chiu’s method are applied. Furthermore, the DINA and G-DINA were compared for each country’s data by means of model-fit indices and the likelihood ratio test. The model-fit indices show that almost all estimates support the G-DINA across the five countries. Only the estimated SRMSR for the Finland data favors the DINA. All likelihood ratio tests support the G-DINA. Hence, using the G-DINA for the model comparison is acceptable (see Footnote 3).

Table 3 gives the average AIC, BIC, RMSEA2, and SRMSR of the G-DINA for each refinement method per country. According to the relative fit indices, the Q-matrix refined by the stepwise method appears to be the best-fitting Q-matrix regardless of the criterion used for four out of the five selected countries (except Tunisia). For Tunisia we see a difference: the BIC value identifies the Q-matrix refined by the stepwise method as the most appropriate one, but the estimated values of the other indices support Chiu’s method. None of the model-fit indices supports the original expert-designed Q-matrix. Interestingly, none of the G-DINA models yielded a value lower than the cut-off values for good model fit of 0.045 (RMSEA2) and 0.05 (SRMSR). Most of them are between 0.06 and 0.12. Nevertheless, it is important to keep in mind that these cut-off values apply to LCDM and IRT models and have not yet been evaluated for the G-DINA. To make the conclusion of the methods comparison more robust, the combined five-country data were analyzed as well. The results indicate that fitting the G-DINA based on the Q-matrix refined by the stepwise method produces a better model fit than the alternatives, and that the refined Q-matrices always fit better than the expert-designed Q-matrix. Those findings are consistent with the results based on each country’s data.

Table 3 Results of G-DINA Model Fit

Goodness of fit across countries

The mean ranking of relative and absolute model fit indices for the different Q-matrices can be found in Table 4. The relative fit indices identify the Q-matrix refined by the stepwise method as the most suitable one across countries (mean rank = 1.17), followed by the Q-matrix refined by Chiu’s method. Across all selected countries, the original Q-matrix is the least preferred. The absolute fit indices suggest the Q-matrix refined by the stepwise method is the best one (mean rank = 1.25) as well, followed by the Q-matrix refined by Chiu’s method and the universal expert-designed Q-matrix, which is the same as the results of mean rank based on the relative fit indices.

Table 4 Mean ranking of relative and absolute model fit indices per Q-matrix across countries

Student attribute mastery

Table 5 provides an overview of student mastery per attribute, per Q-matrix, and per country. Percentages in this table can be interpreted as follows: according to the original Q-matrix, 38.46% of the Finnish students have mastered the “Whole numbers and integers” attribute. To investigate differences between Q-matrix refinement approaches, we compared the percentages obtained with the best-fitting Q-matrix according to the model-fit indices (i.e., the Q-matrix refined by the stepwise method; indicated in bold in Table 5, see Goodness of fit) with the percentages obtained with the original Q-matrix. We considered a difference in student mastery percentages of 10% or more (relative to the original percentages) a remarkable difference. These percentages are marked with an asterisk (*).

Table 5 Percentages of attribute mastery using expert-designed and refined Q-matrices

First, we find the most notable differences between the best-fitting and the original Q-matrix for Singapore. Six attributes show differences in attribute mastery larger than 10% when comparing the original with the stepwise method’s Q-matrix. Except for “Fractions, decimals and proportions”, “Data organization, representation and interpretation”, and “Probability”, all attributes show notable differences in percentages, ranging from − 22.66% (“Expressions, equations and functions”; (47.93–61.97)/61.97 in Table 5) to + 33.46% (“Whole numbers and integers”). Second, four attributes show large differences in student attribute mastery between the stepwise and original Q-matrix in the case of the USA, Australia, and Tunisia. For the USA, these are “Fractions, decimals and proportions” (− 34.11%), “Measurement” (− 20.62%), “Lines, angles and shapes” (− 38.52%), and “Data organization, representation and interpretation” (+ 11.89%). For Australia, they are “Expressions, equations and functions” (− 10.75%), “Patterns” (− 32.29%), “Measurement” (− 12.98%), and “Probability” (− 16.18%). The Tunisian data show remarkable differences for “Whole numbers and integers” (12.67%), “Patterns” (− 69.73%), “Lines, angles and shapes” (13.79%), and “Data organization, representation and interpretation” (+ 25.89%). Third, for Finland, large differences in student attribute mastery between the best-fitting (stepwise) Q-matrix and the original Q-matrix are found for three out of nine attributes, ranging from − 39.86% (“Fractions, decimals and proportions”) to − 15.39% (“Whole numbers and integers”).
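The reported differences are relative to the original percentage, as the worked example (47.93–61.97)/61.97 shows. A minimal sketch of this calculation and of the 10% flagging rule follows; the function names are our own.

```python
def relative_difference(refined_pct: float, original_pct: float) -> float:
    """Change in mastery percentage relative to the original Q-matrix, in percent."""
    return (refined_pct - original_pct) / original_pct * 100.0


def is_remarkable(refined_pct: float, original_pct: float, threshold: float = 10.0) -> bool:
    """Flag a relative difference of `threshold` percent or more in either direction."""
    return abs(relative_difference(refined_pct, original_pct)) >= threshold


# Singapore, "Expressions, equations and functions": 61.97% (original) vs 47.93% (stepwise).
diff = relative_difference(47.93, 61.97)
print(f"{diff:+.2f}%")  # -22.66%
print(is_remarkable(47.93, 61.97))  # True
```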

From Table 5 we can also derive remarkable student attribute mastery differences between the original Q-matrix and the Q-matrix of Chiu’s method across countries. Overall, we see a strong agreement between the original, stepwise, and Chiu Q-matrices. Yet, we find 21 remarkable differences between the original and the stepwise Q-matrix and 35 remarkable differences between the original and Chiu's Q-matrix across all countries and attributes. Although we find many differences between the original and the stepwise Q-matrix, we cannot distinguish tendencies for specific attributes. Whether the model based on the stepwise Q-matrix classifies more or fewer students as masters of a specific attribute depends on the country.

Overall classification of students as masters or non-masters

Table 6 presents the range of agreement rates in the classification of students as masters or non-masters of the nine attributes between the three kinds of Q-matrices (original, stepwise, and Chiu's Q-matrix) for the five selected countries. The mastery and non-mastery agreement rates averaged over the nine attributes give a general impression of how the three Q-matrices compare. First, we see that the original Q-matrix has high agreement rates with the Q-matrices refined by the stepwise method: the agreement rates for mastery and non-mastery are both over 80%. In contrast, agreement rates between Chiu's method and the other methods are lower (mostly between 70 and 75%). Nevertheless, the three Q-matrices agree on the classification of most students as masters or non-masters, with rates higher than 70%. In addition, we explored why the classifications differ between Q-matrices. The reason could be overall Q-matrix misfit or differences across countries. To examine this question, the Q-matrix refined on the combined five-country data was included in the comparison. The results in Appendix 4 indicate that both could contribute to the differences in classification. Currently, there is no clear pattern.
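To make the notion of an agreement rate concrete, the sketch below computes mastery and non-mastery agreement for one attribute between two sets of classifications. The definition is an assumption made for illustration (the share of students classified as masters, or non-masters, under a reference Q-matrix who receive the same classification under the comparison Q-matrix); it may differ from the exact computation behind Table 6, and the classification vectors are hypothetical.

```python
def agreement_rates(reference, comparison):
    """Return (mastery %, non-mastery %) agreement for one attribute.

    Both arguments are sequences of 0/1 classifications for the same students.
    Assumed definition: agreement is conditional on the reference classification.
    """
    if len(reference) != len(comparison):
        raise ValueError("Both classifications must cover the same students.")
    # Comparison classifications of students who are masters / non-masters in the reference.
    among_masters = [c for r, c in zip(reference, comparison) if r == 1]
    among_non_masters = [c for r, c in zip(reference, comparison) if r == 0]
    mastery = (
        100.0 * sum(among_masters) / len(among_masters)
        if among_masters else float("nan")
    )
    non_mastery = (
        100.0 * sum(1 - c for c in among_non_masters) / len(among_non_masters)
        if among_non_masters else float("nan")
    )
    return mastery, non_mastery


# Hypothetical classifications of ten students on one attribute.
original_q = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
stepwise_q = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

rates = agreement_rates(original_q, stepwise_q)
print(rates)  # (80.0, 80.0)
```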

Table 6 Mastery/non-mastery agreement rates (%) for student mastery between each pair of Q-matrices

Estimated item parameters

Figure 2 shows five scatter plots (corresponding to the five countries) that represent the item parameter estimates of the G-DINA model, specifically the intercept estimate (= \({\widehat{\delta }}_{j0}\)), where the original Q-matrix (x-axis) and the two refined Q-matrices (y-axis) are used. Recall that the intercept parameter refers to the probability of correctly solving item \(j\) without mastering any of the required attributes (i.e., the baseline effect). In each plot, the estimates (= dots) that involve the stepwise and Chiu’s methods are colored in blue and red, respectively. Dots that fall on the straight line in the plot indicate that the estimates based on the original and the refined Q-matrix match exactly. The results show that the intercept estimates using the stepwise method are more similar to the estimates using the original Q-matrix than those using Chiu’s method are, especially for Finland, Singapore, Australia, and Tunisia. The estimates using Chiu’s method (in red) are in general (but more noticeably for Singapore, USA, and Australia) lower than the estimates using the original Q-matrix and, therefore, deviate remarkably toward the bottom right corner. Interestingly, our findings in this figure align with the previous classification results (Table 6), where the agreement rates between the original Q-matrix and the stepwise Q-matrix are larger.
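For reference, the role of the intercept can be read off the standard identity-link form of the G-DINA item response function. The notation below is an assumption and may differ slightly from the model section of this article: \(K_{j}^{*}\) denotes the number of attributes required by item \(j\), and \(\alpha_{lk}\) indicates whether a student in latent group \(l\) has mastered the \(k\)-th of these attributes.

\[P\left(\boldsymbol{\alpha}_{lj}^{*}\right)={\delta }_{j0}+\sum_{k=1}^{K_{j}^{*}}{\delta }_{jk}{\alpha }_{lk}+\sum_{k=1}^{K_{j}^{*}-1}\sum_{k'=k+1}^{K_{j}^{*}}{\delta }_{jkk'}{\alpha }_{lk}{\alpha }_{lk'}+\cdots +{\delta }_{j12\ldots K_{j}^{*}}\prod_{k=1}^{K_{j}^{*}}{\alpha }_{lk}\]

When none of the required attributes is mastered, every \(\alpha_{lk}\) equals zero, all main-effect and interaction terms vanish, and the success probability reduces to the intercept:

\[P\left(\boldsymbol{0}\right)={\delta }_{j0}\]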

Discussion

In this study, we examined the impact of a misspecified expert-designed Q-matrix on CDA using international large-scale data (TIMSS 2011 8th grade mathematics). In particular, we investigated whether refining the Q-matrix separately on each country’s data leads to different Q-matrices for the same assessment. First, the performance of the G-DINA model using refined Q-matrices was compared with its performance using the expert-designed Q-matrix in terms of model fit criteria. Our investigation of the TIMSS data made clear that the original Q-matrix, which was designed by experts without regard to countries, did not fit as well as the two refined Q-matrices obtained with the stepwise validation method (Ma & de la Torre, 2020a, b) and Chiu’s nonparametric classification method (Chiu, 2013). This finding, to some extent, justifies the use of country-specific Q-matrices refined for each country separately. While both refinement methods have pros and cons (as mentioned in the "Q-matrix refinement methods" section) that researchers and practitioners must consider, we found from the TIMSS data that the stepwise method yielded a better model fit than Chiu’s method across the countries. The Q-matrices refined by the stepwise method for the five selected countries are provided in Appendix 5. Next, we found that using the G-DINA model with the stepwise Q-matrix produced noticeably different results from using it with the expert-designed Q-matrix in terms of probabilities of attribute mastery, student classification, and item parameter estimates. Furthermore, the impact of using the refined Q-matrices varied across countries.

Several of our findings merit further discussion. First, researchers and practitioners need to consider the advantages and disadvantages of using country-specific Q-matrices. One possible advantage is that the expert-designed (or original) Q-matrix is refined for each country separately. Therefore, any difference among countries in the refined outcomes (i.e., the number of attributes required for an item) could reflect a country’s unique instructional content. On the other hand, one disadvantage of using country-specific Q-matrices in international comparison studies is that directly comparing student attribute mastery across countries may be unfair (or biased), because the Q-matrix in the fitted model differs from country to country after the country-specific refinement. Another possible disadvantage concerns retrofitting CDMs and refining the Q-matrix: the design of large-scale assessments may not satisfy the completeness of the Q-matrix or the identifiability conditions that are required for identifying proficiency classes and estimating model parameters (Köhn & Chiu, 2016). Overall, we believe that accounting for population heterogeneity in Q-matrix refinement is a relatively new and unexplored topic in the area; further research is needed on the appropriate use of refined Q-matrices in real-life assessments.

Notwithstanding the unique insights offered by the current study, there remain some limitations to be considered. First, the two Q-matrix refinement approaches are still contingent on the original Q-matrix designed by experts, because they only replace individual q-entries. Therefore, altering the number of attributes specified in the Q-matrix is beyond the scope of the current study. Furthermore, it is important to note that when the q-entries in the Q-matrix change, the definition and interpretation of the attributes may change to some degree. As a result, the best-fitting Q-matrix may not necessarily be interpretable or have practical value (Bradshaw et al., 2014; de la Torre, 2008). Moreover, no consensus exists in the current CDA and G-DINA literature regarding the number and content of the attributes specified in the TIMSS mathematics Q-matrix. Researchers specify different Q-matrices for the same test and compare countries according to the predefined attributes (e.g., Im & Park, 2010; Park et al., 2017; Sedat & Arican, 2015). Therefore, a universal attribute design of the Q-matrix remains a critical issue within the current CDA literature (Groß et al., 2016).

Second, the original Q-matrix used in this study was established after the test items in TIMSS were calibrated with a unidimensional IRT model. Retrofitting CDMs to the data may result in an unbalanced Q-matrix where some attributes are measured by considerably more items than others (Sedat & Arican, 2015). In this study, this was most pronounced for the “Probability” attribute, which was measured by only five items in the original Q-matrix. This imbalance can distort the attribute classification of students because a small number of items per attribute can create a situation in which responses to one or a couple of items determine the student’s mastery of that attribute (Jurich & Bradshaw, 2014). Therefore, to increase the validity and reliability of student attribute mastery patterns estimated by CDMs, we recommend defining a relevant set of attributes first and then writing items that tap these attributes, instead of the other way around (Birenbaum et al., 2005; Bradshaw et al., 2014).
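A quick balance check of this kind can be sketched by counting, for each attribute, the number of items whose q-entry is 1 (the column sums of the Q-matrix). The small Q-matrix, the attribute labels, and the minimum of two items used below are hypothetical illustration choices, not values from this study.

```python
def items_per_attribute(q_matrix, attribute_names):
    """Count how many items measure each attribute (column sums of the Q-matrix)."""
    if any(len(row) != len(attribute_names) for row in q_matrix):
        raise ValueError("Each Q-matrix row needs one entry per attribute.")
    counts = [sum(column) for column in zip(*q_matrix)]
    return dict(zip(attribute_names, counts))


def sparse_attributes(counts, min_items):
    """Return attributes measured by fewer than `min_items` items."""
    return [name for name, n in counts.items() if n < min_items]


# Hypothetical Q-matrix: 5 items (rows) by 3 attributes (columns).
attribute_names = ["A1", "A2", "A3"]
q_matrix = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]

counts = items_per_attribute(q_matrix, attribute_names)
flagged = sparse_attributes(counts, min_items=2)
print(counts)   # {'A1': 4, 'A2': 3, 'A3': 1}
print(flagged)  # ['A3']
```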

Third, this study is exploratory in nature, and some of our approaches to investigating differences between Q-matrices are crude. For example, no reliability estimates of student attribute mastery (e.g., Sessoms & Henson, 2018) are provided, nor did we perform any significance tests to investigate differences in student attribute mastery between different Q-matrices. In addition, we did not involve domain experts from the specific countries to examine whether the refined country-specific Q-matrices recommended by the stepwise method and Chiu’s method are meaningful or better than the original universal Q-matrix. We want to stress that the data-driven results are sample-dependent and that generalization to other countries or time points may not be warranted. Moreover, the refinement methods should be applied to sufficiently large datasets, but it is not yet clear when a sample size can be considered sufficiently large. Therefore, we recommend that the data-driven results be double-checked with domain experts or be used as ancillary information when the Q-matrix is under discussion. Relying solely and blindly on the data-driven results is certainly not recommended.

Conclusions

This study provided useful insights concerning differences between Q-matrix refinement techniques, the country-specificity of Q-matrices, and the consequences for practice. Findings from this study could help optimize the Q-matrix so CDMs (e.g., G-DINA) can be more widely used to extract diagnostic attribute-level information out of international comparative tests. Together with the expertise of domain experts, teachers and policymakers could use this fine-grained information to tailor their instruction to students’ specific weaknesses and link this diagnostic information to the existing curricula and instructional practices of their particular country. In this way, CDMs are a crucial diagnostic information source that could help improve education systems all over the world.

Fig. 1 Example item of the TIMSS 2011 8th grade mathematics test (Mullis et al., 2012)

Fig. 2 Intercept Parameters per item for the four G-DINA models per Country

Fig. 3 Flowchart of the Stepwise Method. This flowchart was reprinted from Ma and de la Torre (2020b)

Fig. 4 Flowchart of Chiu’s method. This flowchart was created based on the explanations of Chiu (2013)

Availability of data and materials

The datasets analyzed during the current study are from TIMSS 2011 and are available at https://timssandpirls.bc.edu/timss2011/international-database.html.

Notes

  1. An attribute can be defined as a “skill or content knowledge that is required to solve a test item.” (Choi et al., 2015).

  2. The expert-designed Q-matrix can be found in Appendix 3.

  3. Relevant results can be found at the following link: “https://github.com/supplement-material/Q-matrix-paper”.

Acknowledgements

Not applicable.

Funding

The authors received no specific funding for this work.

Author information

Contributions

JD and CC contributed equally to this work as first authors. They prepared the initial draft of the analyses and manuscript. JYP and WVN supervised the study and revised and commented on the draft. All authors read and approved the final manuscript.

Authors’ information

Jolien Delafontaine is a PhD student at the Faculty of Psychology and Educational Sciences, Parenting and Special education Unit of the KU Leuven. Her doctoral research focuses on effective teaching for students with special educational needs (SEN).

Changsheng Chen is a PhD student at the Faculty of Psychology and Educational Sciences, and the imec research group itec of the KU Leuven. His doctoral research focuses on learning analytics.

Jung Yeon Park is an assistant professor of quantitative research methods at George Mason University. Her research focuses on cognitive diagnosis models, large-scale educational assessments, and learning analytics.

Wim Van den Noortgate is a professor of statistics at the Faculty of Psychology and Educational Sciences, and the imec research group itec of the KU Leuven. His major interests include learning analytics and meta-analysis.

Corresponding author

Correspondence to Changsheng Chen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Appendices

Appendix 1: Flowcharts of the Stepwise Method and Chiu’s Method

See Figs. 3, 4.

Appendix 2: Attributes of the Original Q-matrix

See Table 7.

Table 7 Attributes, their explanation and frequency in the expert-designed Q-matrix

Appendix 3: Q-matrix for 89 selected items of TIMSS 2011

See Table 8.

Table 8 Q-matrix designed by experts for 89 Items of TIMSS 2011 8th grade mathematics test

Appendix 4: Mastery/Non-mastery Agreement Rates (%) between each pair of Q-matrices

See Table 9.

Table 9 Mastery/non-mastery agreement rates (%) between each pair of Q-matrices based on five-countries and country-specific data

Appendix 5: Refined Q-matrix for five selected countries based on the stepwise method

See Tables 10, 11, 12, 13, and 14.

Table 10 Q-matrix refined by the stepwise method: Finland

Table 11 Q-matrix refined by the stepwise method: United States

Table 12 Q-matrix refined by the stepwise method: Singapore

Table 13 Q-matrix refined by the stepwise method: Australia

Table 14 Q-matrix refined by the stepwise method: Tunisia

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Cite this article

Delafontaine, J., Chen, C., Park, J.Y. et al. Using country-specific Q-matrices for cognitive diagnostic assessments with international large-scale data. Large-scale Assess Educ 10, 19 (2022). https://doi.org/10.1186/s40536-022-00138-4

Keywords

  • G-DINA
  • Q-matrix refinement
  • Stepwise validation method
  • Nonparametric classification method
  • TIMSS 2011 mathematics
  • International comparison