Using country-specific Q-matrices for cognitive diagnostic assessments with international large-scale data
Large-scale Assessments in Education volume 10, Article number: 19 (2022)
Abstract
In cognitive diagnosis assessment (CDA), the impact of misspecified item-attribute relations (or “Q-matrix”) designed by subject-matter experts has been a great challenge to real-world applications. This study examined parameter estimation of the CDA with the expert-designed Q-matrix and two refined Q-matrices for international large-scale data. Specifically, the G-DINA model was used to analyze TIMSS data for Grade 8 for five selected countries separately, and the need for a refined Q-matrix specific to each country was investigated. The results suggested that the two refined Q-matrices fitted the data better than the expert-designed Q-matrix, and the stepwise validation method performed better than the nonparametric classification method, resulting in a substantively different classification of students in attribute mastery patterns and different item parameter estimates. The results confirmed that the use of country-specific Q-matrices based on the G-DINA model led to a better fit compared to a universal expert-designed Q-matrix.
Introduction
International comparative assessments such as PISA (Programme for International Student Assessment) or TIMSS (Trends in International Mathematics and Science Study) have the power to influence educational policy and practice to a large extent (Sedat & Arican, 2015). Item response theory (IRT; Baker, 2001) has traditionally been used to analyze such large-scale assessments and to provide information about students’ abilities. This method summarizes students’ overall ability in a particular subject (e.g., mathematics, reading, or science) by means of a single ability score (Chen, 2017; Nájera et al., 2019). Student achievement is then compared across countries and an international benchmark is set. Unfortunately, the general ability score provides neither teachers nor policymakers with the fine-grained diagnostic information necessary to determine whether students have mastered a particular domain. This makes it difficult to implement a targeted educational strategy based on international large-scale assessments. Cognitive diagnosis assessment (CDA) makes it possible to interpret students’ assessment outcomes in terms of fine-grained attributes^{Footnote 1} that are directly related to students’ success in a given subject domain, so that statistical data analysis may provide richer information about which attributes students have mastered (Jurich & Bradshaw, 2014).
The Generalized Deterministic Inputs, Noisy “And” Gate model (G-DINA; de la Torre, 2011), one of the widely used psychometric models for CDA, can be used for just this purpose. The G-DINA aims to measure the extent to which students master a set of cognitive attributes (e.g., fractions, proportions, and decimals as a fine-grained cognitive attribute in mathematics) in order to improve educational policy and practice. Some studies have already employed the G-DINA to analyze data from international comparative tests, such as PISA (Jia et al., 2021; Wu et al., 2020). However, before the G-DINA can be used on international assessments to perform the CDA, a content analysis of the test must occur (von Davier & Lee, 2019). Domain experts conduct this analysis to identify a set of related attributes or skills that measure a few broad domains and to define each item by the subset of attributes it requires (Nájera et al., 2019). Researchers refer to such an internal structure, in which the item-attribute relations are specified, as a “Q-matrix” (Tatsuoka, 1984). This two-dimensional matrix, with items defining the rows and attributes defining the columns, contains only entries of one (the attribute is required to solve the item) or zero (the attribute is not required to solve the item).
Currently, using the Q-matrix for international tests has two major drawbacks. First, the Q-matrix is designed on the basis of expert judgement. In other words, content experts specify the cognitive attributes and their relations with the items. The expert-designed Q-matrix is not always perfect and may contain misspecifications. Because the content analysis behind the Q-matrix relies on the fallible judgements of subject-matter experts (Chen, 2017; Nájera et al., 2019; Terzi & de la Torre, 2018), misspecifications of the Q-matrix could have serious consequences for the estimation of students’ attribute patterns and, consequently, for the interpretation of the data (de la Torre & Chiu, 2016; Köhn & Chiu, 2018; Nájera et al., 2019). Additionally, experts from different countries may disagree on the relationship between items and attributes because of their different educational backgrounds, country-specific curricula, and teaching situations. Those disagreements may produce uncertainty about the Q-matrix, which leaves room for further improvement. For these reasons, researchers have proposed various refinement methods (Chiu, 2013). This study addresses the potential impact of a misspecified Q-matrix on the CDA in TIMSS and explores the performance of refined Q-matrices.
Second, the same Q-matrix is specified for every participating country. Despite prior work on this issue, there is little evidence in the literature to support the use of a common Q-matrix for different population groups. That is, we cannot simply assume that the expert-designed universal Q-matrix from TIMSS will perform similarly when used to analyze data from different countries. Further, we seek to determine whether a particular refinement approach performs better across various countries.
In the following section, we describe the conceptual background of the G-DINA and two different Q-matrix refinement methods. Next, we apply these techniques to study whether different Q-matrix refinement approaches arrive at different solutions within and between countries, and which solution provides the best model fit. Finally, we investigate the usefulness of country-specific Q-matrices.
G-DINA
The G-DINA belongs to the family of Cognitive Diagnostic Models (CDMs; Rupp et al., 2010), which are considered a special case of Latent Class Models (LCMs; Hagenaars & McCutcheon, 2002) in which the attribute patterns are modelled to categorize students by means of latent class variables. Specifically, the attribute patterns of the students are conceptually unobservable and therefore have to be inferred from their observed responses to a set of items in a test (Chen, 2017). CDMs are confirmatory in nature because the relationships between the categorical latent variables (attributes) and the test items are defined a priori in a Q-matrix (Ravand & Robitzsch, 2015). A number of different modelling approaches are in use within CDMs, depending on how relationships between attributes and item responses are modelled and how the attributes themselves are combined (Ravand & Robitzsch, 2015; Rupp et al., 2010).
Many CDMs assume certain relationships between items and attributes, such as the Deterministic Inputs, Noisy “And” Gate model (DINA; de la Torre, 2009; Junker & Sijtsma, 2001), which assumes that attributes are conjunctive, and the Deterministic Inputs, Noisy “Or” Gate model (DINO; Templin & Henson, 2006), which assumes that attributes are disjunctive. However, the attribute relationship is sometimes unclear before the model is applied. In that respect, the G-DINA seems to be a reasonable choice for fitting real-life data because it does not constrain the relationship among attributes (i.e., it does not impose a conjunctive, disjunctive, or additive assumption; de la Torre, 2011).
A saturated G-DINA for dichotomous responses with identity link is expressed as follows (de la Torre, 2011; von Davier & Lee, 2019):
\[P\left(X_{j}=1 \mid \boldsymbol{\alpha}_{lj}^{*}\right)=\delta_{j0}+\sum_{k=1}^{K_{j}^{*}}\delta_{jk}\alpha_{lk}+\sum_{k'=k+1}^{K_{j}^{*}}\sum_{k=1}^{K_{j}^{*}-1}\delta_{jkk'}\alpha_{lk}\alpha_{lk'}+\dots+\delta_{j12\dots K_{j}^{*}}\prod_{k=1}^{K_{j}^{*}}\alpha_{lk}\]
The model estimates the probability of success for item j under different attribute patterns. Here k indexes an attribute required by the Q-matrix and \(K_{j}^{*}\) is the total number of required attributes for item j. \(\boldsymbol{\alpha}_{lj}^{*}\), with \(l=1,\dots,2^{K_{j}^{*}}\), denotes the reduced attribute vector for item j, which keeps only the required attributes. \(P\left(X_{j}=1 \mid \boldsymbol{\alpha}_{lj}^{*}\right)\) denotes the probability of a correct answer to item j conditional on attribute pattern \(\boldsymbol{\alpha}_{lj}^{*}\). For instance, if answering item j correctly in a test designed to measure four attributes requires the 2nd, 3rd, and 4th attributes (i.e., \(K_{j}^{*}=3\)), the reduced attribute pattern is denoted by \(\boldsymbol{\alpha}_{lj}^{*}=\left(\alpha_{l2},\alpha_{l3},\alpha_{l4}\right)^{\prime}\).
In addition, \(\delta_{j0}\) is the intercept, indicating the baseline probability (i.e., the success probability without any required attribute being mastered); \(\delta_{jk}\) is the main effect, namely the change in probability when students master a single attribute \(\alpha_{k}\); \(\delta_{jkk'}\) is the first-order interaction effect due to \(\alpha_{k}\) and \(\alpha_{k'}\), indicating the change in probability with the mastery of two attributes; \(\delta_{j12\dots K_{j}^{*}}\) is the effect of the mastery of all attributes. The higher-order interaction effects can be added to the model, but are omitted here for brevity (denoted by three dots in the formula). Parameters of the G-DINA are estimated by an expectation–maximization implementation of marginalized maximum likelihood estimation (de la Torre, 2011).
Q-matrix refinement methods
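To make the roles of the intercept, main effects, and interaction effects concrete, the following Python sketch evaluates the identity-link success probability for a single item. It is only an illustration: the function name `gdina_success_probability`, the dictionary layout used for the \(\delta\) parameters, and the numerical values are assumptions made for this example and are not taken from the study or from the GDINA R package.

```python
from itertools import combinations


def gdina_success_probability(alpha_reduced, delta):
    """Identity-link G-DINA success probability for one item.

    alpha_reduced: tuple of 0/1 mastery indicators for the item's required attributes.
    delta: dict mapping tuples of attribute positions to effects;
           the empty tuple () holds the intercept.
    """
    positions = range(len(alpha_reduced))
    prob = 0.0
    for order in range(len(alpha_reduced) + 1):
        for subset in combinations(positions, order):
            # An effect contributes only if every attribute in the subset is mastered.
            # The empty subset always qualifies, so the intercept is always added.
            if all(alpha_reduced[k] == 1 for k in subset):
                prob += delta.get(subset, 0.0)
    return prob


# Hypothetical parameters for an item that requires two attributes.
delta_example = {(): 0.2, (0,): 0.1, (1,): 0.2, (0, 1): 0.4}

for pattern in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(pattern, round(gdina_success_probability(pattern, delta_example), 2))
```

With these made-up values the four reduced attribute patterns give success probabilities of 0.2, 0.3, 0.4, and 0.9, so mastering both attributes adds the interaction effect on top of the two main effects.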
To handle uncertainty about some of the item-attribute relations in the Q-matrix, several studies have proposed parametric (de la Torre & Chiu, 2016) and nonparametric (Chiu, 2013; Desmarais & Naceur, 2013) approaches to validate and refine the initial design. It is important to note that these methods are all confirmatory in nature, in the sense that they refine the Q-matrix initially designed by experts (Nájera et al., 2019). Furthermore, these refinement methods only handle misspecifications of how items are linked to the attributes (misspecification of the rows of the Q-matrix, or q-vectors) and not misspecification of the set of underlying attributes (misspecification of the columns of the Q-matrix) (Chiu, 2013). Lastly, these refinement methods are data-driven: they use students’ answers to the test to construct a refined Q-matrix. Thus, a Q-matrix refined by the same method but using a different dataset (e.g., the TIMSS 2011 8th grade mathematics scores from a different country) is likely to be different.
Stepwise validation method
The stepwise validation method for Q-matrix refinement was proposed by Ma and de la Torre (2020a, b). Given an expert-defined Q-matrix, it combines the G-DINA Discrimination Index (GDI; de la Torre & Chiu, 2016) with the Wald statistic based on the G-DINA, and can be regarded as an extended version of the GDI method. The method was originally designed for graded response data in tandem with the sequential G-DINA, but it also works with the G-DINA when the items are dichotomous. Compared to other Q-matrix refinement methods, the stepwise validation method does not need assumptions about the processing function, and because it includes the Wald statistic it can take the estimation errors of the item parameters into account (Ma & de la Torre, 2020a, b). The mechanism can be summarized in two steps. The first step selects required attributes using the GDI, and the second step compares candidate attributes based on the Wald statistic. The GDI criterion for selecting the recommended q-vector is the Proportion of Variance Accounted For (PVAF). de la Torre and Chiu (2016) recommended 0.95 as the rule of thumb for PVAF. Among the q-vectors with a PVAF larger than 0.95, the one with the lowest number of attributes is recommended.
The operation is explained in detail here, and a flowchart can be consulted in Appendix 1 (Fig. 3). For a given item, two sets of attributes are defined: a set A of all required attributes and a set B of all target (or candidate) attributes that need to be tested. A q-vector search bank C with all single-attribute competing q-vectors is defined as well. Initially, A is an empty set and B is a set with all attributes. The first step is to replace the provisional q-vector in the expert-designed Q-matrix with the competing q-vector in C for each target attribute in B, and to calculate the corresponding PVAF (Ma & de la Torre, 2020a, b). The target attribute with the highest PVAF is defined as a required attribute and is moved from B to A. The second step is to examine whether the q-vector with the required attributes from A is recommended by the GDI (i.e., PVAF > 0.95; Ma & de la Torre, 2020a, b). When the PVAF of the q-vector with the required attributes from A is higher than 0.95, the validation process stops, which means the required attributes in A are validated. If not, the search bank C is updated so that each competing q-vector contains all required attributes in A plus one target attribute from B. This yields several competing q-vectors that share the required attributes in A but differ in the target attribute taken from B. Tests based on the Wald statistic are then performed to examine whether each target attribute is necessary for its competing q-vector. If the Wald statistic suggests that none of the target attributes is necessary, the validation procedure stops. Otherwise, at least one target attribute can be recommended. Among them, the one in the competing q-vector with the highest PVAF is regarded as the required attribute and is moved from B to A. After that, the required attributes are examined by the Wald statistic, and any unnecessary one is moved from A to B. The procedure is iterated until no attributes can be added or removed between set A and set B (Ma & de la Torre, 2020a, b).
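The PVAF rule used in the GDI step can be sketched in Python as follows. This is a simplified illustration of the plain GDI selection rule only; it does not include the Wald tests or the set updates of the stepwise method. The function names, the tie-breaking rule (highest PVAF among equally simple q-vectors), and all input values are assumptions for this example. The inputs are the full attribute patterns, their weights (e.g., estimated class proportions), and the item's success probability for each pattern.

```python
from itertools import product


def gdi(q, patterns, weights, probs):
    """Weighted variance of success probabilities over the reduced patterns implied by q."""
    groups = {}
    for alpha, w, p in zip(patterns, weights, probs):
        # Collapse full patterns that agree on the attributes required by q.
        key = tuple(a for a, required in zip(alpha, q) if required)
        w_sum, wp_sum = groups.get(key, (0.0, 0.0))
        groups[key] = (w_sum + w, wp_sum + w * p)
    total_w = sum(w for w, _ in groups.values())
    mean_p = sum(wp for _, wp in groups.values()) / total_w
    return sum((w / total_w) * (wp / w - mean_p) ** 2
               for w, wp in groups.values() if w > 0)


def recommend_q_vector(patterns, weights, probs, cutoff=0.95):
    """Return the simplest q-vector whose PVAF exceeds the cutoff, with its PVAF."""
    n_attr = len(patterns[0])
    full_gdi = gdi((1,) * n_attr, patterns, weights, probs)
    candidates = []
    for q in product((0, 1), repeat=n_attr):
        if not any(q):
            continue  # a q-vector must require at least one attribute
        pvaf = gdi(q, patterns, weights, probs) / full_gdi
        if pvaf > cutoff:
            # Sort key: fewest attributes first, then highest PVAF (assumed tie-break).
            candidates.append((sum(q), -pvaf, q))
    _, neg_pvaf, best = min(candidates)
    return best, -neg_pvaf


# Hypothetical item whose success probability depends on the first attribute only.
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights = [0.25, 0.25, 0.25, 0.25]
probs = [0.2, 0.2, 0.8, 0.8]

best_q, best_pvaf = recommend_q_vector(patterns, weights, probs)
print(best_q, round(best_pvaf, 3))
```

In this toy case the single-attribute q-vector (1, 0) already accounts for all of the variance, so it is preferred over the full q-vector (1, 1).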
Some limitations of this method have been identified. In the proposed algorithm, the cutoff value for PVAF in the GDI step is fixed at 0.95, which has been criticized. Liu (2015) and Wang et al. (2018) argued that the cutoff value should be adjusted based on sample size. Another limitation concerns the Wald statistic. As Ma and de la Torre (2020a, b) noted, the Wald test is an important component of the stepwise validation method, and its performance can be further improved by using a better-estimated variance–covariance matrix (Liu et al., 2019).
Chiu’s nonparametric classification method
Chiu (2013) also proposed a method to identify and correct misspecified q-vectors in a Q-matrix. The method is based on the nonparametric classification method and on comparisons of the residual sum of squares (RSS) between the observed and predicted responses among all possible Q-matrices, given an expert-based Q-matrix. The algorithm consists of several steps, which are clarified in a flowchart in Appendix 1 (Fig. 4) (Chiu, 2013). The algorithm begins by selecting the item with the highest RSS, which is the item most likely to be misspecified and whose q-vector should be updated. Then, the algorithm searches over all possible q-vectors and replaces the q-vector under consideration with the one with the lowest RSS. The procedure is iterative and stops when all items have been visited and the RSS of each item hardly changes anymore (Chiu, 2013). An advantage of this method, compared to model-based methods, is that it does not rely on the model parameters of CDMs when optimizing the algorithm. Furthermore, the approach guarantees good student classification even when the true CDM underlying the observed item responses is unknown. Its performance, effectiveness, efficiency, and applicability were demonstrated in simulation studies by Chiu (2013).
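The RSS comparison for a single item can be sketched in Python as follows. This is a deliberately reduced illustration and not the NPCD implementation: it assumes the conjunctive ("AND") rule for the ideal response, and it takes the students' attribute patterns as given, whereas the method described above estimates them nonparametrically and iterates over items. All function names and data are hypothetical.

```python
from itertools import product


def ideal_response(alpha, q):
    """Conjunctive ideal response: 1 only if every attribute required by q is mastered."""
    return int(all(a >= r for a, r in zip(alpha, q)))


def item_rss(responses, alphas, q):
    """Residual sum of squares between observed and ideal responses for one item."""
    return sum((x - ideal_response(alpha, q)) ** 2
               for x, alpha in zip(responses, alphas))


def best_q_vector(responses, alphas):
    """Search all non-zero q-vectors and return the one with the lowest RSS."""
    n_attr = len(alphas[0])
    candidates = [q for q in product((0, 1), repeat=n_attr) if any(q)]
    return min(candidates, key=lambda q: item_rss(responses, alphas, q))


# Hypothetical data: six students, two attributes, one item.
alphas = [(0, 0), (1, 0), (0, 1), (1, 1), (1, 0), (0, 1)]
responses = [0, 0, 1, 1, 0, 1]

for q in [(0, 1), (1, 0), (1, 1)]:
    print(q, item_rss(responses, alphas, q))
print("lowest RSS:", best_q_vector(responses, alphas))
```

Here the responses follow mastery of the second attribute exactly, so the q-vector (0, 1) has an RSS of 0, while (1, 0) and (1, 1) leave 4 and 2 mismatches, respectively.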
One major limitation of this method is that it is unable to handle missing data. Because of the booklet design of many large-scale assessments (including TIMSS), missingness by design is a common feature of these studies. Thus, all missing data need to be imputed before Chiu’s method can be applied. Second, Chiu’s method is nonparametric, whereas parametric models should provide more powerful results when the distributional assumptions are not violated, especially for large samples (Terzi & de la Torre, 2018).
The overarching goal of this study is to explore the adequacy of (1) the selected Q-matrix refinement techniques and (2) the country specificity of Q-matrices in the case of the TIMSS 2011 8th grade mathematics test. This leads to the following research questions:

Do the country-specific refined Q-matrices offer a better model fit than the original Q-matrix designed by domain experts? If so, is there a particular refinement method that performs better (or worse) than other methods?

Does using the country-specific Q-matrices with the best model fit alter the interpretation of diagnostic assessments for the TIMSS 2011 eighth-grade mathematics assessment? If so, does this impact differ across countries?
Materials and methods
TIMSS 2011 8th grade mathematics
For this study, we used data from the TIMSS 2011 eighth-grade mathematics assessment. Specifically, we used student responses to 89 items that were released in the TIMSS database. They comprised 48 multiple-choice items, 32 open-ended questions, and 9 constructed-response questions. A score of 1 was given to a completely correct answer and a score of 0 to a partly correct or wrong answer. Omitted items were scored as incorrect (0) and items missing by design were coded as not available (NA). Two criteria were used to select countries from the database: (1) the TIMSS 2011 results of the country were reliably measured (according to Mullis et al., 2012); and (2) the selected countries are located on different continents. Consequently, we chose Finland (Europe), the USA (North America), Singapore (Asia), Australia (Oceania), and Tunisia (Africa) for the study. Table 1 provides the sample sizes for the five countries.
Q-matrix for 8th grade TIMSS 2011 mathematics
In order to analyze data with the G-DINA, subject-matter experts must prepare a Q-matrix for the test items. Details about constructing Q-matrices for TIMSS 2007 and 2011 are presented in previous studies (Johnson et al., 2013). In essence, the procedure begins with the four content domains specified in the original TIMSS framework (i.e., number, algebra, geometry, and data & chance), which are further described by multiple topic areas and the accompanying 55 objectives that are part of the mathematics curricula of a majority of countries. For the TIMSS 2011 8th grade mathematics assessment, the experts combined closely related objectives and defined a total of nine attributes measured by the 89 items. The list of attributes and the number of items involved in the Q-matrix can be found in Table 2. Descriptions of those attributes are available in Johnson et al. (2013) (see the excerpt in Appendix 2). The Q-matrix for the released item sets in the TIMSS 2011 8th grade mathematics assessment is available in Park et al. (2017).^{Footnote 2} Figure 1 gives an example of a multiple-choice question among the released items. According to the Q-matrix, answering this item correctly requires mastery of two attributes: “Expressions, equations and functions” and “Measurement”.
Analyses
Q-matrix refinements and G-DINA
The research questions were investigated empirically by fitting the G-DINA using the original and refined Q-matrices. First, the expert-designed Q-matrix (for the 89 selected items) was validated and refined based on each of the five selected countries’ data, as well as on the combined data from the five countries, using two refinement methods: (1) the stepwise validation method (hereafter the stepwise method) and (2) Chiu’s nonparametric classification method (hereafter Chiu’s method). Next, the refined Q-matrices of each of the five countries and of the combined data were compared with the expert-designed Q-matrix. Thereafter, the G-DINA was fitted with the different Q-matrices (i.e., one expert-designed Q-matrix and two refined Q-matrices based on the two refinement methods) for each country. To avoid using the same data both for refining the Q-matrix and for estimating model-fit indices, and thus to keep the evaluations stable and reliable, the data of each country were divided into two parts: a random subset of 50% of the data was used for Q-matrix refinement and the other 50% for the G-DINA estimation. This operation was repeated ten times, and the average value of each model-fit index was used as the evidence for conclusions. We used R 4.1.2 (R Core Team, 2021) with the GDINA package (version 2.8.8; Ma & de la Torre, 2020a, b) and the NPCD package (version 1.0–11; Zheng et al., 2019) to perform the Q-matrix refinements and G-DINA analysis.
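The repeated random 50/50 split can be sketched in Python as follows. The study itself was carried out in R; this sketch only illustrates the splitting logic, and the function names, the seed, and the sample size of 100 are assumptions for the example. The refinement and model-fitting steps are deliberately left out.

```python
import random


def split_half(n_students, rng):
    """Randomly split student indices into a refinement half and an estimation half."""
    indices = list(range(n_students))
    rng.shuffle(indices)
    half = n_students // 2
    return sorted(indices[:half]), sorted(indices[half:])


def repeated_splits(n_students, n_repeats=10, seed=2011):
    """Generate independent random splits, one per repetition."""
    rng = random.Random(seed)
    return [split_half(n_students, rng) for _ in range(n_repeats)]


def mean(values):
    """Average a list of fit-index values obtained across repetitions."""
    return sum(values) / len(values)


splits = repeated_splits(n_students=100)
for refine_idx, estimate_idx in splits[:2]:
    print(len(refine_idx), len(estimate_idx))
```

In each repetition the first half would be passed to the Q-matrix refinement and the second half to model estimation, and a fit index collected over the ten repetitions would then be summarized with `mean`.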
Handling missing data
Since Chiu’s method is unable to handle missing data, we needed to replace the missing data with substitute values. To this end, the common imputation method of Predictive Mean Matching (PMM; Little, 1988; Rubin, 1986) was used for each column of the data. The basic idea of PMM is that, for each missing value, the method forms a small set of candidate values from the observed data and assigns one of those observed values to the corresponding missing cell. By using PMM, all missing entries in the different datasets could be replaced. It is worth noting that most of the missing values in the datasets were in fact produced by the test design (i.e., booklet format), and such planned missing values are considered missing completely at random, so the analysis based on imputed data does not cause biased results (Little & Rubin, 2002). To make the estimated results reliable and the comparison of methods fair, the same imputed datasets were used for all subsequent Q-matrix comparisons (i.e., the expert-designed Q-matrix and the Q-matrices refined by the stepwise method and Chiu’s method).
Model evaluation criteria
A series of analyses were conducted within and across countries to answer the two research questions. Note that, due to overlap between the two questions, some of the results provide answers to both. To address the first research question, model fit was investigated within and across countries. Specifically, Akaike’s Information Criterion (AIC; Akaike, 1974) and the Bayesian Information Criterion (BIC; Schwarz, 1978) were used as relative fit indices; the limited-information version of the Root Mean Square Error of Approximation (RMSEA_{2}; Maydeu-Olivares & Joe, 2014; Ma & de la Torre, 2020a, b) and the Standardized Root Mean Square Residual (SRMSR; Liu et al., 2016; Maydeu-Olivares, 2013) were used as absolute fit indices. As a rule of thumb, lower AIC and BIC values indicate a better fit. Because clear guidelines for evaluating the G-DINA in terms of the RMSEA_{2} and SRMSR are still lacking in the current literature (von Davier & Lee, 2019), we chose to use “below 0.05” for SRMSR and “below 0.045” for RMSEA_{2} to indicate good model fit. These cutoffs are currently used in relevant research based on IRT models and log-linear cognitive diagnosis models (LCDM), respectively (Liu et al., 2016, 2017; Maydeu-Olivares et al., 2011). First, the best-fitting Q-matrix per country was determined by comparing the relative and absolute model fit criteria between the three different kinds of Q-matrices (i.e., original, stepwise, and Chiu’s Q-matrix) within each country. Then, mean rankings of the model fit indices obtained with the expert-designed Q-matrix and the two refined Q-matrices in the G-DINA were calculated across the five selected countries. The overall rank was calculated by (1) ranking the three Q-matrix types per country (e.g., 1 = best to 3 = worst for Finland), and then (2) averaging these ranks over the five countries. That is, the Q-matrix with the lowest mean rank provided the best model fit for the G-DINA analysis in general.
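The two-step ranking (rank within each country, then average over countries) can be sketched in Python as follows. The function names are hypothetical, ties are not handled, and the fit values are made up purely for illustration; they are not results from the study.

```python
def rank_lower_is_better(values):
    """Rank a dict of {q_matrix: fit value}; rank 1 goes to the lowest (best) value."""
    ordered = sorted(values, key=values.get)
    return {name: position + 1 for position, name in enumerate(ordered)}


def mean_ranks(fit_by_country):
    """Average the within-country ranks of each Q-matrix type over all countries."""
    totals = {}
    for values in fit_by_country.values():
        for name, rank in rank_lower_is_better(values).items():
            totals[name] = totals.get(name, 0) + rank
    return {name: total / len(fit_by_country) for name, total in totals.items()}


# Made-up fit values (lower is better), for illustration only.
illustrative_fit = {
    "Country A": {"original": 105.0, "stepwise": 100.0, "chiu": 102.0},
    "Country B": {"original": 210.0, "stepwise": 205.0, "chiu": 203.0},
}

print(mean_ranks(illustrative_fit))
```

With these invented numbers the stepwise and Chiu Q-matrices each obtain a mean rank of 1.5 and the original Q-matrix a mean rank of 3.0, so a lower mean rank signals a better fit across countries.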
Given that different refinement methods were evaluated and the best-fitting solutions were chosen for each country, the second research question focused on their impact on the interpretation of TIMSS. Student attribute mastery percentages were calculated and compared within and across countries to verify whether using different Q-matrices alters the diagnostic information estimated from the TIMSS assessment in general and for each country separately. The percentages were based on the attribute mastery status of each student as estimated by the G-DINA. Differences in student attribute mastery between the original and the refined Q-matrices larger than 10% were flagged as remarkable differences.
Next, we calculated degrees of (dis)agreement between the original and the two refined Q-matrices for all attributes. When the estimated student mastery matrices based on the two Q-matrices being compared classified a student in the same category (master (1) or nonmaster (0)), we considered this an agreement. When they classified a student in a different category (classifying a nonmaster as a master or a master as a nonmaster), we considered this a disagreement. The percentages of mastery/nonmastery agreement were calculated for each pair of Q-matrices for each country separately. As an example, suppose the original Q-matrix is compared to the Q-matrix refined by the stepwise method in terms of the mastery agreement rate for one attribute in a dataset with 100 students. The estimated student mastery matrix from the original Q-matrix can be compared to the one from the refined Q-matrix. To that end, the number of identical mastery entries is counted (for instance, giving a value of 80 if both approaches agreed that 80 students mastered this attribute). The number of differing mastery entries is counted as well (for instance, leading to a value of 20 if the original Q-matrix indicated that 20 students mastered this attribute but the refined Q-matrix suggested they did not). Then, the mastery agreement for this attribute between the two Q-matrices in this dataset would be 80% (i.e., 80/(80 + 20)). This calculation was applied to each country and each attribute in the following analysis. The results from the five countries were combined to give an overall picture for each attribute. The results averaged over the nine attributes were used to present a general overview of differences in student classification between different Q-matrices across countries.
Third, differences between Q-matrices concerning the interpretation of item parameter estimates, e.g., the intercept parameter that measures the baseline effect (i.e., the success probability without any required attribute), were scrutinized.
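The agreement calculation in the 100-student example can be reproduced with a short Python sketch. The function name `agreement_rate` and the two mastery vectors are hypothetical; the vectors are constructed only to match the 80/20 example in the text.

```python
def agreement_rate(mastery_a, mastery_b):
    """Percentage of students classified identically (0 or 1) by two Q-matrices."""
    if len(mastery_a) != len(mastery_b):
        raise ValueError("Both mastery vectors must cover the same students.")
    agreements = sum(1 for a, b in zip(mastery_a, mastery_b) if a == b)
    return 100.0 * agreements / len(mastery_a)


# 100 students: all are masters under the original Q-matrix,
# but 20 of them are nonmasters under the refined Q-matrix.
original_mastery = [1] * 100
refined_mastery = [1] * 80 + [0] * 20

print(agreement_rate(original_mastery, refined_mastery))
```

For these vectors the function returns 80.0, which corresponds to 80/(80 + 20) in the example.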
Results
In practice, Chiu’s method requires a condensation rule (i.e., “AND” or “OR”) to be set, as required by the relevant R packages. Both condensation rules were tried in the data analysis. Analyses based on the “OR” rule did not converge, so the following parts only present results based on the “AND” rule.
First, the expert-designed Q-matrix was compared to the different refined Q-matrices. The results suggest that the Q-matrices refined based on the combined five-country data are dissimilar to the expert-designed Q-matrix. The percentage of different entries is 26.59% for the stepwise method and 15.61% for Chiu’s method. Also, the Q-matrices refined based on each country’s data differ from the expert-designed Q-matrix, with the proportion of different entries ranging from 2.5% to 22.97%. Additionally, the refined Q-matrix based on the combined five-country data is dissimilar from the Q-matrix based on each country’s data for both the stepwise method and Chiu’s method, with the percentage of different entries ranging from 12.11% to 28.21%. Overall, the expert-designed Q-matrix differs from the refined Q-matrices, and those differences are worth exploring further.
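A percentage of different entries between two Q-matrices of the same size can be computed as in the Python sketch below. The function name and the small toy matrices are assumptions for illustration; it is also an assumption that the reported percentages were computed over all entries in this way. For a Q-matrix with 89 items and nine attributes there are 89 × 9 = 801 entries, so, for example, 213 differing entries would correspond to about 26.59%.

```python
def percent_different_entries(q_a, q_b):
    """Percentage of item-attribute entries that differ between two Q-matrices."""
    total = 0
    different = 0
    for row_a, row_b in zip(q_a, q_b):
        for entry_a, entry_b in zip(row_a, row_b):
            total += 1
            if entry_a != entry_b:
                different += 1
    return 100.0 * different / total


# Toy Q-matrices: three items (rows) by two attributes (columns).
expert_q = [[1, 0], [0, 1], [1, 1]]
refined_q = [[1, 0], [1, 1], [1, 0]]

print(round(percent_different_entries(expert_q, refined_q), 2))
print(round(100.0 * 213 / 801, 2))
```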
Goodness of fit
Goodness of fit within countries
Because Chiu’s method was applied under the conjunctive assumption, and in order to make the comparison of methods fairer, the interaction parameters of the G-DINA and the model comparison between the G-DINA and the DINA were scrutinized. We find that most of the interaction parameters in the G-DINA are neither zero nor close to zero when the expert-designed Q-matrix or the refined Q-matrix from the stepwise method or Chiu’s method is applied. Furthermore, the DINA and G-DINA were compared by means of model-fit indices and the likelihood ratio test based on each country’s data. The model-fit indices show that almost all estimates support the G-DINA across the five countries; only the estimated SRMSR for the Finland data favors the DINA. All likelihood ratio tests support the G-DINA. Hence, using the G-DINA for the model comparison is acceptable.^{Footnote 4}
Table 3 gives the average AIC, BIC, RMSEA_{2}, and SRMSR of the G-DINA for each refinement method per country. According to the relative fit indices, the Q-matrix refined by the stepwise method appears to be the best-fitting Q-matrix regardless of the criterion used for four out of the five selected countries (all except Tunisia). For Tunisia, the picture differs: the BIC value identifies the Q-matrix refined by the stepwise method as the most appropriate one, but the estimated values of the other indices support Chiu’s method. None of the model-fit indices supports the original expert-designed Q-matrix. Interestingly, none of the G-DINA models yielded a value lower than the cutoff values for good model fit of 0.045 (RMSEA_{2}) and 0.05 (SRMSR). Most of the values are between 0.06 and 0.12. Nevertheless, it is important to keep in mind that these cutoff values apply to LCDM and IRT models and have not yet been evaluated for the G-DINA. To make the conclusions of the methods comparison more solid, the combined five-country data were analyzed as well. The results indicate that fitting the G-DINA based on the Q-matrix refined by the stepwise method produces a better model fit than the others, and that the refined Q-matrices are always better than the expert-designed Q-matrix. Those findings are consistent with the results based on each country’s data.
Goodness of fit across countries
The mean ranking of relative and absolute model fit indices for the different Q-matrices can be found in Table 4. The relative fit indices identify the Q-matrix refined by the stepwise method as the most suitable one across countries (mean rank = 1.17), followed by the Q-matrix refined by Chiu’s method. Across all selected countries, the original Q-matrix is the least preferred. The absolute fit indices also suggest that the Q-matrix refined by the stepwise method is the best one (mean rank = 1.25), followed by the Q-matrix refined by Chiu’s method and the universal expert-designed Q-matrix, which is the same ordering as that based on the relative fit indices.
Student attribute mastery
Table 5 provides an overview of student mastery per attribute, per Q-matrix, and per country. Percentages in this table can be interpreted as follows: according to the original Q-matrix, 38.46% of the Finnish students have mastered the “Whole numbers and integers” attribute. To investigate differences between Q-matrix refinement approaches, we compared the percentages of the best-fitting Q-matrix according to the model-fit indices (i.e., the Q-matrix refined by the stepwise method; indicated in bold in Table 5, see Goodness of fit) with the percentages of the original Q-matrix. We considered a difference in student mastery percentages of 10% or more (relative to the original percentages) as a remarkable difference. These percentages are marked with an asterisk (*).
First, we find the most notable differences between the bestfitting and the original Qmatrix for Singapore. Six attributes show differences in attribute mastery larger than 10% when comparing the original with the stepwise method’s Qmatrix. Except for “Fractions, decimals and proportions”, “Data organisation, representation and interpretation”, and “Probability”, other attributes all show significant differences in percentages with a range from − 22.66% (“Expressions, equations and functions”; (47.93–61.97)/61.97 in Table 5) to + 33.46% (“Whole numbers and integers”). Second, four attributes show large differences in student attribute mastery between the stepwise and original Qmatrix in the case of the USA, Australia, and Tunisia. For the USA this encompasses the attributes: “Fractions, decimals and proportions” (− 34.11%), “Measurement” (− 20.62%), “Lines, angles and shapes” (− 38.52%), and “Data organization, representation and interpretation” (+ 11.89%). For Australia, it includes the attributes: “Expressions, equations and functions” (− 10.75%), “Patterns” (− 32.29%), “Measurement” (− 12.98%), and “Probability” (− 16.18%). The results of Tunisia data show remarkable differences for “Whole numbers and integers” (12.67%), “Patterns” (− 69.73%), “Lines, angles and shapes” (13.79%), and “Data organization, representation and interpretation” (+ 25.89%). Third, for Finland, large differences in student attribute mastery between the bestfitting (the stepwise Qmatrix) and the original Qmatrix are found for three out of nine attributes with a range from − 39.86% (“Fractions, decimals and proportions”) to − 15.39% (“Whole numbers and integers”).
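The relative differences reported here follow the calculation shown for Singapore, (47.93 − 61.97)/61.97. A minimal Python sketch of that calculation and of the 10% flagging rule is given below. The function names are hypothetical, and treating the threshold as inclusive follows the "10% or more" wording.

```python
def relative_change(original, refined):
    """Relative change in mastery percentage, in percent of the original value."""
    return (refined - original) / original * 100.0

def is_remarkable(original, refined, threshold=10.0):
    """Flag a difference whose absolute relative change is at least `threshold` percent."""
    return abs(relative_change(original, refined)) >= threshold

# Values quoted in the text for Singapore, "Expressions, equations and functions":
# 61.97% mastery under the original Q-matrix, 47.93% under the stepwise Q-matrix.
change = relative_change(61.97, 47.93)
print(round(change, 2), is_remarkable(61.97, 47.93))  # prints: -22.66 True
```

Because the change is expressed relative to the original percentage, the same absolute gap in percentage points counts for more when the original mastery rate is low.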
From Table 5 we can also derive remarkable student attribute mastery differences between the original and the Qmatrix of Chiu’s method across countries. Overall, we see a strong agreement between the original, stepwise, and Chiu Qmatrices. Yet, we find 21 remarkable differences between the original and the stepwise Qmatrix and 35 noticeable differences between the original and Chiu's Qmatrix across all countries and attributes. Although we find many differences between the original and the stepwise Qmatrix, we cannot distinguish tendencies for specific attributes. Whether the model based on the stepwise Qmatrix classifies more or less students as masters for a specific attribute depends on the country.
Overall classification of students as masters or nonmasters
Table 6 presents the range of agreement rates in the classification of students as masters or nonmasters of the nine attributes between the three Qmatrices (original, stepwise, and Chiu's) for the five selected countries. The mastery and nonmastery agreement rates, averaged over the nine attributes, give a general impression of how the three Qmatrices compare. First, the original Qmatrix has high agreement rates with the Qmatrix refined by the stepwise method: the mastery and nonmastery agreement rates are both over 80%. In contrast, agreement rates between Chiu's method and the other two Qmatrices are lower (mostly between 70 and 75%). Nevertheless, the three Qmatrices agree on the classification of most students as masters or nonmasters, with rates higher than 70%. In addition, we explored why the choice of Qmatrix affects these inferences. The cause could be overall Qmatrix misfit or differences across countries. To address this question, the Qmatrix refined on the combined fivecountries data was included in the comparison. The results in Appendix 4 indicate that both factors could contribute, but there is currently no clear pattern.
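To make the notion of an agreement rate concrete, the sketch below computes mastery and nonmastery agreement for a single attribute from two vectors of 0/1 classifications. The definition used here is an assumption made for illustration: the share of students classified as masters (or nonmasters) under a reference Qmatrix who receive the same classification under the other Qmatrix. The exact definition behind Table 6 may differ, and the classification vectors are toy data.

```python
def agreement_rates(reference, other):
    """Return (mastery %, nonmastery %) agreement for one attribute.

    `reference` and `other` are equal-length lists of 0/1 classifications
    (1 = master). Assumed definition: among students with a given status
    under `reference`, the percentage with the same status under `other`.
    Returns None for a rate whose denominator is zero.
    """
    if len(reference) != len(other):
        raise ValueError("classification vectors must have equal length")
    rates = []
    for status in (1, 0):
        matched = [o for r, o in zip(reference, other) if r == status]
        if matched:
            rates.append(100.0 * sum(o == status for o in matched) / len(matched))
        else:
            rates.append(None)
    return tuple(rates)

# Toy classifications of eight students on one attribute (not study data).
original_q = [1, 1, 1, 0, 0, 1, 0, 0]
stepwise_q = [1, 1, 0, 0, 0, 1, 1, 0]
print(agreement_rates(original_q, stepwise_q))  # prints: (75.0, 75.0)
```

Under this definition the rates are not symmetric in the two Qmatrices, so the choice of reference should be stated whenever such rates are reported.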
Estimated item parameters
Figure 2 shows five scatter plots (one per country) of the item parameter estimates of the GDINA, specifically the intercept estimates (\({\widehat{\delta }}_{j0}\)), obtained with the original Qmatrix (xaxis) and with the two refined Qmatrices (yaxis). Recall that the intercept parameter is the probability of correctly solving item \(j\) without mastering any of the required attributes (i.e., the baseline effect). In each plot, the estimates (dots) obtained with the stepwise method and with Chiu’s method are colored blue and red, respectively. Dots that fall on the straight line indicate that the estimates based on the original and the refined Qmatrix match exactly. The results show that the intercept estimates obtained with the stepwise method are more similar to the estimates based on the original Qmatrix than those obtained with Chiu’s method, especially for Finland, Singapore, Australia, and Tunisia. The estimates obtained with Chiu’s method (in red) are in general lower than those based on the original Qmatrix (most noticeably for Singapore, the USA, and Australia) and therefore deviate remarkably toward the bottom right corner. Interestingly, these findings align with the classification results (Table 6), where the agreement rates between the original Qmatrix and the stepwise Qmatrix are larger.
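The role of the intercept can be illustrated with the identity-link form of the GDINA item response function (de la Torre, 2011), in which the success probability is the sum of the intercept and of every main and interaction effect whose attributes are all mastered. The Python sketch below is a minimal illustration with made-up parameter values for a hypothetical item requiring two attributes; the function name and the dictionary layout are assumptions.

```python
def gdina_prob(alpha, deltas):
    """Success probability under the identity-link GDINA for one item.

    alpha  : tuple of 0/1 mastery indicators for the item's required attributes.
    deltas : maps a tuple of attribute positions to its effect; the empty
             tuple () holds the intercept (delta_j0).
    An effect contributes only if every attribute in its subset is mastered.
    """
    return sum(
        effect
        for subset, effect in deltas.items()
        if all(alpha[k] == 1 for k in subset)
    )

# Made-up parameters for a hypothetical two-attribute item (not estimates from the study).
toy_deltas = {(): 0.2, (0,): 0.3, (1,): 0.1, (0, 1): 0.3}

for pattern in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(pattern, round(gdina_prob(pattern, toy_deltas), 2))
```

For the pattern (0, 0) only the intercept contributes, which is why a lower intercept estimate implies a lower estimated chance of answering correctly without any of the required attributes.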
Discussion
In this study, we examined the impact of a misspecified expertdesigned Qmatrix for CDA in an application to international largescale data (TIMSS 2011 8th grade mathematics). Specifically, we examined whether refining the Qmatrix separately on each country's data yields different Qmatrices for the same assessment. First, the performance of the GDINA using refined Qmatrices was compared with its performance using the expertdesigned Qmatrix with regard to model fit criteria. Our investigation of the TIMSS data made clear that the original Qmatrix, which was designed by experts without regard to countries, failed to produce as good a model fit as the two refined Qmatrices obtained with the stepwise validation method (Ma & de la Torre, 2020a, b) or Chiu’s nonparametric classification method (Chiu, 2013). This finding justifies, to some extent, the use of countryspecific Qmatrices refined for each country separately. While both refinement methods have pros and cons (as mentioned in the "Qmatrix refinement methods" section) that researchers and practitioners must consider, we found for the TIMSS data that the stepwise method led to a better model fit than Chiu’s method across the countries. The Qmatrices refined by the stepwise method for the five selected countries are provided in Appendix 5. Next, we found that the GDINA with the stepwise Qmatrix produced results noticeably different from those obtained with the expertdesigned Qmatrix in terms of probabilities of attribute mastery, classification of students, and item parameter estimates. Furthermore, the impact of using the refined Qmatrices varied across countries.
Several of our findings merit further discussion. First, researchers and practitioners need to consider the advantages and disadvantages of using countryspecific Qmatrices. One possible advantage is that the expertdesigned (or original) Qmatrix is refined for each country separately. Therefore, any difference among countries in the refined outcomes (i.e., the number of attributes required for an item) could reflect a country’s unique instructional content. On the other hand, one disadvantage of using countryspecific Qmatrices for CDA in international comparison studies is that directly comparing student attribute mastery across countries may be unfair (or biased), because after the countryspecific refinement the Qmatrix applied in the fitted model differs from country to country. Another possible disadvantage concerns retrofitting CDMs and refining the Qmatrix: the design of largescale assessments may not satisfy the completeness of the Qmatrix or the identifiability conditions that are required for identifying proficiency classes and estimating model parameters (Köhn & Chiu, 2016). Overall, we believe that accounting for population heterogeneity in Qmatrix refinement is a relatively new and unexplored topic in the area; further research is needed on the appropriate use of refined Qmatrices in reallife assessments.
Notwithstanding the unique insights offered by the current study, there remain some limitations to be considered. First, the two Qmatrix refinement approaches are still contingent on the original Qmatrix designed by experts, because they only replace qentries. Therefore, altering the number of attributes specified in the Qmatrix is beyond the scope of the current study. Furthermore, it is important to note that when the qentries in the Qmatrix change, the definition and interpretation of the attributes may change to some degree. As a result, the bestfitting Qmatrix may not necessarily be interpretable or have practical value (Bradshaw et al., 2014; de la Torre, 2008). Moreover, no consensus exists in the current CDA and GDINA literature regarding the number and content of the attributes specified in the TIMSS mathematics Qmatrix. Researchers specify different Qmatrices for the same test and compare countries according to the predefined attributes (e.g., Im & Park, 2010; Park et al., 2017; Sedat & Arican, 2015). Therefore, a universal attribute design of the Qmatrix remains a critical issue within the current CDA literature (Groß et al., 2016).
Second, the original Qmatrix used in this study was established after the test items in TIMSS were calibrated with a unidimensional IRT model. Retrofitting CDMs to the data may result in an unbalanced Qmatrix in which some attributes are measured by considerably more items than others (Sedat & Arican, 2015). In this study, this was most pronounced for the “Probability” attribute, which was measured by only five items in the original Qmatrix. This imbalance can distort the attribute classification of students, because with a small number of items per attribute the responses to one or a few items can determine a student’s mastery of that attribute (Jurich & Bradshaw, 2014). Therefore, to increase the validity and reliability of student attribute mastery patterns estimated by CDMs, we recommend defining a relevant set of attributes first and then writing items that tap these attributes, instead of the other way around (Birenbaum et al., 2005; Bradshaw et al., 2014).
Third, this study is exploratory in nature, and some of the approaches used to investigate differences between Qmatrices are crude. For example, no reliability estimates of student attribute mastery (e.g., Sessoms & Henson, 2018) are provided, nor did we perform any significance tests to investigate differences in student attribute mastery between Qmatrices. In addition, we did not involve domain experts from the specific countries to examine whether the refined countryspecific Qmatrices recommended by the stepwise method and Chiu’s method are meaningful or better than the original universal Qmatrix. We want to stress that the datadriven results are sampledependent and that generalization to other countries or time points may not be warranted. Moreover, the refinement methods should be applied to sufficiently large datasets, but it is not yet clear when a sample size can be considered sufficiently large. Therefore, we recommend that the datadriven results be doublechecked with domain experts, or be used as ancillary information when the Qmatrix is discussed. Relying solely and blindly on the datadriven results is certainly not recommended.
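The balance of a Qmatrix can be checked before fitting any model by counting how many items measure each attribute, i.e., by taking column sums. The Python sketch below shows this for a small toy Qmatrix. The matrix, the function names, and the minimum of two items per attribute are illustrative assumptions, not values or recommendations from this study.

```python
def items_per_attribute(q_matrix):
    """Column sums of a binary Q-matrix: number of items measuring each attribute."""
    n_attributes = len(q_matrix[0])
    return [sum(row[k] for row in q_matrix) for k in range(n_attributes)]

def sparse_attributes(q_matrix, min_items):
    """Indices of attributes measured by fewer than `min_items` items."""
    counts = items_per_attribute(q_matrix)
    return [k for k, count in enumerate(counts) if count < min_items]

# Toy Q-matrix: 5 items (rows) by 3 attributes (columns); not the TIMSS Q-matrix.
toy_q = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
print(items_per_attribute(toy_q))   # prints: [3, 3, 1]
print(sparse_attributes(toy_q, 2))  # prints: [2]
```

Running the same check on each refined Qmatrix would show whether refinement reduces or aggravates the imbalance, since replacing qentries changes the column sums.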
Conclusions
This study provided useful insights concerning differences between Qmatrix refinement techniques, the countryspecificity of Qmatrices, and the consequences for practice. Findings from this study could help optimize the Qmatrix so CDMs (e.g., GDINA) can be more widely used to extract diagnostic attributelevel information out of international comparative tests. Together with the expertise of domain experts, teachers and policymakers could use this finegrained information to tailor their instruction to students’ specific weaknesses and link this diagnostic information to the existing curricula and instructional practices of their particular country. In this way, CDMs are a crucial diagnostic information source that could help improve education systems all over the world.
Availability of data and materials
The datasets analyzed during the current study are from TIMSS 2011 and are available at https://timssandpirls.bc.edu/timss2011/internationaldatabase.html.
Notes
An attribute can be defined as a “skill or content knowledge that is required to solve a test item.” (Choi et al., 2015).
The expertdesigned Qmatrix can be found in Appendix 3.
Relevant results are available at the following link: “https://github.com/supplementmaterial/Qmatrixpaper”.
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. https://doi.org/10.1109/TAC.1974.1100705
Baker, F. B. (2001). The basics of item response theory. Retrieved from http:///ericae.net.irt/baker.
Birenbaum, M., Tatsuoka, C., & Xin, T. (2005). Largescale diagnostic assessment: Comparison of eighth graders’ mathematics performance in the United States, Singapore and Israel. Assessment in Education: Principles, Policy & Practice, 12(2), 167–181. https://doi.org/10.1080/09695940500143852
Bradshaw, L., Izsák, A., Templin, J., & Jacobson, E. (2014). Diagnosing teachers’ understandings of rational numbers: Building a multidimensional test within the diagnostic classification framework. Educational Measurement: Issues and Practice, 33(1), 2–14. https://doi.org/10.1111/emip.12020
Chen, J. (2017). A residualbased approach to validate Qmatrix specifications. Applied Psychological Measurement, 41(4), 277–293. https://doi.org/10.1177/0146621616686021
Chiu, C. Y. (2013). Statistical refinement of the Qmatrix in cognitive diagnosis. Applied Psychological Measurement, 37(8), 598–618. https://doi.org/10.1177/0146621613488436
Choi, K. M., Lee, Y. S., & Park, Y. S. (2015). What CDM can tell about what students have learned: An analysis of TIMSS eighth grade mathematics. Eurasia Journal Mathematics, Science & Technology Education. https://doi.org/10.12973/eurasia.2015.1421a
de la Torre, J. (2008). An empirically based method of Qmatrix validation for the DINA model: Development and applications. Journal of Educational Measurement, 45(4), 343–362. https://doi.org/10.1111/j.1745-3984.2008.00069.x
de la Torre, J. (2009). DINA model and parameter estimation: A didactic. Journal of Educational and Behavioral Statistics, 34(1), 115–130. https://doi.org/10.3102/1076998607309474
de la Torre, J. (2011). The Generalized DINA model framework. Psychometrika, 76(2), 179–199. https://doi.org/10.1007/s11336-011-9207-7
de la Torre, J., & Chiu, C. Y. (2016). General method of empirical Qmatrix validation. Psychometrika, 81(2), 253–273. https://doi.org/10.1007/s11336-015-9467-8
Desmarais, M. C., & Naceur, R. (2013). A matrix factorization method for mapping items to skills and for enhancing expertbased qmatrices. In: International Conference on Artificial Intelligence in Education (pp. 441–450). Berlin: Springer.
Groß, J., Robitzsch, A., & George, A. C. (2016). Cognitive diagnosis models for baseline testing of educational standards in math. Journal of Applied Statistics, 43(1), 229–243. https://doi.org/10.1080/02664763.2014.1000841
Hagenaars, J. A., & McCutcheon, A. L. (2002). Applied latent class analysis. Cambridge University Press.
Im, S., & Park, H. J. (2010). A comparison of US and Korean students’ mathematics skills using a cognitive diagnostic testing method: Linkage to instruction. Educational Research and Evaluation, 16(3), 287–301. https://doi.org/10.1080/13803611.2010.523294
Jia, B., Zhu, Z., & Gao, H. (2021). International comparative study of statistics learning trajectories based on PISA data on Cognitive Diagnostic Models. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2021.657858
Johnson, M.S., Lee, Y.S., Park, J.Y., Zhang, Z., & Sachdeva, R. (2013). Comparing attribute distribution across countries: Application to TIMSS 2007 mathematics. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.
Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric Item Response Theory. Applied Psychological Measurement, 25(3), 258–272. https://doi.org/10.1177/01466210122032064
Jurich, D. P., & Bradshaw, L. P. (2014). An illustration of diagnostic classification modeling in student learning outcomes assessment. International Journal of Testing, 14(1), 49–72. https://doi.org/10.1080/15305058.2013.835728
Köhn, H. F., & Chiu, C. Y. (2016). A procedure for assessing the completeness of the Qmatrices of cognitively diagnostic tests. Psychometrika, 82(1), 112–132. https://doi.org/10.1007/s11336-016-9536-7
Köhn, H. F., & Chiu, C. Y. (2018). How to build a complete Qmatrix for a cognitively diagnostic test. Journal of Classification, 35(2), 273–299. https://doi.org/10.1007/s00357-018-9255-0
Little, R. J. (1988). Missingdata adjustments in large surveys. Journal of Business & Economic Statistics, 6(3), 287–296. https://doi.org/10.2307/1391878
Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). Wiley.
Liu, J. (2015). On the consistency of Qmatrix estimation: A commentary. Psychometrika, 82(2), 523–527. https://doi.org/10.1007/s11336-015-9487-4
Liu, R., HugginsManley, A. C., & Bulut, O. (2017). Retrofitting diagnostic classification models to responses from IRTbased assessment forms. Educational and Psychological Measurement, 78(3), 357–383. https://doi.org/10.1177/0013164416685599
Liu, Y., Andersson, B., Xin, T., Zhang, H., & Wang, L. (2019). Improved Wald statistics for itemlevel model comparison in diagnostic classification models. Applied Psychological Measurement, 43(5), 402–414. https://doi.org/10.1177/0146621618798664
Liu, Y., Tian, W., & Xin, T. (2016). An application of M2 statistic to evaluate the fit of cognitive diagnostic models. Journal of Educational and Behavioral Statistics, 41(1), 3–26. https://doi.org/10.3102/1076998615621293
Ma, W., & de la Torre, J. (2020a). GDINA: An R package for cognitive diagnosis modeling. Journal of Statistical Software, 93(14), 1–26. https://doi.org/10.18637/jss.v093.i14
Ma, W., & de la Torre, J. (2020b). An empirical Qmatrix validation method for the sequential generalized DINA model. British Journal of Mathematical and Statistical Psychology, 73(1), 142–163. https://doi.org/10.1111/bmsp.12156
MaydeuOlivares, A. (2013). Goodnessoffit assessment of item response theory models. Measurement: Interdisciplinary Research & Perspective, 11(3), 71–101. https://doi.org/10.1080/15366367.2013.831680
MaydeuOlivares, A., Cai, L., & Hernández, A. (2011). Comparing the fit of item response theory and factor analysis models. Structural Equation Modeling: A Multidisciplinary Journal, 18(3), 333–356. https://doi.org/10.1080/10705511.2011.581993
MaydeuOlivares, A., & Joe, H. (2014). Assessing approximate fit in categorical data analysis. Multivariate Behavioral Research, 49(4), 305–328. https://doi.org/10.1080/00273171.2014.911075
Mullis, I. V., Martin, M. O., Foy, P., & Arora, A. (2012). TIMSS 2011 international results in mathematics. International Association for the Evaluation of Educational Achievement (IEA). Amsterdam: IEA Secretariat.
Nájera, P., Sorrel, M. A., & Abad, F. J. (2019). Reconsidering cutoff points in the general method of empirical Qmatrix validation. Educational and Psychological Measurement, 79(4), 727–753. https://doi.org/10.1177/0013164418822700
Park, J. Y., Lee, Y. S., & Johnson, M. S. (2017). An efficient standard error estimator of the DINA model parameters when analyzing clustered data. International Journal of Quantitative Research in Education, 4(1/2), 244–264. https://doi.org/10.1504/ijqre.2017.10007548
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Ravand, H., & Robitzsch, A. (2015). Cognitive diagnostic modeling using R. Practical Assessment, Research & Evaluation, 20(1), 11. https://doi.org/10.7275/5g6fak15
Rubin, D. B. (1986). Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business & Economic Statistics, 4(1), 87–94. https://doi.org/10.2307/1391390
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. Guilford Press.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136
Sedat, ŞE. N., & Arican, M. (2015). A diagnostic comparison of Turkish and Korean students’ mathematics performances on the TIMSS 2011 assessment. Eğitimde Ve Psikolojide Ölçme Ve Değerlendirme Dergisi, 6(2), 238–253. https://doi.org/10.21031/epod.65266
Sessoms, J., & Henson, R. A. (2018). Applications of diagnostic classification models: A literature review and critical commentary. Measurement: Interdisciplinary Research and Perspectives, 16(1), 1–17. https://doi.org/10.1080/15366367.2018.1435104
Tatsuoka, K. K. (1984). Analysis of errors in fraction addition and subtraction problems. Final Report. Retrieved from University of Illinois, ComputerBased Education Research Lab website: https://files.eric.ed.gov/fulltext/ED257665.pdf.
Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11(3), 287–305. https://doi.org/10.1037/1082-989X.11.3.287
Terzi, R., & de la Torre, J. (2018). An iterative method for empiricallybased Qmatrix validation. International Journal of Assessment Tools in Education, 5(2), 248–262. https://doi.org/10.21449/ijate.407193
von Davier, M., & Lee, Y. S. (2019). Handbook of diagnostic classification models: Models and model extensions, applications, software packages. Springer Publishing.
Wang, W., Song, L., Ding, S., Meng, Y., Cao, C., & Jie, Y. (2018). An EMbased method for Qmatrix validation. Applied Psychological Measurement, 42(6), 446–459. https://doi.org/10.1177/0146621617752991
Wu, X., Wu, R., Chang, H. H., Kong, Q., & Zhang, Y. (2020). International comparative study on PISA mathematics achievement test based on cognitive diagnostic models. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2020.02230
Zheng, Y., Chiu, C.Y., & Douglas, J. (2019). NPCD: Nonparametric methods for cognitive diagnosis; R Package Version 1.0–11. https://CRAN.R-project.org/package=NPCD
Acknowledgements
Not applicable.
Funding
The authors received no specific funding for this work.
Author information
Contributions
JD and CC contributed equally to this work as first authors. They prepared the initial draft of the analyses and manuscript. JYP and WVN supervised the study and revised and commented on the draft. All authors read and approved the final manuscript.
Authors’ information
Jolien Delafontaine is a PhD student at the Faculty of Psychology and Educational Sciences, Parenting and Special education Unit of the KU Leuven. Her doctoral research focuses on effective teaching for students with special educational needs (SEN).
Changsheng Chen is a PhD student at the Faculty of Psychology and Educational Sciences, and the imec research group itec of the KU Leuven. His doctoral research focuses on learning analytics.
Jung Yeon Park is an assistant professor of quantitative research methods at George Mason University. Her research focuses on cognitive diagnosis models, largescale educational assessments, and learning analytics.
Wim Van den Noortgate is a professor of statistics at the Faculty of Psychology and Educational Sciences, and the imec research group itec of the KU Leuven. His major interests include learning analytics and metaanalysis.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Appendices
Appendix 1: Flowcharts of the Stepwise Method and Chiu’s Method
Appendix 2: Attributes of the Original Qmatrix
See Table 7.
Appendix 3: Qmatrix for 89 selected items of TIMSS 2011
See Table 8.
Appendix 4: Mastery/Nonmastery Agreement Rates (%) between each pair of Qmatrices
See Table 9.
Appendix 5: Refined Qmatrix for five selected countries based on the stepwise method
See Tables 10, 11, 12, 13, 14.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Delafontaine, J., Chen, C., Park, J.Y. et al. Using countryspecific Qmatrices for cognitive diagnostic assessments with international largescale data. Largescale Assess Educ 10, 19 (2022). https://doi.org/10.1186/s40536-022-00138-4
DOI: https://doi.org/10.1186/s40536-022-00138-4
Keywords
 GDINA
 Qmatrix refinement
 Stepwise validation method
 Nonparametric classification method
 TIMSS 2011 mathematics
 International comparison