Comparison of disengagement levels and the impact of disengagement on item parameters between PISA 2015 and PISA 2018 in the United States
Large-scale Assessments in Education volume 11, Article number: 4 (2023)
Examinees may not make enough effort when responding to test items if the assessment has no consequence for them. These disengaged responses can be problematic in low-stakes, large-scale assessments because they can bias item parameter estimates. However, the amount of bias, and whether this bias is similar across administrations, is unknown. This study compares the degree of disengagement (i.e., fast and non-effortful responses) and the impact of disengagement on item parameter estimates in the Programme for International Student Assessment (PISA) across the 2015 and 2018 administrations.
We detected disengaged responses at the item level based on response times and response behaviors. We used data from the United States and analyzed 51 computer-based mathematics items administered in both PISA 2015 and PISA 2018. We compared the percentage of disengaged responses and the average scores of the disengaged responses for the 51 common items. We filtered disengaged responses at the response- and examinee-levels and compared item difficulty (P+ and b) and item discrimination (a) before and after filtering.
Our findings suggested that there were only slight differences in the amount of disengagement in the U.S. results for PISA 2015 and PISA 2018. In both years, the amount of disengagement did not exceed 5.2%, and the average scores of disengaged responses were lower than the average scores of engaged responses. We did not find any serious impact of disengagement on item parameter estimates when we applied response-level filtering; however, we found some bias, particularly on item difficulty, when we applied examinee-level filtering.
This study highlights differences in the amount of disengagement in PISA 2015 and PISA 2018 as well as the implications of the decisions made for handling disengaged responses on item difficulty and discrimination. The results of this study provide important information for reporting trends across years.
Disengagement, which is defined as providing or omitting responses to test items without making an adequate effort, is a problem for many tests. Examinees’ true ability cannot be understood from scores if they do not exert sufficient effort to solve the items. Consequently, the interpretation of the test scores may be inappropriate, and the validity of the inferences based on these scores would deteriorate (Wise, 2017). The issue of disengagement can be particularly problematic in low-stakes assessments where scores do not have any consequence on the examinees. Examinees may provide disengaged responses due to a lack of motivation (Wise, 2015). As such, disengagement has been recognized as a problem for many low-stakes, large-scale assessments, such as the Programme for International Student Assessment (PISA) and the National Assessment of Educational Progress (NAEP).
Disengagement has been found to bias the estimation of examinees’ ability (Wise, 2015; Wise & Kingsbury, 2016) and item parameters (Wise & DeMars, 2006; Yamamoto, 1995). Although studies have examined differences in disengagement between students, schools, and countries (Debeer et al., 2014; Rios & Guo, 2020), they have not done so across years. This is an important gap, because many large-scale assessments, such as PISA and NAEP, report score trends between years across countries and subgroups of students. If the degree of disengagement differs across administrations, the impact of disengagement on item parameter estimates can also vary across administrations, and this would lead to an issue of comparability of scores and, therefore, of the score trends reported for countries and subgroups. This study addresses this gap in the literature by investigating the level of disengagement (i.e., the percentage of responses or examinees detected as disengaged for each item) in items administered in both PISA 2015 and PISA 2018 and the impact of disengagement on item parameter estimates.
Most of the literature on disengagement has focused on one type of disengagement, namely rapid guessing on multiple-choice single-select (MCSS) items. Recently, Sahin and Colvin (2020) broadened the conceptualization of disengagement to include rapid guesses on MCSS items as well as rapid omits and rapid-irrelevant responses to constructed-response items, coining the term rapid disengagement for this broader concept, which covers different item types (e.g., constructed-response items) and response decisions (e.g., no response). Because there are multiple item types in PISA, we followed the broader conceptualization of disengagement in this study and used the term “disengagement” in the same sense as “rapid disengagement.”
In this section, we first discuss the relationship between disengagement and item parameters and then review approaches for detecting disengagement.
Disengagement and item parameters
Findings on whether disengagement differs by item are mixed. Schnipke and Scrams (1997) claimed that rapid guessing was essentially the same across items. However, Goldhammer et al. (2017) indicated that disengaged responses were provided more commonly in response to difficult items than to easy items in the Programme for the International Assessment of Adult Competencies (PIAAC). This suggests that the bias stemming from disengaged responses can differ from item to item.
Many studies have examined the impact of disengagement on bias in the estimation of item parameters. In a meta-analysis, Rios and Deng (2021) investigated 53 studies that used different criteria for classifying examinees as disengaged and produced effect sizes for how much bias disengagement introduced to item parameters. Rios and Deng made three key observations: (a) studies typically investigated the impact of disengagement on item difficulty, leaving out item discrimination; (b) the different methods used to detect disengagement resulted in differences in the number of disengaged examinees detected; and (c) these differences were not associated with statistically significant differences in average item difficulty after applying motivation filtering, a term coined by Sundre and Wise (2003) for the removal of disengaged responses from the data.
As for specific examples of research on the impact of disengagement, Yamamoto (1995) found that 30% of examinees omitted and rapid-guessed on one-third of the items in a simulation study, resulting in changes in both item discrimination and item difficulty parameters when they were estimated using a two-parameter logistic (2PL) item response theory (IRT) model. However, Yamamoto (1995) did not observe a clear pattern in how the omitted and rapid-guessed responses influenced the item parameters. Item discrimination and item difficulty parameters increased dramatically in some items but decreased in others. Similarly, Wise and DeMars (2006) compared the original and estimated values of item parameters when rapid-guessing was present in 2.3%, 6.7%, and 11.3% of the responses. They found that rapid-guessing led to overestimation of both item difficulty and item discrimination when a three-parameter logistic (3PL) IRT model was used. Wise et al. (2006) labeled approximately 11–53% of examinees as disengaged in five different assessments. They found that mean test scores increased but that standard deviations decreased after filtering out the disengaged examinees. Bovaird and Embretson (2006) found that item discrimination decreased significantly, but that item difficulty increased significantly, in a 2PL IRT model after applying motivation filtering. While these studies highlight the possible relationship between disengaged responses and item parameters, they do not indicate how much item disengagement is sufficient to cause significant bias in item parameters. The stability of these differences across administrations is also unknown.
Approaches to identifying disengagement
A few statistical approaches have been developed for identifying disengagement. Most use variables derived from process data, that is, the accumulated records of examinees’ clicks and keystrokes while they take computer-based tests. Identifying disengagement at the item level (i.e., item-level disengagement) typically requires establishing thresholds on variables such as response time (i.e., total time spent on an item). If the response time associated with a response is at or below the threshold for a specific item, the response is labeled as disengaged; if it is above the threshold, the response is labeled as engaged.
In a meta-analysis, Rios and Deng (2021) found that the choice of the response time threshold was associated with nonnegligible differences in the number of responses and examinees identified as disengaged. Therefore, we believe it is useful to outline some of the most common ways to set response time thresholds and how they are used with other variables. The item-level detection approaches described in the literature fall into one of three categories: (a) response time only, (b) response time and accuracy, and (c) response time and response behaviors.
Response time only
This approach requires defining a response time threshold, which corresponds to the minimum response time needed for an examinee to provide an engaged response (e.g., Wise & Kong, 2005). Kong et al. (2007) outlined four ways to specify a threshold: (a) the Common Threshold Method, which proposes a constant threshold (e.g., three seconds) for all of the items on the test; (b) the Reading Time Method, which estimates the time needed to read an item from surface features, such as the number of characters in the item and any ancillary reading material; (c) the Visual Spike Method, which inspects the response time distribution visually and sets the threshold at the endpoint of an early spike in a bimodal response time distribution; and (d) the Mixture Model-Based Method, which fits the response time distribution of an item to a finite mixture model and sets the threshold based on the best-fitting model. Wise and Ma (2012) introduced a fifth method, namely the Normative Threshold Method, which defines the threshold as a certain percentage of the average item response time of all examinees. Wise and Ma (2012) found that a threshold set at 10% of the average response time, with a maximum value of 10 seconds, best distinguished rapid guessing from solution behavior compared to other percentages studied. One caveat of response-time-only methods is that they can misclassify fast-thinking test takers as disengaged (Wise, 2017). To overcome this shortcoming, researchers have proposed methods to detect disengagement that use response times in conjunction with other variables.
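As a minimal sketch (in Python; the function and parameter names are ours, not from any PISA toolchain), the Normative Threshold Method reduces to a one-line rule:

```python
def normative_threshold(response_times, pct=0.10, cap=10.0):
    """Normative Threshold Method (Wise & Ma, 2012): set the threshold
    at a percentage of the average item response time (in seconds),
    capped at a maximum value."""
    mean_rt = sum(response_times) / len(response_times)
    return min(pct * mean_rt, cap)

# Responses at or below the returned threshold would be flagged as rapid.
```

The cap prevents very long average response times (e.g., on complex constructed-response items) from producing implausibly high thresholds.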
Response time and accuracy
Using both response time and response accuracy is an alternative approach to setting the response time threshold (Guo et al., 2016; Lee & Jia, 2014; Ma et al., 2011). The first step is to compute the proportion correct conditional on response time for each item (Ma et al., 2011). Then, the threshold is set at the first response time corresponding to a proportion correct that is greater than the random chance level (i.e., 25% for an MCSS item with four options). One caveat of this method is that the response accuracy associated with rapid guessing can be significantly different from random chance (Wise, 2017; Wise & Ma, 2012). Sahin and Colvin (2020) reported that the probability of a correct response is zero for a rapid response to constructed-response items. Similarly, the probability of a correct response is zero for a rapid omit to any kind of item (Sahin & Colvin, 2020). Thus, response accuracy cannot be used to identify rapid guessing at the random-chance level for item types other than MCSS, nor to identify rapid-omit behaviors.
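The conditional-accuracy idea of Ma et al. (2011) can be sketched as follows, assuming responses are grouped into fixed-width time bins; the function name and the one-second binning granularity are illustrative choices, not taken from the original study:

```python
from collections import defaultdict

def accuracy_based_threshold(times, correct, chance=0.25, bin_width=1.0):
    """Set the response-time threshold at the start of the first time bin
    whose proportion correct exceeds the random-chance level."""
    bins = defaultdict(list)
    for t, c in zip(times, correct):
        bins[int(t // bin_width)].append(c)
    for b in sorted(bins):                        # scan bins fastest-first
        if sum(bins[b]) / len(bins[b]) > chance:
            return b * bin_width                  # first above-chance bin
    return 0.0                                    # no bin exceeds chance
```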
Response time and response behaviors
The use of response time and response behaviors, which Sahin and Colvin (2020) referred to as the “enhanced” method, has been shown to be a better approach to detecting examinees who display disengagement. Specifically, this method detects disengagement using the number and type of response behaviors (e.g., keypresses, clicks, and clicking interactive tools), which are derived from the process data, in addition to response time. To apply this approach, two thresholds are set jointly for an item: one for response time and one for the number of response behaviors. The threshold for the number of response behaviors specifies the maximum number of actions that exhibits no or minimal engagement. If an examinee responds to an item in a time at or below the response time threshold and performs a number of actions at or below the response behaviors threshold for that item, that response is flagged as disengaged. Sahin and Colvin (2020) used a constant value as the threshold for the number of actions for all of the items under investigation and suggested that the distribution of the number of actions could be used to set the threshold in future research.
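The enhanced method's joint flagging rule can be sketched in a few lines (names are ours; the two thresholds are assumed to have been set per item beforehand):

```python
def is_disengaged(response_time, n_actions, time_threshold, actions_threshold):
    """Enhanced method (Sahin & Colvin, 2020): a response is flagged as
    disengaged only when it is both fast (at or below the response-time
    threshold) and low-activity (at or below the actions threshold)."""
    return response_time <= time_threshold and n_actions <= actions_threshold
```

Requiring both conditions is what distinguishes this method from response-time-only approaches: a fast response accompanied by substantial interaction is not flagged.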
The aim of this study is to compare the degree of disengagement and the impact of disengagement on item parameter estimates in low-stakes, large-scale assessments across administrations. To achieve this goal, we investigated differences between the prevalence and impact of disengagement in the 2015 and 2018 administrations of PISA. The research questions are:
How much does the percentage of disengagement differ between the items common to PISA 2015 and PISA 2018?
How much do the scores of disengaged responses differ between the items common to PISA 2015 and PISA 2018?
How much do estimates of item difficulty and item discrimination, with and without disengagement, change between 2015 and 2018?
In this section, we will first introduce the data used in this study and then discuss the analyses conducted, step by step, from detecting disengagement to comparing the percentage of disengagement in PISA 2015 and PISA 2018; comparing scores for disengaged responses in PISA 2015 and PISA 2018; and comparing weighted item parameter estimates with and without disengagement in PISA 2015 and PISA 2018.
The Programme for International Student Assessment (PISA) was developed by the Organisation for Economic Co-operation and Development (OECD) to monitor student performance and provide comparative indicators of education systems across the world (OECD, 2000). It is administered every three years to 15-year-old students in more than 70 countries in reading, mathematics, and science. In each cycle, PISA focuses on one of these subjects, and the other two subjects are administered as minor assessment areas for trend purposes. In this study, we analyzed mathematics data from U.S. students in the two most recent PISA administrations, conducted in 2015 and 2018. Specifically, we analyzed 51 items common to both the 2015 and 2018 administrations. These items were distributed in various blocks due to matrix sampling.
The total number of U.S. students who participated in the mathematics assessment was 5712 in 2015 and 4838 in 2018, including 2854 (50%) female and 2858 (50%) male students in 2015, and 2376 (49.1%) female and 2462 (50.9%) male students in 2018. The majority of the students were in grade 10 (73.7% in 2015 and 74.4% in 2018), about 10% were 9th graders or lower (9.5% in 2015 and 8.5% in 2018), and about 15% were 11th graders or higher (16.8% in 2015 and 17.2% in 2018). In both years, 90% of the students had not repeated a grade, and the remaining 10% had repeated a grade. Because each block of test items is administered to randomly equivalent samples of students, it is reasonable to assume that the overall size, demographic composition, and ability level of the analytical sample are similar for each item. The sample size for each item ranged from 643 to 736 in 2015 and from 767 to 833 in 2018. For analyses of the weighted P+ and item parameter estimates, we used the final student weight (“W_FSTUWT”) variable, which is available in the public-use datasets.
A computer-based assessment (CBA) was the main mode of assessment in both PISA 2015 and 2018. In this study, we used three variables that are available in the PISA public-use datasets in both 2015 and 2018: Total Time, Number of Actions, and Scored Response. Total Time is a continuous variable derived from the process data for each item that specifies the total amount of time that each student spent on the items. Number of Actions is another continuous variable derived from the process data for each item that specifies the number of steps each student took before giving their final responses (OECD, 2017). Scored Response is a categorical variable with six categories: 0 = No Credit, 1 = Full Credit, 6 = Not Reached, 7 = Not Applicable, 8 = Invalid, and 9 = No Response (OECD, 2017, p. 198). None of the responses for the 51 items included in this study were in category 7 (Not Applicable) or category 8 (Invalid). In order to analyze the average score, we recoded category 6 (Not Reached) responses to missing values. Omitted responses (category 9, No Response) were recoded to 0, following the same coding procedure for missing scores that PISA uses in its own methodology (OECD, 2017, p. 149).
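The recoding described above can be sketched as follows (a simplified Python illustration of our preparation of the Scored Response variable; PISA's own processing pipeline is not public code):

```python
def recode_score(scored_response):
    """Recode PISA Scored Response categories for score analyses:
    6 (Not Reached) -> missing; 9 (No Response) -> 0 (incorrect).
    Categories 7 and 8 did not occur among the 51 items studied."""
    if scored_response == 6:   # Not Reached -> treated as missing
        return None
    if scored_response == 9:   # No Response (omitted) -> scored incorrect
        return 0
    return scored_response     # 0 = No Credit, 1 = Full Credit
```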
We followed the enhanced method (Sahin & Colvin, 2020) to detect disengagement due to the limitations of the other methods, as discussed above. The enhanced method detects responses that are more likely to represent disengagement and covers all of the types of disengagement that are likely to be present in the PISA items: rapid guessing to MCSS items, rapid omitting, and rapid-irrelevant responses to constructed-response items. To apply this method, we set thresholds for Total Time and for Number of Actions. The remainder of this section provides detailed information on the steps that we took: (1) establishing Total Time thresholds for each item, (2) establishing Number of Actions thresholds for each item, (3) comparing the percentage of disengagement between 2015 and 2018, (4) comparing the scores of disengaged responses between 2015 and 2018, (5) removing the responses identified as disengaged from the analytical sample (i.e., motivation filtering), and (6) comparing the weighted item parameters (P+, a, and b) before and after applying motivation filtering in 2015 and 2018.
Establishing total time thresholds
Among the various threshold-setting methods suggested in the literature, we utilized the Normative Threshold Method (Wise & Ma, 2012). It suggests setting the response time threshold for an item at 10% of the average response time, with a maximum value of 10 seconds. For each item in this study, the average response time was computed based on the Total Time variable. Any value larger than 10 minutes for a single item was considered an outlier and excluded from the computation of the average response time, since examinees were expected to complete the test within 60 minutes.
For comparison purposes, we set a common threshold value for the same item in 2015 and 2018. To achieve this, we first checked and confirmed that the Total Time variables for each item in 2015 and 2018 had the same distribution. Specifically, we inspected whether the distributions were similar based on their minimum, maximum, mean, and mode values, which also reflect the location of the peak and the overall shape. We did not observe a bimodal distribution for many items, which is another reason for our decision to use the Normative Threshold Method. Then, we merged the Total Time variables in 2015 and 2018 for each item and computed 10% of the average response time, with a maximum of 10 seconds, as the common threshold.
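Putting the two steps together, the common-threshold computation can be sketched as follows (illustrative Python; 600 seconds corresponds to the 10-minute outlier cutoff described above):

```python
def common_time_threshold(times_2015, times_2018,
                          pct=0.10, cap=10.0, outlier=600.0):
    """Common Total Time threshold for an item across years: drop
    outliers (> 10 minutes), pool both administrations, then take 10%
    of the pooled mean response time, capped at 10 seconds."""
    pooled = [t for t in times_2015 + times_2018 if t <= outlier]
    return min(pct * (sum(pooled) / len(pooled)), cap)
```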
Establishing number of actions thresholds
Because the minimum number of interactions needed to provide an effortful response can vary substantially across the items in our data, we set the Number of Actions thresholds by adapting the Visual Spike idea developed for response time (Wise & Kong, 2005) to the number of actions. When we inspected the distribution of the Number of Actions variables, we observed a unimodal distribution. Therefore, we specified the threshold at the beginning of the spike, assuming that this value represents the minimum number of actions that effortful students take. Given that each character entry or click is counted as an action, 200 actions represented a relatively lengthy response (e.g., 50 words) to constructed-response items, and any number of actions greater than 200 was considered an outlier and not included in plotting the distribution.
For comparison purposes, we set a common Number of Actions threshold value for the same item in 2015 and 2018. The first step in this procedure was to inspect the distribution of the Number of Actions variable for each item in 2015 and 2018. While the shape of the distribution is similar in both years, the values for the Number of Actions variable in 2018 were one less than the corresponding values in 2015. Confirming our observation, we learned that clicking the “next” button to move to the next item in a unit or to the next unit in the test was counted in computing the total number of actions in 2015 but not in 2018 (M. Ikeda, personal communication, August 20, 2020). To obtain the same scale for this variable, we added the value of “one” to the Number of Actions variable in 2018 and then merged the variables in 2015 and 2018 for each item. We then plotted the variables based on the merged data for each item and set the value at the beginning of the spike in the distribution as the threshold.
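The rescaling step can be sketched as follows (illustrative Python; the function name is ours):

```python
def merged_actions(actions_2015, actions_2018):
    """Align the Number of Actions variable across years: the 'next'
    click was counted in 2015 but not in 2018, so add 1 to each 2018
    value before merging the two years for threshold setting."""
    return actions_2015 + [a + 1 for a in actions_2018]
```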
Comparing the percentage of disengagement
To answer the first research question, we detected disengaged responses based on the Total Time threshold and Number of Actions threshold for each item. Then we compared the percentage of disengagement for each item between administration years.
Comparing the scores of disengaged responses
To answer the second research question, we first compared the scores for disengaged responses and engaged responses separately in each year. Then we compared the differences between the scores for the engaged and disengaged responses across years.
Removing disengagement from the data is termed motivation filtering, and it requires treating the data points associated with disengagement as missing. However, there is no consensus on how motivation filtering should be applied. While some studies remove only the responses that are detected as disengaged, other studies remove all of an examinee’s responses if any response from that examinee is associated with disengagement. Rios et al. (2017) coined the term response-level filtering to refer to the first type of motivation filtering and examinee-level filtering to refer to the latter. Wise (2009) used the term rapid-response filtering to refer to response-level filtering.
In this study, we used both response-level and examinee-level filtering to understand the role of the filtering method in examining the impact of disengagement on item parameters. When applying examinee-level filtering in this study, all of the responses from students who provided a disengaged response to at least one item were removed. In total, 145 students, or 3% of the sample, were removed in 2015 when examinee-level filtering was applied. Similarly, 242 students, or 4.23% of the sample, were removed in 2018 when examinee-level filtering was applied.
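The two filtering strategies can be sketched as follows (illustrative Python; response records are represented as dictionaries for clarity, which is our simplification rather than the actual data format):

```python
def response_level_filter(responses):
    """Response-level filtering: treat only flagged responses as missing."""
    return [r for r in responses if not r["disengaged"]]

def examinee_level_filter(responses):
    """Examinee-level filtering: drop every response from any student
    who gave at least one disengaged response."""
    flagged = {r["student"] for r in responses if r["disengaged"]}
    return [r for r in responses if r["student"] not in flagged]
```

Examinee-level filtering always removes at least as much data as response-level filtering, which is consistent with the larger parameter shifts we observed under it.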
Comparing the impact of disengagement on weighted item parameters
To answer the third research question, we compared item parameter estimates computed under both classical test theory (CTT) and IRT. First, we took the subset of students’ scores to all 51 items common to both the 2015 and 2018 administrations of PISA. We then computed item difficulty (i.e., the proportion of correct responses, P+) following the same coding procedure used for missing scores in PISA, where omitted responses are scored as incorrect. We compared the value for P+ before and after applying response- and examinee-level filtering.
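Although our analyses were run in R, the weighted P+ computation can be illustrated in a few lines of Python (the function name is ours):

```python
def weighted_p_plus(scores, weights):
    """Weighted proportion correct (P+) using the final student weight
    (W_FSTUWT). Omitted responses are already recoded to 0 (incorrect)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
```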
Next, we conducted the national item calibration by computing the IRT item parameter estimates for the items administered in the United States in 2015 and 2018. Consistent with the technical procedures followed in PISA, a 2PL model was used for the binary items and a generalized partial credit model (GPCM) for the polytomous item responses (OECD, 2017, 2022). The latent trait (θ) was assumed to be normally distributed, and the mean was set to 0 and the variance to 1 to identify the models. The 51 common mathematics items were scaled separately for each of six conditions: (1) before any filtering in 2015, (2) after applying response-level filtering in 2015, (3) after applying examinee-level filtering in 2015, (4) before any filtering in 2018, (5) after applying response-level filtering in 2018, and (6) after applying examinee-level filtering in 2018. We used the “mirt” package (Chalmers, 2012) in the R environment (R Core Team, 2020; Version 4.0.2) to estimate the weighted IRT model parameters. We then examined the impact of disengagement on the estimated item difficulty (b) and item discrimination (a). We compared a and b before filtering and after applying response- and examinee-level filtering, following the same procedure used to compare P+.
Percentage of disengagement
We examined the percentage of disengagement for each CBA mathematics item in 2015 and 2018 to answer the first research question: How much does the percentage of disengagement differ between the items common to PISA 2015 and PISA 2018? (see Fig. 1; Table 2). At the item level, the percentage of disengagement was slightly higher in PISA 2018 than in PISA 2015. The percentage of disengagement ranged from 0% to 2.86% in 2015 and from 0.13% to 5.20% in 2018, with an average of 0.79% in 2015 and 1.37% in 2018. Item CM992Q02 was associated with the highest percentage of disengagement in both 2015 and 2018. The level of disengagement detected was below 1% for 36 items in 2015 and for 22 items in 2018. No disengagement was detected for two items, CM423Q01 and CM919Q01, in 2015 (see Table 3).
Figure 2 shows the difference between the percentage of disengagement in 2015 and 2018 for each of the 51 items, computed as the percentage of disengagement for an item in 2015 subtracted from the percentage of disengagement for that item in 2018. Each dot represents an item, and many of the dots are located around the horizontal line y = 0, thus representing very small differences. The largest difference between the percentage of disengagement in 2015 and 2018 was 2.47% on item CM564Q02, and the smallest difference was 0.01% on item CM496Q01. The differences were less than 1% in 37 (or 72.5%) of the items, less than 2% but more than 1% in 11 (or 21.6%) of the items, and less than 3% but more than 2% in 3 (or 5.9%) of the items.
Scores of disengaged responses
The results for the second research question—How much do the scores of disengaged responses differ between the items common to PISA 2015 and PISA 2018?—suggest that disengaged responses received lower scores than engaged responses in both 2015 and 2018 (see Fig. 3). This finding is consistent with the expectation that disengaged responses are less likely to be correct than are engaged responses.
The average scores of disengaged responses ranged from 0 to 0.5 in both 2015 and 2018, while the average scores of engaged responses ranged from 0.01 to 0.89 in 2015 and from 0.02 to 0.89 in 2018 (see Fig. 3; Table 4). In both 2015 and 2018, the average scores of engaged responses were similar to the average scale scores reported for the population (OECD, 2017), with the largest difference equal to 0.07. The average scores of disengaged responses were 0 for most of the 51 items (41 items in 2015 and 38 items in 2018). Given that omitted responses were also scored as incorrect, and represented with a score of 0, most of the disengaged responses were either incorrect or represented no response in most of the items. The average scores of disengaged responses were not applicable (N/A) for two items in 2015 (CM423Q01 and CM919Q01), because no disengagement was detected for these items.
Among the 51 items examined, the average scores of the disengaged responses did not change between years for 37 items; in 35 of these items, the average score was zero. For 14 items, the average scores under disengagement differed slightly between 2015 and 2018 without a clear pattern being observed. In nine of the items, the average score under disengagement was greater in 2018, but in three items, the average score was greater in 2015. In two of the items, no examinees were detected as disengaged in 2015; therefore, the average could not be compared between years.
Changes in item parameters
To answer the third research question—How much do estimates of item difficulty and item discrimination, with and without disengagement, change between 2015 and 2018?—we computed and compared item difficulty (P+ and b) as well as item discrimination (a).
Comparison of P+
The overall pattern of differences in P+ with disengagement (i.e., before filtering) and without disengagement (i.e., after filtering) was the same in 2015 and 2018. P+ increased for most items (about 40) in both years after applying either response-level or examinee-level filtering (see Fig. 4). In other words, most items became slightly easier after motivation filtering was applied. In both years and for most items, applying examinee-level filtering resulted in larger differences in P+ than response-level filtering (Fig. 4; Table 5).
After applying response-level filtering, P+ increased slightly for 38 items in 2015 and 50 items in 2018, and it remained the same for 12 items in 2015. P+ decreased slightly (< 0.005) for CM998Q04 in both 2015 and 2018 and for CM905Q01 in 2015. Item CM155Q01 showed the largest difference in P+, increasing by 0.012 in 2015 and by 0.016 in 2018. The absolute values of the differences for all 51 items in both 2015 and 2018 were less than 0.02.
After applying examinee-level filtering, P+ increased slightly for 43 items in 2015 and 49 items in 2018, and it remained the same for 6 items in 2015. P+ decreased slightly (< 0.017) for CM998Q04 in both 2015 and 2018 and for CM305Q01 in 2018. Twelve items in 2015 and 29 items in 2018 had nonignorable differences (absolute values of differences ≥ 0.02) in P+ after applying examinee-level filtering. Among these items, CM915Q02 had the largest difference in P+, with increases of 0.032 in 2015 and 0.056 in 2018.
Moreover, of the 15 items in 2015 and 29 items in 2018 that had a higher degree of disengagement (i.e., 1% or higher), the majority (10 in 2015 and 26 in 2018) also had a larger increase in P+ (about 0.01 or higher) after applying examinee-level filtering. However, we did not observe this pattern after applying response-level filtering. There were exceptions in which P+ did not increase very much after applying examinee-level filtering even though a relatively high percentage of the responses were disengaged. For example, for item CM943Q02, where disengagement was detected in 2.05% of the responses in 2018, P+ increased by only 0.002 after examinee-level filtering was applied.
Comparison of item difficulty (b)
The overall pattern of differences in b parameter estimates with disengagement (i.e., before filtering) and without disengagement (i.e., after filtering) was the same in 2015 and 2018. The b parameter estimates decreased for most items in both years after applying either response-level filtering (35 in 2015 and 41 in 2018) or examinee-level filtering (47 in 2015 and 50 in 2018, see Fig. 5; Table 6). Consistent with the findings for P+, most items became slightly easier after motivation filtering. In both years and for all 51 items, the differences in the b parameter estimates were larger after applying examinee-level filtering than response-level filtering.
After applying response-level filtering, the change in the b estimates ranged from −0.583 to 0.028 in 2015 and from −0.621 to 0.044 in 2018. For two items in 2015 and two items in 2018, the absolute values of the differences in the b parameter estimates before and after applying response-level filtering were larger than 0.1. To observe changes in individual items, we also took the standard error (SE) of the b parameter estimates into account. We concluded that the b values were different if the absolute difference between the b parameter estimates before and after applying filtering was larger than the SE of the b parameter estimates before filtering. Based on this comparison, the b values were found to be different for 8 items in 2015 and 16 items in 2018 after applying response-level filtering.
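The SE-based decision rule described above can be sketched as follows; the item names, b estimates, and SE values are hypothetical, not values from the study.

```python
def flag_changed(est_before, est_after, se_before):
    """Flag an item parameter as 'different' if the absolute change after
    motivation filtering exceeds the SE of the unfiltered estimate."""
    return abs(est_after - est_before) > se_before

# Hypothetical b estimates (before filtering, after filtering, SE before filtering)
items = [
    ("item_A", 1.20, 1.05, 0.10),   # |change| = 0.15 > SE 0.10 -> flagged
    ("item_B", -0.40, -0.45, 0.08), # |change| = 0.05 <= SE 0.08 -> not flagged
    ("item_C", 0.30, 0.29, 0.05),   # |change| = 0.01 <= SE 0.05 -> not flagged
]
flags = {name: flag_changed(b0, b1, se) for name, b0, b1, se in items}
print(flags)  # {'item_A': True, 'item_B': False, 'item_C': False}
```

The same rule is applied later to the a parameter estimates; only the parameter and its SE change.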
After applying examinee-level filtering, the changes in the b values ranged from −1.595 to 0.05 in 2015 and from −1.891 to 0.078 in 2018. For ten items in 2015 and 25 items in 2018, the absolute values of the difference in the b parameters before and after applying examinee-level filtering were larger than 0.1. Applying examinee-level filtering resulted in slightly larger differences in the b values than applying response-level filtering. Considering the SE, the b values were different for 36 items in 2015 and for 48 items in 2018 after applying examinee-level filtering.
Most items that had a relatively higher degree of disengagement (i.e., 1% or higher) also had a larger decrease in b estimates after applying response-level (about 0.2 or larger) or examinee-level (about 1 or larger) filtering. In particular, b estimates with extreme values tended to decrease more after applying examinee-level filtering: for example, CM998Q04 (b = 4.198 before filtering; decreased by 1.595 after applying examinee-level filtering in 2015), CM800Q01 (b = −3.849 before filtering; decreased by 1.89 after applying examinee-level filtering in 2018), and CM982Q01 (b = −2.244 before filtering; decreased by 0.90 after applying examinee-level filtering in 2018).
Comparison of item discrimination (a)
The overall pattern of differences in item discrimination (a) with disengagement (i.e., before filtering) and without disengagement (i.e., after filtering) was the same in 2015 and 2018. The a parameter estimates decreased for most items in both years after applying either response-level filtering (35 items in 2015 and 38 items in 2018) or examinee-level filtering (42 items in 2015 and 42 items in 2018, see Fig. 6). In both years, the differences in a parameter estimates were larger for almost all items (47 items in both 2015 and 2018) after applying examinee-level filtering than response-level filtering.
After applying response-level filtering, the change in a values ranged from −0.042 to 0.039 in 2015 and from −0.105 to 0.106 in 2018. The a parameter estimates decreased by less than 0.1 for 35 items in 2015 and 37 items in 2018, and by more than 0.1 but less than 0.2 for none of the items in 2015 and only one item in 2018. Similar to our comparison of the b values, we took the SE of the a parameters into consideration to observe changes in individual items. We concluded that the a values were different if the absolute difference between the a parameter estimates before and after applying filtering was larger than the SE of the a parameter estimate before filtering. With respect to the SE, the a values were different for 11 items in 2015 and 18 items in 2018 after applying response-level filtering.
After applying examinee-level filtering, the change in a values ranged from −0.303 to 0.406 in 2015 and from −0.468 to 0.262 in 2018. The a parameter estimates decreased by less than 0.1 for 34 items in 2015 and 20 items in 2018, by more than 0.1 but less than 0.2 for 7 items in 2015 and 16 items in 2018, and by more than 0.2 for 1 item in 2015 and 6 items in 2018. With respect to the SE, the a values were different for 39 items in 2015 and 44 items in 2018 after applying examinee-level filtering. Applying examinee-level filtering resulted in slightly larger differences in the a values than did applying response-level filtering.
This study provides insight into the comparability of the percentage of disengagement and of the average scores of disengaged responses in PISA 2015 and PISA 2018, as well as of the impact of disengagement on item parameter estimates.
As to the first research question—How much does the percentage of disengagement differ between the items common to PISA 2015 and PISA 2018—the results suggest that in the U.S. sample there were only small differences between PISA 2015 and PISA 2018. Less than 5.2% of the responses were associated with disengagement on individual CBA math PISA items in both years, and 3% of the examinees in 2015 and 4.23% in 2018 were associated with disengagement on at least one item.
For the second research question—How much do the scores of disengaged responses differ between the items common to PISA 2015 and PISA 2018—we found that the average scores of disengaged responses were less than 0.5 and that they were lower than the average scores of engaged responses. This pattern was the same in PISA 2015 and PISA 2018.
For the third research question—How much do estimates of item difficulty and item discrimination, with and without disengagement, change between 2015 and 2018—the results show that the overall pattern of differences in item parameters with disengagement and after applying response-level and examinee-level filtering was similar across years. A summary of the results of the changes in item parameters is provided in Table 1. Applying response-level filtering resulted in small differences in the P+, a, and b values. Applying examinee-level filtering resulted in relatively large differences in the P+, a, and b values, even when a small percentage of disengaged responses was detected for an item, introducing some bias. A similar pattern of results was obtained in Rios et al. (2017) in that examinee-level filtering biased the mean scores in their study.
Discussion and conclusions
Reporting score trends across years is crucial for large-scale assessments, such as PISA and NAEP. Thus, the results of this study on differences in disengagement across years, which pertain to trend reporting, provide important information for such assessments.
There are many factors that affect the impact of disengagement on item parameters. One of the main factors, and one that is the focus of this study, is the percentage of disengaged responses or examinees. Another factor is the method used to detect disengagement. In this study, we used an enhanced item-level disengagement method that is based on both response time and response actions. A third related factor is how the disengaged responses are handled. To illustrate this, we used both response-level and examinee-level filtering. The results showed that different patterns were observed in changes in the a and b parameter estimates when response-level and examinee-level filtering were applied.
To elaborate on our method for detecting disengaged responses, we chose a conservative approach that detected only the responses that had the highest likelihood of displaying disengagement based on their response time and number of actions. Wise (2017) noted that taking a more conservative approach (i.e., failing to detect some potentially disengaged examinees) is preferable to one that falsely detects examinees as disengaged, given that both errors cannot be minimized simultaneously. We examined two ways in which we could relax our constraints: (1) removing the constraint of a 10-second maximum for the response time threshold; and (2) removing the constraint on the minimum number of actions. We tested the impact of these modifications and found that removing the first constraint did not make much difference in the classification of responses as disengaged and removing the second constraint led to unreliable results. The highest value for 10% of the average total time variable was 15 seconds in our sample, which did not introduce much change to the threshold. The scores for the examinees who were labeled as disengaged based on only their response time were close to the average scores of engaged responses, suggesting that some examinees could be falsely labeled as disengaged.
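Under the constraints discussed above (a response-time threshold of 10% of the average item response time, capped at 10 seconds, combined with a minimum number of actions), a simplified version of the classification logic might look like the following sketch. The `min_actions` value is an assumption for illustration; the study derives per-item action thresholds from the data, and the numeric inputs here are invented.

```python
def rt_threshold(avg_item_time, cap=10.0):
    """Response-time threshold: 10% of an item's average response time,
    capped at `cap` seconds (the conservative constraint discussed above)."""
    return min(0.10 * avg_item_time, cap)

def is_disengaged(resp_time, n_actions, avg_item_time, min_actions):
    """A response is flagged as disengaged only if it is BOTH faster than the
    item's time threshold AND has fewer actions than the action threshold."""
    return resp_time < rt_threshold(avg_item_time) and n_actions < min_actions

# Illustrative values: an item averaging 150 s would give an unconstrained
# threshold of 15 s, which the 10 s cap reduces to 10 s.
print(rt_threshold(150.0))                          # 10.0
print(is_disengaged(4.0, 0, 150.0, min_actions=2))  # True: fast and inactive
print(is_disengaged(4.0, 5, 150.0, min_actions=2))  # False: fast but active
```

Requiring both conditions is what makes the approach conservative: a fast response with substantial interaction, or a slow response with no actions, is not flagged.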
In this study, we used the same thresholds in both 2015 and 2018 for comparison purposes. We were able to form common thresholds by merging the datasets from both years. Our goal was to detect disengagement based on the same criteria in both years. In some cases, the threshold value was not a perfect fit for either the 2015 or 2018 dataset. In these cases, we tried to specify a threshold value that would be equally imperfect for both years so that any potential error in classification would impact the two datasets equally.
As to the filtering method, we used both response-level and examinee-level filtering to illustrate the impact of filtering on the item parameters. With examinee-level filtering, the students removed were the same for each item; for response-level filtering, the students removed were different for each item. Wise (2009) suggested that response-level filtering is appropriate when examinees are not being compared based on their raw scores. The advantage of response-level filtering is that it retains as much data as possible by keeping valid responses that are not impacted by disengagement.
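The distinction between the two filtering approaches can be sketched on a toy score matrix; all values below are invented for illustration and are not the study's data.

```python
import numpy as np

# Toy 5-examinee x 3-item score matrix and a matching disengagement-flag
# matrix (True = that response was classified as disengaged).
scores = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 1],
                   [0, 0, 1],
                   [1, 0, 0]], dtype=float)
flags = np.array([[False, False, False],
                  [False, True,  False],
                  [False, False, False],
                  [True,  False, True],
                  [False, False, False]])

# Response-level filtering: drop only the flagged responses (set to missing);
# the rest of each examinee's record is retained.
resp_filtered = scores.copy()
resp_filtered[flags] = np.nan

# Examinee-level filtering: drop every examinee with at least one flagged
# response, removing all of that examinee's responses.
exam_filtered = scores[~flags.any(axis=1)]

print(np.nanmean(resp_filtered, axis=0))  # per-item P+, response-level filtering
print(exam_filtered.mean(axis=0))         # per-item P+, examinee-level filtering
```

The sketch makes the trade-off concrete: response-level filtering keeps the unflagged responses of examinees 2 and 4, whereas examinee-level filtering discards those examinees entirely, so each item loses a different amount of data under the two approaches.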
Overall, we did not find any serious impact of response-level filtering on the average percent correct of items, P+. Our results are in accordance with previous studies suggesting that response-level filtering does not impact mean test scores (Kong et al., 2007; Wise, 2006). We observed nonignorable changes in item parameters for only a few items after applying examinee-level filtering. Hauser and Kingsbury (2009) found that proficiency estimates were not impacted by examinee-level filtering if the percentage of disengagement did not exceed 20%. In our study, the percentage of disengaged examinees was low, which may explain why we observed changes regarding item parameter estimates with disengagement and without disengagement only in a few items.
Researchers should consider the ability distribution of disengaged examinees when deciding whether examinee- or response-level motivation filtering would be more appropriate. Specifically, they should be wary of the assumption that disengaged responses are independent of both item (i.e., item difficulty) and test taker (i.e., the latent trait, θ) characteristics when applying motivation filtering (Rios et al., 2017). Wise’s (2009) assumption about examinee-level filtering was that student effort is unrelated to true proficiency. If this assumption were violated, removing examinees would result in either an underestimation or an overestimation of item parameters, depending on whether the examinees removed were of high or low ability. In our study, we assumed that item and examinee characteristics are independent of the disengaged behaviors. We observed that the items with the highest percentage of disengagement were not the most difficult items, which echoed the findings from previous studies that item difficulty is not significantly related to examinee effort (Rios et al., 2017; Wise, 2006; Wise & Kingsbury, 2016). This also supported our assumption that there was not a linear relationship between the disengagement levels we observed and the difficulty of the items we examined. As for test taker ability, our analysis focused on the 51 items common to both assessments; thus, we had limited knowledge with which to test our assumption about students’ abilities.
We did not find any patterns in the disengaged responses based on examinee demographics. For each item in both 2015 and 2018, male and female examinees equally provided disengaged responses. In both years, 3% of the examinees in each grade were found to be disengaged. Moreover, the level of disengagement was the same among students who repeated grades as among those who did not repeat grades.
Generally speaking, researchers should consider the possibility that test takers use rapid guessing as a test-taking strategy. Test takers may skip items that are difficult for them in order to allocate more time and energy to the items they have a higher probability of answering correctly with reasonable effort. Such calculated behavior may result in more disengagement being observed on difficult items and in some lower-ability students displaying disengaged behavior more frequently. In such cases, response-level filtering would reduce the amount of information gained on the difficult items subject to higher levels of disengagement, and examinee-level filtering would eliminate a distinct subgroup of examinees. Therefore, researchers should examine the patterns in disengagement before deciding to apply response-level or examinee-level filtering.
Finally, researchers should bear in mind that differences in scores between disengaged and engaged examinees are not informative for gauging the impact of disengagement on item parameters. Researchers have used higher average scores among engaged examinees than among disengaged examinees as evidence that disengaged examinees were correctly detected. Consequently, the P+ of the items is expected to increase after eliminating disengaged responses. However, larger score differences between engaged and disengaged students do not necessarily correspond to larger differences in P+. For example, Rios et al. (2017) reported that examinee-level filtering artificially inflated the true mean score when ability was related to disengaged responding. In particular, when examinee-level filtering is applied (or item parameters are estimated with a model-based approach such as IRT), the impact of disengagement on item parameters is not obvious, because the item parameter estimates for one item will be influenced by the disengaged responses to the other items.
Overall, this study provides an example of how to detect and handle disengagement in a large-scale assessment administered in two years in the United States based on PISA data. The study highlights the differences in disengagement in both years as well as the implications of the decisions made for handling scores received under disengagement on item difficulty and discrimination. Since the study uses data from one country, researchers should be cautioned against generalizing the results to student populations in other countries and regions. It would, of course, be desirable for future work to compare disengagement across years with data from other countries.
Future research should also consider the potential effects of examinee-level filtering more carefully; for instance, by suggesting cut-offs for the percentage of disengaged examinees to be used in the filtering. Finally, we would like to note that this study was the first to follow a data-driven approach to set the threshold for the number of actions. We adapted the visual approach to specify the threshold at the beginning of a spike, but future studies may apply other methods for setting the threshold.
Availability of data and materials
All results generated for this study are included in this published article. The datasets analyzed during the current study are publicly available at https://www.oecd.org/pisa/data/.
Abbreviations
CTT: Classical Test Theory
GPCM: Generalized Partial Credit Model
IRT: Item Response Theory
NAEP: National Assessment of Educational Progress
OECD: Organisation for Economic Co-operation and Development
PIAAC: Programme for the International Assessment of Adult Competencies
PISA: Programme for International Student Assessment
Bovaird, J., & Embretson, E. (2006, August). Using response time to increase the construct validity of trait estimates. Paper presented at the 114th Annual Meeting of the American Psychological Association, New Orleans, LA.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29.
Debeer, D., Buchholz, J., Hartig, J., & Janssen, R. (2014). Student, school, and country differences in sustained test-taking effort in the 2009 PISA Reading Assessment. Journal of Educational and Behavioral Statistics, 39, 502–523.
Goldhammer, F., Martens, T., & Lüdtke, O. (2017). Conditioning factors of test-taking engagement in PIAAC: An exploratory IRT modelling approach considering person and item characteristics. Large-Scale Assessments in Education, 5(18), 1–25.
Guo, H., Rios, J. A., Haberman, S., Liu, O. L., Wang, J., & Paek, I. (2016). A new procedure for detection of students’ rapid guessing responses using response time. Applied Measurement in Education, 29(3), 173–183.
Hauser, C., & Kingsbury, G. G. (2009, April). Individual score validity in a modest-stakes adaptive educational testing setting. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Kong, X. J., Wise, S. L., & Bhola, D. S. (2007). Setting the response time threshold parameter to differentiate solution behavior from rapid-guessing behavior. Educational and Psychological Measurement, 67(4), 606–619.
Lee, Y.-H., & Jia, Y. (2014). Using response time to investigate students’ test-taking behaviors in a NAEP computer-based study. Large-Scale Assessments in Education, 2(1), 8–41.
Ma, L., Wise, S. L., Thum, Y. M., & Kingsbury, G. (2011). Detecting response time threshold under the computer adaptive testing environment. Paper presented at the annual meeting of the National Council of Measurement in Education, New Orleans, LA.
Organisation for Economic Co-operation and Development (OECD). (2000). Knowledge and skills for life: First results from PISA 2000. OECD Publishing.
Organisation for Economic Co-operation and Development (OECD). (2017). PISA 2015 technical report. Paris: OECD Publishing. Retrieved from https://www.oecd.org/pisa/data/2015-technical-report/PISA2015_TechRep_Final.pdf
Organisation for Economic Co-operation and Development (OECD). (2022). Chapter 9: Scaling PISA data. In PISA 2018 technical report. Paris: OECD Publishing. Retrieved from https://www.oecd.org/pisa/data/pisa2018technicalreport/Ch.09-Scaling-PISA-Data.pdf
R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Retrieved from http://www.r-project.org/index.html
Rios, J. A., & Deng, J. (2021). Does the choice of response time threshold procedure substantially affect inferences concerning the identification and exclusion of rapid guessing responses? A meta-analysis. Large-Scale Assessments in Education, 9(18), 1–25.
Rios, J. A., & Guo, H. (2020). Can culture be a salient predictor of test-taking engagement? An analysis of differential non-effortful responding on an international college-level assessment of critical thinking. Applied Measurement in Education, 33(4), 263–279.
Rios, J. A., Guo, H., Mao, L., & Liu, O. L. (2017). Evaluating the impact of careless responding on aggregated-scores: To filter unmotivated examinees or not? International Journal of Testing, 17(1), 74–104.
Sahin, F., & Colvin, K. F. (2020). Enhancing response time thresholds with response behaviors for detecting disengaged examinees. Large-Scale Assessments in Education, 8(5), 1–24.
Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34(3), 213–232.
Sundre, D. L., & Wise, S. L. (2003, April). Motivation filtering: An exploration of the impact of low examinee motivation on the psychometric quality of tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Wise, S. L. (2006). An investigation of the differential effort received by items on a low-stakes computer-based test. Applied Measurement in Education, 19(2), 95–114.
Wise, S. L. (2009). Strategies for managing the problem of unmotivated examinees in low-stakes testing programs. The Journal of General Education, 58(3), 152–166.
Wise, S. L. (2015). Effort analysis: Individual score validation of achievement test data. Applied Measurement in Education, 28(3), 237–252.
Wise, S. L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52–61.
Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43(1), 19–38.
Wise, S. L., & Gao, L. (2017). A general approach to measuring test-taking effort on computer-based tests. Applied Measurement in Education, 30(4), 343–354.
Wise, S. L., & Kingsbury, G. G. (2016). Modeling student test-taking motivation in the context of an adaptive achievement test. Journal of Educational Measurement, 53(1), 86–105.
Wise, S. L., & Kong, X. J. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18(2), 163–183.
Wise, S. L., & Ma, L. (2012, April). Setting response time thresholds for a CAT item pool: The normative threshold method. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, Canada.
Wise, V. L., Wise, S. L., & Bhola, D. S. (2006). The generalizability of motivation filtering in improving test score validity. Educational Assessment, 11(1), 65–83.
Yamamoto, K. (1995). Estimating the effect of test length and text time on parameter estimation using the HYBRID model (ETS Research Report RR-95-02). Educational Testing Service.
The authors thank Markus Broer, Young Yee Kim, and Xiaying “James” Zheng for their review of initial findings and methods. The authors also thank Amy Rathbun for their support in publishing this manuscript.
The concept and preliminary results of this paper were developed and conducted during the 2020 NAEP doctoral internship program administered by AIR and funded by NCES under Contract No. ED-IES-12-D-0002/0004. The authors did not receive funding for finalizing the results and preparing this manuscript. The views, thoughts, and opinions expressed in the paper belong solely to the authors and do not reflect NCES position or endorsement.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Kuang, H., Sahin, F. Comparison of disengagement levels and the impact of disengagement on item parameters between PISA 2015 and PISA 2018 in the United States. Large-scale Assess Educ 11, 4 (2023). https://doi.org/10.1186/s40536-023-00152-0
- Disengaged responses
- Process data
- Item parameter estimates
- Large-scale assessment