Examining successful and unsuccessful time management through process data: two novel indicators of test‑taking behaviors


Papanastasiou and Michaelides Large-scale Assessments in Education (2024) 12:3

According to Lundgren and Eklöf (2020), test-taking motivation is a specific motivation to maximize performance on a test. To achieve this goal when taking a test, examinees have to expend effort and regulate the necessary skills, knowledge, time and resources. Empirical studies originally approached test-taking effort via self-reports. However, behavioral indicators, primarily automatically recorded item response times from computerized tests, have been shown to be less prone to response biases, less intrusive, and available at the item level (Eklöf, 2010). Time spent on reading, processing and giving an answer to an item is considered a reliable behavioral measure of engagement with the test. Much of this research was initiated by Schnipke and Scrams (1997), followed by Steven Wise and colleagues, who developed the use of (a) rapid guessing, i.e., providing a response in a very short time interval, as a manifestation of disengaged behavior when responding to a test item; and (b) response time effort as an aggregate indicator of effortful behavior on the whole test (Wise, 2017; Wise & Kong, 2005).
Recently, researchers have shown interest in student effort and engagement with international large-scale assessments. Using methods to identify rapid guessers, Michaelides et al. (2020) and Pools and Monseur (2021) have shown that response time effort is correlated with performance in PISA and weakly correlated with achievement motivation and enjoyment variables. Guo and Ercikan (2020), Michaelides and Ivanova (2022), and Rios and Soland (2022) have also looked at cross-country differences that exist in rapid guessing.
Implementation of response time measures to identify rapid guessing behavior (as a dichotomous variable) at the item level requires the selection of a threshold. Examinees who provide a response at a time below the stated threshold are characterized as rapid guessers, not engaging in solution behavior (Wise & DeMars, 2006). Proposed ways to determine a threshold for rapid guessing include a fixed time point, common for all items (Wise et al., 2010), or judgmental decisions based on item length or the inspection of the response time frequency distribution (Wise & Kong, 2005; Setzer et al., 2013). Other approaches incorporate performance on the item depending on response time (Guo et al., 2016), modeling with mixture models or IRT (e.g., Ulitzsch et al., 2020), and normative methods based on a proportion of the average time expended on an item (Wise & Ma, 2012). Comparisons aimed at identifying an optimal threshold approach have been inconclusive (Wise, 2019), and there is no consensus on a preferred method, as each one has strengths and weaknesses (cf. Rios & Deng, 2021; Soland et al., 2021). Simpler methods such as the 5-s rule for all items are easy to implement and provide thresholds for all items, but are criticized for higher misclassification errors. For example, a proficient examinee may respond rapidly but thoughtfully to an easy item and could be misclassified as a disengaged rapid guesser if he or she provided a response in less than 5 s; or a disengaged examinee who may glance over an item before moving slowly to the next item could be identified as a non-rapid guesser. Unavoidably, trying to reduce the possibility of false-positive results by changing the threshold increases the possibility of false negatives (Wise, 2017). Moreover, studies have predominantly looked at multiple-choice items, although some initial proposals have recently been put forth for omitted and constructed responses (Ivanova et al., 2020; Wise & Gao, 2017). More sophisticated methods that take performance into account appear more valid, but rest on distributional assumptions and often do not converge or do not provide thresholds for all items (cf. Soland et al., 2021; Ulitzsch et al., 2020, 2022).
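To make the threshold-based approach concrete, the following minimal Python sketch flags rapid guesses under two of the simpler rules mentioned above (a fixed cut-off common to all items, and a normative cut-off set as a proportion of each item's mean response time) and aggregates the flags into a response time effort score, taken here as the share of items answered at or above the threshold. The function names, the data layout and the 0.10 proportion are illustrative assumptions, not details taken from the cited studies.

```python
from statistics import mean


def fixed_thresholds(item_times, seconds=5.0):
    """Same cut-off (in seconds) for every item, e.g. a 5-s rule."""
    return {item: seconds for item in item_times}


def normative_thresholds(item_times, proportion=0.10):
    """Cut-off set as a proportion of each item's mean response time.

    The 0.10 default is an illustrative assumption.
    """
    return {item: proportion * mean(times) for item, times in item_times.items()}


def response_time_effort(student_times, thresholds):
    """Share of a student's items answered at or above the threshold.

    A response below the threshold is treated as a rapid guess.
    """
    solution_flags = [t >= thresholds[item] for item, t in student_times.items()]
    return sum(solution_flags) / len(solution_flags)


# Hypothetical response times (seconds) of all examinees, per item
item_times = {"q1": [4, 20, 36], "q2": [10, 50, 60]}
# Hypothetical response times of one examinee
student = {"q1": 4, "q2": 3}

rte_fixed = response_time_effort(student, fixed_thresholds(item_times))
rte_normative = response_time_effort(student, normative_thresholds(item_times))
```

In this toy example the same examinee is classified differently by the two rules on item q1, which illustrates why the choice of threshold affects misclassification.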
Further information about response events is available in digital assessments from log files. Examining timing data from log files alone does not always provide an adequate indication of an examinee's test-taking behavior. The time that a student might need to respond to a test item depends on various factors, including the examinee's overall ability, examinee test-taking behaviors or strategies, item characteristics (e.g., item difficulty, item length, auxiliary visual material), as well as any interaction of these factors. Two students who spent very little time on a test item might have done so for different reasons. One student might have spent very little time because the question was very easy for them, while another student might have done so because they did not want to spend any effort on a question that was too difficult for them. Consequently, timing data can be more informative when examined in relation to other variables.
The purpose of the current study is to present and evaluate two novel indicators of examinee test-taking behaviors that combine examinee response and timing data to better understand and describe test-taking effort. To calculate the proposed indicators, the first step is to calculate $\mathrm{MedianT}_i$, the median response time for each of the multiple-choice items $i$, $i = 1, \ldots, K$, that were administered in a test booklet. At a second stage, a deviation score is calculated for each student $j$ who was administered item $i$ by subtracting the median response time for item $i$ from the student's response time $T_{ij}$ for the same item. Based on these deviation scores, a cumulative indicator is calculated for each student for each of the new indicators as follows:

1) For items $i$ that were answered incorrectly by person $j$ in less time than the median response time, the absolute value of this time difference was added to the Unsuccessful Time Management indicator (UTM) for the examinee as follows: If item $i$ was answered incorrectly by person $j$, then

\[
\mathrm{UTM}_j = \sum_{i=1}^{K} \left| T_{ij} - \mathrm{MedianT}_i \right| \cdot \mathbf{1}\left[ T_{ij} < \mathrm{MedianT}_i \right] \tag{1}
\]

where the sum runs over the items answered incorrectly by person $j$. Since such items have been answered incorrectly, it is likely that the students made less than adequate effort to answer them correctly, since they provided a response in less time than the median.

2) For items $i$ that were answered correctly by person $j$ in less time than the median response time, the absolute value of this time difference was added to the Successful Time Management indicator (STM) for the examinee. If item $i$ was answered correctly by person $j$, then

\[
\mathrm{STM}_j = \sum_{i=1}^{K} \left| T_{ij} - \mathrm{MedianT}_i \right| \cdot \mathbf{1}\left[ T_{ij} < \mathrm{MedianT}_i \right] \tag{2}
\]

where the sum runs over the items answered correctly by person $j$. Since such items have been answered correctly, it is likely that the students were either already proficient in the specific content and thus did not need additional time to respond to them, or were just lucky in a rapid guess.
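The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name, the tuple layout of the input, and the use of an unweighted median over all examinees who answered the item are assumptions made for the example.

```python
from statistics import median


def time_management_indicators(responses):
    """Compute STM and UTM (in seconds) per student.

    responses: iterable of (student, item, seconds, correct) tuples,
    one per administered multiple-choice item (hypothetical layout).
    Returns two dicts: student -> STM and student -> UTM.
    """
    responses = list(responses)

    # Step 1: median response time per item across all examinees
    times_by_item = {}
    for _, item, seconds, _ in responses:
        times_by_item.setdefault(item, []).append(seconds)
    median_t = {item: median(times) for item, times in times_by_item.items()}

    # Steps 2 and 3: deviation from the item median, accumulated per
    # student only when the response was faster than the median
    stm, utm = {}, {}
    for student, item, seconds, correct in responses:
        stm.setdefault(student, 0.0)
        utm.setdefault(student, 0.0)
        deviation = seconds - median_t[item]
        if deviation < 0:
            if correct:
                stm[student] += abs(deviation)
            else:
                utm[student] += abs(deviation)
    return stm, utm


# Hypothetical data: three students, two items
data = [
    ("A", "q1", 10, True), ("B", "q1", 30, False), ("C", "q1", 20, True),
    ("A", "q2", 40, False), ("B", "q2", 15, False), ("C", "q2", 25, True),
]
stm, utm = time_management_indicators(data)
```

With these toy data the item medians are 20 s and 25 s, so student A accrues 10 s of STM (a fast correct answer on q1) and student B accrues 10 s of UTM (a fast incorrect answer on q2). Responses at or above the median contribute nothing to either indicator.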
Based on these indicators, the research questions of this study, that examine the indicators of "Successful Time Management" (STM) and "Unsuccessful Time Management" (UTM), are the following:

Such indicators can be used for various purposes. For example, they could be used to obtain a more detailed picture of the students' test-taking behaviors, as well as to describe the effort they put into the test, conditioning on the accuracy of their responses. In addition, by studying their association with other correlates of effort, it may be possible to identify test design features that can be improved. These indicators will also enrich the field of measurement by moving beyond the examination of rapid responses identified in relation to thresholds that classify students into rapid guessing or non-rapid guessing groups (Wise & DeMars, 2005; Wise & Kong, 2005). These scores, which are on a continuous scale of easily interpretable time units (seconds), represent the fluency and efficiency of examinees in the case of STM, or the lack thereof in the case of UTM, while responding to the items in the course of a test session. Finally, they hold the potential to help strengthen the validity of low-stakes tests, such as the ones administered by the IEA, where student motivation is a potential concern (Baumert & Demmrich, 2001). On a more applied level, educators and policy makers could also utilize such results in the future to examine factors that can improve student engagement during test-taking.

Methods
The population of the study included grade 4 students from the USA. The sample utilized for the analyses in the current study included the students who were administered Booklets 7 and 8 in e-TIMSS 2019. Booklet 7 was randomly selected as a booklet which started with mathematics items, while Booklet 8 started with science items. This sample included 1250 students, of whom 49.44% were female. The average age of the students was 10.26 years (SD = 0.42). The variables from TIMSS used for the current study were obtained from the grade 4 student achievement data files, as well as the student context data files. The information obtained from the student achievement data files comprised the examinee item responses on multiple-choice items graded as correct or incorrect, the five plausible values (PV) in mathematics, the timing of students on each mathematics multiple-choice item that was administered to them, and the examinee benchmark levels, along with a special process variable from the e-TIMSS dataset, titled mathematics (or science) responder classification. The grade 4 responder classification variable categorized students based on the patterns of not-reached items (Fishbein et al., 2021) into one of three distinct categories: responders are classified based on whether they have reached all items on the test, whether they have run out of time before completing the test, or whether they stopped responding while they still had time to complete the test. From the student context data files, two motivational scales were obtained: the Students liking mathematics scale and the Student confidence in mathematics scale, along with the corresponding scales for the science test. The student confidence variable was created based on nine items measuring confidence in mathematics and science, separately for each subject. The students like learning scale was created based on another nine questionnaire items corresponding to each subject from the student background questionnaire.
The data for the study included five benchmark levels per student in mathematics to correspond to the five PVs in the subject, as well as five benchmarks for each student for science. To be able to present the results of the STM and UTM indicators by benchmark levels, a decision was made to identify the median benchmark for each student, for each subject. Therefore, the median benchmark was specifically created for each student to avoid presenting results separately for each PV.
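As an illustration of this step, the short sketch below reduces a student's five benchmark levels (one per plausible value) to a single median benchmark. The integer coding of the levels (for example 1 to 5, from lowest to highest) and the function name are assumptions made for the example. Because the number of values is odd, the median is always one of the observed levels.

```python
from statistics import median


def median_benchmark(levels):
    """Median of a student's five benchmark levels, one per plausible value.

    levels: five integer benchmark codes (coding assumed, e.g. 1..5).
    """
    if len(levels) != 5:
        raise ValueError("expected one benchmark level per plausible value (5)")
    return median(levels)


# Hypothetical student whose five PVs fall in levels 3, 4, 3, 2 and 4
student_levels = [3, 4, 3, 2, 4]
student_median_level = median_benchmark(student_levels)
```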
Finally, for each examinee, we characterized an item response as extreme rapid guessing if it was provided within 3 s of the item appearing on the screen, under the assumption that TIMSS items cannot be answered even with partial effort by 4th-graders in such a brief time interval. Then, we counted the number of items on which the extreme rapid guessing behavior appeared, an indicator similar to, but the reverse of, Response Time Effort (Wise & Kong, 2005).
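The count described above can be sketched as follows. The function name and input layout are hypothetical, and treating "within 3 s" as strictly less than three seconds is an assumption about the boundary case.

```python
def count_extreme_rapid_guesses(item_times, cutoff=3.0):
    """Number of items answered in less than `cutoff` seconds.

    item_times: response times (seconds) of one examinee, one per item.
    The strict inequality at the boundary is an assumption.
    """
    return sum(1 for seconds in item_times if seconds < cutoff)


# Hypothetical examinee with two responses under three seconds
examinee_times = [1.2, 2.9, 3.0, 14.5, 22.0]
n_extreme = count_extreme_rapid_guesses(examinee_times)
```

A higher count indicates less effort, which is why the measure runs in the opposite direction to Response Time Effort.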
The analyses were mostly performed with descriptive statistics and inferential statistics. All analyses incorporating plausible values (PV) were conducted using the International Database Analyzer (version 5.0.23) developed by the International Association for the Evaluation of Educational Achievement (IEA), and utilized student weights in the analyses. This specialized software tool facilitated accurate handling and interpretation of PVs, ensuring robustness in the findings. Additionally, the analyses were replicated across two key academic subjects: mathematics and science.

Results
The descriptive statistics of the STM and UTM indicators, which were created based on the USA multiple-choice e-TIMSS items for grade 4, are presented below. According to Table 1, the percentage of students who utilized less time than average on at least one of their items for mathematics was 89.52% for the STM indicator, and 79.60% for the UTM indicator. The corresponding percentages for science were 89.52% and 82.80%. Although slight differences are observed between Booklets 7 and 8, in general, there tends to be a higher percentage of students engaging in STM compared to UTM for both subjects. In terms of the magnitude of these differences, the medians of the STM indicator tended to be higher than the medians of the UTM indicator for both booklets. This resulted in an overall median of 27.96 s for STM in mathematics, 21.58 s for UTM in mathematics, and 30.39 s and 16.74 s respectively for science.
The correlation between the STM and UTM indicators for mathematics equals −0.05 (p = 0.088), and for science it equals −0.03 (p = 0.281), suggesting no association between the two indicators (Table 2). However, the correlation between the mathematics and science STM variables equaled 0.42 (p < 0.001), and 0.40 (p < 0.001) for UTM. These results clearly indicate that there are similarities in the patterns of time use between academic subjects. The correlations between the STM and UTM variables and the time of last response were all negative, strong in size for STM, moderate for UTM, and statistically significant. The largest correlation was observed between the STM and the time of last response in science (r = −0.58, p < 0.001), while the smallest was between UTM in science and time of last response in science (r = −0.30, p < 0.001). This finding verifies the validity of the indicators, since it would be expected that examinees who ended the test sooner would have higher levels of STM and UTM; it does not, however, discriminate between the two indicators.
A series of analyses related to construct validity was conducted by relating the indicators to other variables, in order to understand them further. These variables are the median benchmark, the mathematics responder classification, and examinee achievement, as represented by the five plausible values (PV). Table 3 presents the breakdown of the two indicators by median benchmark level. For higher benchmark levels, the successful time management indicator is higher, while the unsuccessful time management indicator is lower. Students in higher benchmark levels tend to have higher levels of STM and lower levels of UTM for both mathematics and science.
The Mathematics Responder Classification variable placed students into four categories, based on their overall timing behavior during the test. This variable included the categories of (a) Reached all items; (b) Ran out of time; (c) Stopped responding; and (d) Could not be classified. In the current sample, only three students were placed in the category "Could not be classified" and were therefore not included in the analyses. Based on this classification, the majority of the students who managed to reach all items on the test were also the ones with the largest median STM and UTM variables (Table 4). Most likely, this occurred due to their attempts to go through the test without many delays, to make sure that they would manage to reach the end of the test. These were also the students with the highest average achievement in terms of their plausible values (PVs). Overall, however, the percentage of students who were classified in the other categories was very small, which made it impossible to reach any robust conclusion regarding these students, except that their STM indicator was at or near zero, which implies a less than optimal test-taking strategy.
With the IEA Database Analyzer we examined the correlation between each of the two indicators and the plausible values (PV). The results for mathematics indicated that the correlation between the Successful Time Management indicator and the five PVs equaled r = 0.35 (se = 0.04), while the correlation between the Unsuccessful Time Management indicator and the PVs equaled r = −0.53 (se = 0.02). In science, the corresponding correlations equaled r = 0.41 (se = 0.03) with STM and r = −0.44 (se = 0.04) with UTM. Based on this result, it appears that higher achieving students tend to show increased levels of STM and lower levels of UTM; they tended to have more unused time on their correct answers (thus, most likely being an indicator of mastery of the test content), and less unused time on their incorrect answers (meaning that on items they did not do well on, they were not responding hastily).
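A common way to obtain a point estimate from plausible values is to compute the statistic once per PV and average the results. The sketch below illustrates only that averaging step for a Pearson correlation. It is a simplified illustration and not a reproduction of the IEA software: it ignores student weights and does not compute standard errors, and the function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Unweighted Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_with_pvs(indicator, plausible_values):
    """Correlate an indicator with each PV, then average the coefficients.

    indicator: one value per student (e.g. STM in seconds).
    plausible_values: list of PV vectors, each with one value per student.
    Returns (average correlation, list of per-PV correlations).
    """
    per_pv = [pearson(indicator, pv) for pv in plausible_values]
    return mean(per_pv), per_pv


# Hypothetical STM values and two PV vectors for four students
stm_values = [5.0, 12.0, 20.0, 31.0]
pvs = [
    [480.0, 510.0, 530.0, 575.0],
    [470.0, 520.0, 525.0, 560.0],
]
avg_r, per_pv_r = correlation_with_pvs(stm_values, pvs)
```

Keeping the per-PV coefficients alongside the average makes it easy to check how much the estimate varies across plausible values.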
Table 5 presents the indicator-achievement relationship, broken down by benchmark. This analysis was conducted to determine whether the relationships between ability (as indicated by the PVs) and the two indicators differ among the benchmark levels. As presented in Table 5, the correlations between the relevant variables were quite small. Most likely this occurred due to the restriction of range of the achievement levels within each benchmark. The only correlation that was statistically significant at the 0.05 level was at benchmark level 3, between STM and the PV in science. Within this benchmark, the students who had higher levels of achievement also had higher levels of Successful Time Management, by answering questions correctly in less time than average. In mathematics, none of the correlations were statistically significant at any benchmark level.
The STM and UTM indicators were also correlated with student motivational variables to further examine their validity. The selected variables were Students like learning each subject, and Students confident in the subject (Table 6). Of the two motivational variables, the correlation of Students like learning with STM was very small and not statistically significant for both subjects (0.06 and 0.02 with STM). The correlation was larger between Student confidence and STM, which equaled 0.24 (p < 0.001) in mathematics and 0.12 (p < 0.001) in science. This result is not surprising, since self-confidence tends to be more strongly aligned with performance compared to enjoyment of the subject (e.g., Michaelides et al., 2019). The correlation between UTM and liking the two subjects was equal to −0.12 (p < 0.01) for both subjects. However, the correlation between UTM and being confident in the subject was equal to −0.24 (p < 0.001) for mathematics and −0.12 (p < 0.01) for science.
The total number of examinees who responded in less than three seconds was estimated as a proxy for the lack of response time effort. In mathematics, there were 45 examinees who responded to at least one item in less than 3 s, while in science there were 55. Among the small number of examinees who had at least one very rapid response, the correlation between the number of extreme rapid guesses and UTM in mathematics was 0.30 (p = 0.05). As would be expected, the more examinees engaged in extreme rapid guessing, the more likely they would accrue time because of rapid incorrect response behavior. The corresponding correlation in the science data was 0.14 and not statistically significant. The number of extreme rapid guesses was also not significantly correlated with STM in either subject.

Discussion
By using classical test theory, a large proportion of assessment researchers, educators, and psychometricians have focused on correct, incorrect, and partially correct answers as indicators of examinee proficiency. Recent technical and methodological advancements in the area of computerized large-scale testing, however, have provided us with opportunities to better understand the testing process through the utilization of process data (Papanastasiou & Eklöf, 2020). Process data provide additional sources of information obtained from examinees during the test-taking process, and they hold the potential to revolutionize the field of testing. However, this field of study is relatively recent. Moreover, because process data have only recently started to be collected by the IEA, few efforts have been made to combine process data with additional test-taking behaviors. Since no unified and easily understandable indicator exists that combines test-taking behaviors with process data, this study aimed to create two novel indicators of test-taking behaviors that combine accuracy and timing data in order to describe test-taking effort. These indicators, which are easy to calculate and comprehend, can easily be generalized to any other study that is administered electronically and for which process data are available.
An additional original feature of these indicators is that their estimation depends on the time that students spent on each item, while controlling for the correctness of their response to that item. Consequently, by incorporating the accuracy of a response in the estimation procedure, the misclassifications that were likely to occur with other time-based rapid-guessing indicators are avoided in the current approach. For example, a highly competent student who might have correctly answered a question very quickly should not automatically be considered (misclassified) as a rapid guesser. Moreover, by interpreting the timing data based on whether a response was correct or incorrect, and by comparing the response time to the median, the possibility of having future examinees take advantage of such behaviors is eliminated. For example, it would be difficult for examinees to know in advance whether their response time was above or below the median, or to know for sure whether their response was correct or incorrect, in an attempt to demonstrate either high levels of STM or low levels of UTM accordingly. As a result, these indicators are less susceptible to manipulation by examinees, which is a great concern related to the use of process data (Bennett, 2018).
The medium size correlation that was observed between STM and UTM suggests that students who utilized less time than average on incorrect answers also did so to some extent on correct answers. The fact that these occurrences mostly occurred with the students who managed to complete all test items in the allocated time might be an indication of a test-taking strategy to make sure that they had enough time to complete the test. However, the correlation between the two indicators was not large enough to universally claim that utilizing less time than average is purely based on a strong "speededness" trait. This is further supported by the fact that the behavior of responding in less time than average was related to other explanatory variables. For example, students who spend less time than average on correct answers tend to be in higher benchmark levels, indicating that this might have occurred since they had mastered the item content and did not need much time to respond to such items correctly. Also, students who spend less time than average on incorrect answers tend to be in the lower benchmark levels, which could be an indicator of making less effort on the test. This was further verified by the result that student confidence in mathematics was more highly correlated with STM than with UTM.
Overall, although narrow in range, STM was positively correlated with test performance, and it tended to occur with students who were in the higher benchmark levels and who also had more confidence in mathematics. This further verifies that this indicator could be considered as an indication of mastery of the content by the examinee. In contrast, UTM occurred more frequently and to a larger extent than STM. This indicator occurred more frequently with students in the lower benchmark levels, and it was not correlated with confidence in mathematics. Also, the fact that this indicator was larger for the students who stopped responding to the test, and was moderately correlated with the extreme rapid guessing frequency, further supports that for these students, this indicator is related to their lack of effort on the test.
Therefore, using less time than average on a test occurs for various reasons. Although to some extent using less time on test items might be an indication of a test-taking behavior that ensures that all items can be completed in the allotted time, this alone does not describe the full situation. One student might have used less time than average because they had clearly mastered the item content and did not need much time to answer it correctly, while another student might have used the same amount of time because they did not put much effort into the question, and eventually answered it incorrectly. As a result, examining timing data in relation to whether an item was answered correctly or incorrectly in less time than average can provide us with more detailed information regarding test-taker behavior. Such information can be used to describe examinees in IEA studies, beyond merely looking at their proficiency level. These novel indicators can also describe the ways in which each student took the test. For example, it will be possible to differentiate students within a country who mostly responded carelessly to many items, and omitted many other items, from students in another country with similar levels of proficiency, who utilized all of their available time and viewed the difficult items many times in order to answer them. This is especially useful for international studies which are low-stakes, and in which student motivation and test-taking effort are a potential concern (Baumert & Demmrich, 2001). Educators and policy makers could also utilize such results to examine factors that can improve student engagement overall during test-taking. These indicators might also differentiate the students who managed to obtain high scores with a high level of the STM indicator (since they managed to respond to most questions in a much lower time than average, without being careless rapid guessers) from other students with similar scores who were persistent, utilized all of their available time, and viewed the difficult items many times in order to answer them correctly. On the other side of the continuum, examinees with large UTM were students who answered many items rapidly and incorrectly, so this may be a way to identify those with a general disengagement with the test content.
Finally, these unified indicators can also be used to demonstrate the degree of validity of the IEA studies, since they can be used to describe examinee behaviors in more detail, without automatically assuming that all responses are thoughtful, or that all rapid responses are always rapid guesses and indications of careless behaviors. This is further supported by the American Educational Research Association et al. (2014), which stated that test-taking efforts need to be taken into consideration as important validity factors when interpreting scores from low-stakes assessments.
Beyond the results presented in the current study, further research should be performed to examine these indicators in more detail. For example, how do these variables perform in other subject areas in other studies? Would similar results be obtained from the data for grade 8 students or from students in other countries? What are the examinee characteristics or country variables that could help explain the variations that exist in the magnitude of these indicators? Finally, additional research should also be performed to determine how these indicators can be calculated in polytomous and Problem Solving and Inquiry (PSI) items in TIMSS, since the current study has only examined examinee timing on multiple-choice items.

Conclusions
The potential of process data in reshaping the way we perceive and evaluate testing processes cannot be overstated. The introduction of the STM and UTM indicators, which effectively combine accuracy and timing data, presents a promising way forward. They allow for a deeper, more nuanced understanding of test-taking behaviors, especially in relation to speed (responding in less time than the median response time) and accuracy (whether an item is correct or incorrect), and can be adapted across various testing environments. These indicators could provide essential insights to differentiate between students who answer questions quickly due to mastering content, and those who do so due to lack of effort. So STM appears to signify mastery of the test content and is more prevalent among confident students in higher benchmark levels. In contrast, UTM seems to signify a lack of test-taking effort, appearing more frequently in lower benchmark students and those disengaging from the test. These differentiated insights into test-taking behaviors can have profound implications regarding the use of timing data, which challenge the assumption that rapid responses are only indicative of careless behaviors. As a result, these indicators can enhance the validity of study findings by considering examinee behaviors with more clarity and in more depth.
However, more research is needed to fully understand these indicators. Questions remain about how these variables perform across different subjects, age groups, and cultural contexts, and how they could be applied to different types of test items. Despite these limitations and the need for further research, this study represents an important step forward in the understanding of test-taking behaviors. It opens up a rich new dimension of test analysis that goes beyond the identification of careless responding, offering a far more sophisticated understanding of examinee behaviors and test-taking strategies, and of the reliability and usefulness of computerized large-scale testing.

Table 1
Descriptive statistics of Successful and Unsuccessful Time Management indicators in seconds

Table 2
Correlation coefficients between STM and UTM indicators across subjects

Table 3
Descriptive statistics of the median of the STM and the UTM indicators by median benchmark

Table 4
Descriptive statistics of the STM and the UTM indicators by mathematics responder classification

Table 5
Correlations between Achievement Levels by Benchmark and the Successful and Unsuccessful Time Management Indicators
* p ≤ 0.05

Table 6
Correlations of STM and UTM with motivational variables