Using response time to investigate students' test-taking behaviors in a NAEP computer-based study
© Lee and Jia; licensee Springer. 2014
Received: 18 April 2014
Accepted: 13 August 2014
Published: 17 September 2014
Large-scale survey assessments have been used for decades to monitor what students know and can do. Such assessments aim at providing group-level scores for various populations, with little or no consequence to individual students for their test performance. Students' test-taking behaviors in survey assessments, particularly the level of test-taking effort, and their effects on performance have been a long-standing question. This paper presents a procedure to examine test-taking behaviors using response time collected from a National Assessment of Educational Progress (NAEP) computer-based study, referred to as MCBS.
A five-step procedure was proposed to identify rapid-guessing behavior in a more systematic manner. It involves a non-model-based approach that classifies student-item pairs as reflecting either solution behavior or rapid-guessing behavior. Three validity checks were incorporated in the validation step to ensure reasonableness of the time boundaries before further investigation. Results of behavior classification were summarized by three measures to investigate whether and how students' test-taking behaviors related to student characteristics, item characteristics, or both.
In the MCBS, the validity checks offered compelling evidence that the recommended threshold-identification method was effective in separating rapid-guessing behavior from solution behavior. A very low percentage of rapid-guessing behavior was identified, as compared to existing results for different assessments. For this dataset, rapid-guessing behavior had minimal impact on parameter estimation in the IRT modeling. However, the students clearly exhibited different behaviors when they received items that did not match their performance level. We also found disagreement between students' response-time effort and self-reports, but based on the observed data, it is unclear whether the disagreement was related to how the students interpreted the background questions.
The paper provides a way to identify rapid-guessing behavior, and sheds light on the extent of students' engagement in NAEP and its impact, without relying on students' self-evaluation or incurring additional costs in test design. It reveals useful information about test-taking behaviors in a NAEP assessment setting that has not been available in the literature. The procedure is applicable to future standard NAEP assessments, as well as other tests, when timing data are available.
Large-scale national and international survey assessments, such as the National Assessment of Educational Progress (NAEP), the Programme for International Student Assessment (PISA), and the Trends in International Mathematics and Science Study (TIMSS), have been used for decades to monitor what students know and can do. Those survey assessments are often referred to as low-stakes assessments because they aim to provide group-level scores for various populations, and students taking the assessments receive no academic credit and bear little or no consequences for their test performance. One long-standing question for low-stakes assessments is the level of students' engagement with the test and its effect on performance (Braun, Kirsch, and Yamamoto ; O'Neil et al. ), particularly for students at higher grade levels. Braun et al. () further suggested that differences in engagement with the test might be confounded with differences in the cognitive abilities to be assessed. From a measurement perspective, responses associated with disengagement could lead to model misfit and biased parameter estimation in item response theory (IRT) calibration and scoring (Wise and DeMars ; Wise and Kong ). To improve the quality of parameter estimation in measurement models, and therefore the validity of group score estimates, one solution is to identify responses from individual students showing disengagement with the test/items and remove them from the analysis.
A common approach to evaluating students' engagement with the test in low-stakes assessments is to employ self-report questionnaires. As noted by Wise and Kong (), one limitation of self-reporting is that it is difficult to ascertain how truthfully the students respond to the questionnaires. It is also unclear how the students interpret the questions about their engagement with the test. For example, NAEP collects evidence from the questionnaires on whether students are trying hard on NAEP or are engaged. Twelfth graders who said they did not try hard on NAEP scored higher than those who said they tried harder or much harder compared to most other tests taken in the same year in school (National Assessment Governing Board ). Another approach is to conduct experimental studies to examine whether certain practices, such as offering monetary incentives, can effectively motivate students (Braun et al. ). This type of approach is often implemented on a small scale and is not feasible on a regular basis.
A different school of thought in measuring student engagement is to distinguish students who demonstrate solution behavior from those showing rapid-guessing behavior. Generally, rapid-guessing behavior is represented as responses occurring so rapidly that students either do not have time to fully consider the item (i.e., test speededness) or do not exert effort. The accuracy of such rapid guesses is typically at or near the chance level, as the responses are essentially random. The response time (RT) of rapid guesses is usually very short relative to the amount of time required for the items. A few researchers have proposed IRT-based approaches to model item responses that account for the above-mentioned test-taking behaviors (e.g., Bolt et al. ; Yamamoto and Everson ). Their main concern was how test speededness affected the estimation of IRT model parameters. Hadadi and Luecht () indicated that a better set of detection methods for identifying rapid-guessing behavior may be developed given the availability of detailed data on responses and RTs.
Identifying the RT boundary between solution behavior and rapid-guessing behavior for each item is a crucial step in the analysis, and several methods for threshold identification have been suggested. For instance, Kong et al. () recommended four different ways to decide RT thresholds. One of them uses a common three-second threshold for all items, which is easiest to implement. Another is to visually inspect the RT distribution of each item, which is a more widely used approach (e.g., DeMars ; Setzer et al. ). A gap in the distribution that clearly separates two groups of RTs suggests a possible threshold for the item. The third suggested approach is to set up the threshold based on the amount of reading required for an item. The last is to apply mixture models to fit the RT distribution of each item (also see Meyer ; Schnipke and Scrams ). Kong et al. () analyzed data from an Information Literacy Test using the common three-second rule and several variable-threshold methods, concluding that the common three-second rule performed slightly worse than the others. In a separate study, Hauser and Kingsbury () found it rather conservative to assign a common three-second threshold to all items when some items required more time than the others. They also noted that item thresholds should be unique to each item, depending on the time demand of an item. Ma et al. () further examined, among others, the use of mixture models and a non-parametric model with mathematics items of a computerized adaptive test, finding that neither model was practically useful. Wise and Ma () proposed a new variable-threshold method, called the normative threshold method, which defines the RT threshold of an item as a certain percent of the average RT. This method can be easily applied to assessments with large item pools. 
Conversely, it is possible to detect aberrant behavior of students using person-fit statistics (e.g., Karabatsos ; Meijer and Sijtsma ), many of which are designed to identify general types of aberrances in behaviors by studying patterns of responses (Lee et al. ). The research studies mentioned above used RTs to classify test-taking behaviors. There have also been studies that modeled responses and RTs jointly (e.g., van der Linden ), assuming a unimodal distribution for RTs with elaborate parameterization. Such models are appropriate for application in a single latent class (e.g., representing solution behavior) rather than identifying different test-taking behaviors. Readers may refer to Schnipke and Scrams () and Lee and Chen () for more exhaustive reviews of existing methods.
Large-scale survey assessments have commonly been administered in a paper-and-pencil (P&P) format. More recently, programs such as NAEP and PISA have begun to consider computer-delivered testing formats. A special NAEP study, referred to as the Mathematics Computer-Based Study (MCBS) herein, was conducted in 2011 to assess the benefit of multistage testing in the NAEP administration setting (Xu et al. ). The MCBS provided detailed timing data, which permit the investigation of student behaviors in terms of RT as the students move through the test. It also helped answer a practical, long-term question: can we evaluate the extent of NAEP students' engagement when timing data are available? Because the MCBS had time limits at both stages of the test (see the Methods section for more descriptions), test speededness cannot be completely ruled out, even though disengagement with the test is of primary concern in low-stakes assessments. Thus, this study focuses on general rapid-guessing behavior, which may result from disengagement with the test or from test speededness. The observed effects of rapid-guessing behavior serve as an upper bound of the effects of disengagement with the test.
Because this research is the very first attempt to investigate student behaviors using RTs in NAEP assessments and in low-stakes educational survey assessments at large, we adopted a non-model-based approach to examining the properties of RTs and classifying behaviors. The paper reveals useful information about test-taking behaviors in a NAEP assessment setting that has not been available in the literature. To the best of our knowledge, the existing studies on this topic have typically focused on timing distributions with clear bimodal patterns, and their methods may be less effective when the bimodal patterns are less obvious. As will be discussed, the recommended five-step procedure in this paper aims (a) to strengthen the existing approaches for the circumstance where rapid-guessing behavior is of concern and needs to be assessed, but not all items have clear bimodal RT distributions, and (b) to identify rapid-guessing behavior in a more systematic manner. Step 2 of the procedure identifies time thresholds between rapid-guessing and solution behaviors with different threshold-identification methods. The method we recommend has been explored in the literature but was tailored in this paper to accommodate the new application to the MCBS data. In addition, new and existing validity checks were incorporated in the validation step to ensure reasonableness of the time boundaries before further investigation. The validity checks are important for identifying rapid-guessing behavior effectively while avoiding the misclassification of responses representing solution behavior as rapid guesses.
Data collection in the MCBS
The MCBS was administered under testing conditions (e.g., time limit, test length, and consequences to the students) similar to those of regular NAEP assessments, with the only major difference being the test format.
Items and test forms
Six test forms were administered in a two-stage design for the MCBS. The items were taken from an existing item pool to form five 25-minute blocks. The existing item pool was composed of 8th grade mathematics items from the 2011 NAEP P&P operational assessment.
[Table: Average P+ for the blocks in the MCBS. Columns include the number of items.]
The experimental study involved two testing conditions, which led to two student samples: the Multistage Test (MST) sample and the Control sample. The test given to the MST sample was adaptive at the block level, while the test for the Control sample was linear. In both the MST sample and the Control sample, the students were classified into three performance levels (low, medium, and high) based on their performance on the routing blocks (endnote a; see Xu et al. for more discussion about the classification rule in the MCBS). We name the three subgroups in the MST sample MST_Low, MST_Medium, and MST_High. Similarly, Control_Low, Control_Medium, and Control_High refer to the three subgroups in the Control sample.
A nationally representative sample of 8,400 students participated in this study; about 40% of the students were placed in the MST sample and 60% in the Control sample (endnote b).
Defining RT thresholds
The RT for an item was defined as the total time spent on the item (endnote c). As mentioned previously, we used RTs to classify student-item pairs as reflecting either solution behavior or rapid-guessing behavior and to investigate whether student behaviors were connected to person characteristics, item characteristics, or both. Accomplishing these objectives requires a threshold to be defined for each item that represents the RT boundary between the two test-taking behaviors. Following the convention of the literature, the RT threshold per item is defined as the lower bound for the amount of time an average student (in terms of speed) needs to fully consider the item and answer it. In other words, the RTs classified as representing rapid-guessing behavior may include the time spent browsing through items for some individuals, as well as the time spent reading and processing the materials for others (yet insufficient to answer the items). Clearly, the former may be equally short for all items, but the latter depends on the time demand of individual items.
One complication for the study is that the MCBS was a multistage test and there were two student samples (MST and Control) assigned to the second-stage blocks based on different rules. To determine whether different RT thresholds are necessary for an item given to students from different samples and performance groups, one needs to know how item-level RTs were distributed in the different student samples and in the further disaggregated performance groups. Thus, we first examined the RT distributions for all items by block and by student sample, based on descriptive statistics of the RT distributions and histograms. This analysis showed whether and how the RT distribution of each item differed across student samples in each of the blocks. Second, we focused on data collected from the Control sample and the second-stage blocks. Within each of these blocks, the Control sample was disaggregated into three performance groups and the associated RT distributions were compared numerically and graphically. This analysis suggested whether students of varying performance levels revealed different RT patterns when taking a block that may not match their performance level.
As noted in the Background section, several methods for setting RT thresholds have been considered. The literature also suggests that RT thresholds are likely to be item-specific (e.g., Hauser and Kingsbury )—for example, an item with more text, with a table and/or figure, or with more complicated stems may take more time to read and process. Preliminary analysis of our timing data led to consistent findings, and it appeared unreasonable to use a common threshold for all items. In particular, the three-second rule proposed by Kong et al. () did not work in the MCBS since almost no item had any RTs shorter than three seconds. In addition, our exploratory analysis revealed that about one third of the MCBS items had a unimodal timing distribution with a heavy left tail at short RTs, instead of the clear bimodal pattern shown in Figure 1 (see the Results section for more discussion). Existing methods that rely on the bimodal pattern, such as visual inspection of RT distributions and model-based approaches (i.e., mixture models and non-parametric models), did not work in this situation.
Two approaches can be considered in this circumstance. One is the normative threshold method (Wise and Ma ), whose authors recommended setting the RT threshold of an item to 10 percent of the average RT for that item, up to a maximum threshold value of 10 seconds. The 10-second ceiling helps prevent extremely large thresholds, as the sample mean is sensitive to outliers in RT values. The other is visual inspection of RT distributions with conditional P+ information (VITP).
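The normative threshold rule described above can be illustrated with a minimal sketch. The function name, argument names, and response times below are hypothetical and are not taken from the MCBS; only the 10 percent fraction and the 10-second ceiling come from the text.

```python
from statistics import mean


def normative_threshold(rts, fraction=0.10, ceiling=10.0):
    """Return an item's RT threshold in seconds: a fraction of the mean RT, capped at `ceiling`."""
    if not rts:
        raise ValueError("at least one response time is required")
    return min(fraction * mean(rts), ceiling)


# Hypothetical response times in seconds (illustrative only, not MCBS data).
item_rts = {
    "item_A": [4.0, 35.0, 42.0, 51.0, 68.0],       # mean 40 s, so 10% is 4 s
    "item_B": [90.0, 110.0, 120.0, 130.0, 150.0],  # mean 120 s, so 10% is 12 s, capped at 10 s
}
thresholds = {item: normative_threshold(rts) for item, rts in item_rts.items()}
print(thresholds)
```

Because the threshold depends only on each item's mean RT, the rule assigns a value to every item, including items whose timing data show no sign of rapid guessing.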
[Table 2: A summary of response time thresholds in the MCBS. Columns: number of total/MC items; number of items for each RT threshold.]
Because of the MST design, the RT distribution for some items, especially those in the second-stage blocks, may differ between the student samples (MST and Control) and among the further disaggregated performance groups. For each item with comparable RT distributions between the two student samples, the student samples were combined to produce one set of RT distributions and conditional P+, which yielded a single RT threshold. Alternatively, for items with somewhat different RT distributions for the two student samples, the RT distributions and conditional P+ were examined separately for the student samples, and the plausible RT thresholds based on either sample were compared.
After the thresholds were determined, three validity checks were performed. First, we reviewed the actual items to make sure the thresholds corresponded to the reading load and complexity of individual items. For this validation step, the amounts of time that the majority of the students spent on an item (i.e., mode and median of the students' RTs) were used to quantify item complexity, and whether the item had tables or figures was taken into consideration. Second, we compared the P+ associated with solution behavior with the P+ associated with rapid-guessing behavior. Ideally, the former should be much higher than the chance level, while the latter should be close to the chance level. This validity check was first proposed by Wise and Kong () and employed in Wise and Ma () in developing their normative threshold method. Third, we evaluated the relationship between students' overall scores and their performance on each item, conditioning on their behavior on that item: for each item, we divided the students who took the item into 10 equal-sized score groups ordered by their scores, and then calculated the conditional P+ for each score group. For students judged to be engaged in solution behavior on an item, their performance on the item should be positively related to their overall scores; however, such a relationship is not expected for students judged not to be engaged in solution behavior on that item. It is worth noting that, although the effort-moderated IRT model proposed in Wise and DeMars () was based on the idea behind the third check, this study is the first to use it as a validity check for the identified RT thresholds.
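The second and third validity checks can be sketched as follows for a single item. This is an illustrative sketch under stated assumptions, not the authors' implementation: the record layout `(overall_score, rt_seconds, correct)`, the function names, and the data are hypothetical, and the example uses two score groups instead of ten to keep the data small.

```python
def p_plus(scores):
    """Proportion correct for a list of 0/1 item scores; None when the list is empty."""
    return sum(scores) / len(scores) if scores else None


def p_plus_by_behavior(records, threshold):
    """Second check: P+ for solution behavior (RT > threshold) and for rapid guessing."""
    solution = [correct for _, rt, correct in records if rt > threshold]
    rapid = [correct for _, rt, correct in records if rt <= threshold]
    return {"solution": p_plus(solution), "rapid": p_plus(rapid)}


def p_plus_by_score_group(records, threshold, n_groups=10):
    """Third check: split students into score-ordered groups, then compute P+ per behavior."""
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    result = []
    for g in range(n_groups):
        chunk = ordered[g * n // n_groups:(g + 1) * n // n_groups]
        result.append(p_plus_by_behavior(chunk, threshold))
    return result


# Hypothetical (overall_score, rt_seconds, correct) records for one item.
records = [
    (-1.5, 3.0, 0), (-1.0, 25.0, 0), (-0.5, 4.0, 1), (-0.2, 30.0, 0),
    (0.1, 28.0, 1), (0.4, 5.0, 0), (0.9, 40.0, 1), (1.3, 33.0, 1),
]
overall = p_plus_by_behavior(records, threshold=10.0)
by_group = p_plus_by_score_group(records, threshold=10.0, n_groups=2)
print(overall)
print(by_group)
```

With a reasonable threshold, the "solution" P+ should rise across score groups while the "rapid" P+ should stay near the chance level regardless of score.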
For items with an RT threshold identified, a dichotomous index of item solution behavior (SB) was computed for each student-item pair (see, e.g., Wise and Kong ) as follows: SB = 1 if the RT was greater than the threshold; SB = 0 otherwise. This index was used as an indicator for solution behavior at the person-item level. We summarized the SB index for all students and items as a student by item two-way table, and examined the table marginally and conditionally to form three different measures, aiming to investigate whether and how students' test-taking behaviors related to student characteristics, item characteristics, or both:
Response-time effort (RTE; Wise and Kong ): Aggregate the two-way SB table marginally by student across items. This person-level measure represents the proportion of test items for which a student exhibited solution behavior. RTE can be used to categorize behaviors. One can then study whether RTE correlates with students' performance on the test or with different measures of engagement available for the students.
Response-time fidelity (RTF; Wise ): Aggregate the two-way SB table marginally by item across students. This item-level measure represents the proportion of students exhibiting solution behavior to an item. RTF can be used to examine whether items from a particular block or items with certain characteristics (e.g., in terms of IRT item parameters) tend to evoke rapid-guessing behavior. One can also study whether RTF correlates with item position and content area.
Conditional RTF: Compute the RTF conditional on students' performance level, rather than based on all students as is accomplished in the marginal analysis above. This is a new measure at the item-by-group level. It shows whether and how item RTF relates to students' performance level. (Conditioning RTF on performance was a natural choice with data from a multistage design. Other variables, such as demographic subgroups, may be considered in different applications).
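The SB index and the three measures defined above can be sketched as follows. The nested-dictionary layout, the function names, and the toy data are assumptions made for illustration only; items without an RT threshold are skipped, in line with the definition of the SB index.

```python
from collections import defaultdict


def sb_table(rts, thresholds):
    """SB = 1 if RT > threshold, else 0; items without a threshold are skipped."""
    return {
        student: {item: int(rt > thresholds[item])
                  for item, rt in item_rts.items() if item in thresholds}
        for student, item_rts in rts.items()
    }


def rte(sb):
    """Per-student proportion of items showing solution behavior."""
    return {s: sum(row.values()) / len(row) for s, row in sb.items() if row}


def rtf(sb, students=None):
    """Per-item proportion of students showing solution behavior."""
    totals, counts = defaultdict(int), defaultdict(int)
    for s, row in sb.items():
        if students is not None and s not in students:
            continue
        for item, value in row.items():
            totals[item] += value
            counts[item] += 1
    return {item: totals[item] / counts[item] for item in counts}


def conditional_rtf(sb, groups):
    """RTF computed separately within each performance group."""
    levels = set(groups.values())
    return {lvl: rtf(sb, {s for s, g in groups.items() if g == lvl}) for lvl in levels}


# Hypothetical data: RTs in seconds, thresholds per item, and performance levels.
thresholds = {"i1": 10.0, "i2": 20.0}
rts = {
    "s1": {"i1": 4.0, "i2": 15.0},
    "s2": {"i1": 30.0, "i2": 45.0},
    "s3": {"i1": 25.0, "i2": 12.0},
    "s4": {"i1": 40.0, "i2": 60.0},
}
groups = {"s1": "Low", "s2": "Low", "s3": "High", "s4": "High"}

sb = sb_table(rts, thresholds)
rte_values = rte(sb)
rtf_values = rtf(sb)
cond_rtf = conditional_rtf(sb, groups)
```

Here RTE is the row margin of the student-by-item SB table, RTF is the column margin, and conditional RTF is the column margin restricted to one performance group at a time.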
Some analyses described above involve IRT model parameters. Typically, NAEP uses the three-parameter logistic (3PL) model for dichotomously scored items and the generalized partial credit model for polytomously scored items. In our study, item parameters were estimated by maximum likelihood, and the expected a posteriori (EAP) estimate was computed for each student as their MCBS score.
Response-time effort (RTE) for students
Question 1: How important was it to you to do well on this test? (Not very important, somewhat important, important, or very important)
Question 2: How hard did you try on test compared to other tests? (Tried not as hard, tried about as hard, tried harder, or tried much harder)
Only 0.6% of the responses to these questions were missing, and these were excluded from the computation of the following correlation measures. As will be shown in the Results section, there was a very high concentration on RTE = 1 in the MCBS, which may constrain some measures of association from having high values. To take this issue into account, we considered three measures of association that were defined differently: (a) Pearson correlation, (b) Spearman's rho, and (c) Goodman and Kruskal's gamma. Pearson correlation and Spearman's rho were based on continuous RTE values (i.e., higher for students more consistently showing solution behavior during the test) and scores of the response categories (1 = not very important/tried not as hard to 4 = very important/tried much harder). On the other hand, Goodman and Kruskal's gamma was based only on the number of concordant and discordant pairs of observations on an ordinal scale and should not be largely affected by the skewed RTE distribution. To compute Goodman and Kruskal's gamma, we classified the RTE values into three categories: "High RTE" if RTE was equal to 1, "Medium RTE" if RTE was between 0.8 and 1, and "Low RTE" if RTE was no greater than 0.8. We then treated the RTE categories and the responses to each background question as ordinal variables. The frequency distribution of the response categories was also tabulated. The cutoffs for classifying RTE into high, medium, and low are arbitrary and mainly for purposes of demonstration. Varying the cutoffs did not change the observed relationship between RTE and the background questions in the MCBS. In addition, we looked at the distribution of the responses to each background question per RTE category using clustered bar charts.
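As an illustration of the gamma computation, the sketch below categorizes RTE with the cutoffs stated above and counts concordant and discordant pairs directly. The function names and data are hypothetical. The pairwise loop is quadratic in the number of students, so for a sample of the MCBS size one would more likely count pairs from the cross-tabulated frequencies; that variant is not shown here.

```python
def rte_category(rte):
    """Ordinal RTE category: 3 = High (RTE = 1), 2 = Medium (0.8 < RTE < 1), 1 = Low (RTE <= 0.8)."""
    if rte == 1:
        return 3
    if rte > 0.8:
        return 2
    return 1


def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D), where C and D count concordant and discordant pairs."""
    concordant = discordant = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
            # Pairs tied on either variable are ignored by gamma.
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when all pairs are tied")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical RTE values and self-reported effort (1 = tried not as hard ... 4 = tried much harder).
rte_values = [1.0, 1.0, 0.9, 0.85, 0.6, 1.0]
effort = [4, 3, 3, 2, 1, 2]

categories = [rte_category(v) for v in rte_values]
gamma = goodman_kruskal_gamma(categories, effort)
print(categories, round(gamma, 3))
```

Because tied pairs drop out of both the numerator and the denominator, a heavy concentration of students at RTE = 1 reduces the number of informative pairs without directly shrinking the coefficient.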
The (overall) relationship between RTE and EAP score for the students was evaluated through Pearson correlation of the two variables; their relationship conditioning on student sample and performance level was further examined using scatter plots.
Response-time fidelity (RTF) for items
Because the SB index was defined only for items with an RT threshold, RTF was not available for all items in the MCBS. We first correlated item RTF with item position in each block. There were five content areas assessed in the MCBS: algebra; data analysis, statistics, and probability (abbreviated as data); geometry; measurement; and number properties and operations (abbreviated as number). It was of interest to examine whether the students' test-taking behavior was associated with the items' content area, so a box plot of item RTF by content area was produced. The relationship between RTF and item characteristics was further explored by plotting the RTF values against the estimated item discrimination and difficulty parameters obtained from the IRT models (endnote e); Pearson correlation was also computed.
Conditional RTF by student's performance level
We evaluated conditional RTF, or RTF by students' performance level. Recall that each of the two samples (MST and Control) was disaggregated into low, medium and high performance levels, yielding a total of 6 subgroups (i.e., MST_Low, MST_Medium, MST_High, Control_Low, Control_Medium, and Control_High). Given the experimental design, each routing block was taken by students from all 6 subgroups, while the three second-stage blocks were each taken by students from 4 subgroups—all 3 subgroups from the Control sample and one subgroup from the MST sample that matched the block difficulty. Because student performance was confounded with their test-taking behavior, it would not be surprising if RTF based on the low performing groups were consistently lower than those based on the higher performing groups. However, conditional RTF was computed and plotted to see if, for example, any item had particularly low RTF values for some or all subgroups.
The last part of the analysis concerns the effects of data filtering on parameter estimation based on the results of behavior classification. We calibrated the items under two conditions: one included all responses, while the other only included responses connected with solution behavior (SB = 1). The two sets of item parameter estimates were then compared.
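The filtering step can be sketched as below. Marking filtered responses as missing (`None`) is an assumption about how exclusion might be represented, as is keeping responses that have no SB value. IRT calibration itself is beyond this sketch; the item proportion correct is used only as a simple stand-in to show that the two conditions can yield different item statistics, and it is not the comparison of IRT parameter estimates reported in the paper. All names and data are hypothetical.

```python
def filter_rapid_guesses(responses, sb):
    """Copy the response table, replacing scores with SB = 0 by None (treated as missing).

    Responses without an SB value (e.g., items with no RT threshold) are kept; this is an assumption.
    """
    return {
        student: {item: (score if sb.get(student, {}).get(item, 1) == 1 else None)
                  for item, score in row.items()}
        for student, row in responses.items()
    }


def item_p_plus(responses):
    """Proportion correct per item over non-missing responses."""
    totals, counts = {}, {}
    for row in responses.values():
        for item, score in row.items():
            if score is None:
                continue
            totals[item] = totals.get(item, 0) + score
            counts[item] = counts.get(item, 0) + 1
    return {item: totals[item] / counts[item] for item in counts}


# Hypothetical 0/1 item scores and SB indices.
responses = {
    "s1": {"i1": 0, "i2": 1},
    "s2": {"i1": 1, "i2": 1},
    "s3": {"i1": 1, "i2": 0},
}
sb = {
    "s1": {"i1": 0, "i2": 0},
    "s2": {"i1": 1, "i2": 1},
    "s3": {"i1": 1, "i2": 0},
}

filtered = filter_rapid_guesses(responses, sb)
p_all = item_p_plus(responses)       # condition 1: all responses
p_filtered = item_p_plus(filtered)   # condition 2: solution-behavior responses only
print(p_all, p_filtered)
```

In practice, both response tables would be passed to the same calibration routine so that any difference in the estimates can be attributed to the filtered responses alone.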
Steps involved in analysis of a data set
1. Conduct exploratory analysis with RT data to examine the properties of RTs.
2. Define RT thresholds. We considered two approaches in this study:
   - Visual inspection of RT distributions with conditional P+ information (VITP)
   - Normative threshold method
3. Conduct three validity checks to ensure reasonable RT thresholds. (The following two steps are performed with the validated RT thresholds.)
4. Classify behaviors by defining the SB index, an indicator for solution behavior at the person-item level. Evaluate the two-way SB table at the person level (RTE), the item level (RTF), and the item-by-group level (conditional RTF), and relate these measures to student and item characteristics.
5. Perform data filtering by excluding responses with SB = 0 from IRT calibration. Examine the effects of data filtering on the calibration results.
By examining how RTs were distributed in the two student samples (MST and Control) and in the further disaggregated performance groups, we made the following observations. First, the RT distributions for all routing-block items were similar between the student samples. This observation is expected given that the two student samples were randomly equivalent and the students within each sample were also randomly assigned to either routing block. Second, the RT distributions for items in each second-stage block were somewhat different between the student samples. The differences were mainly at the lower tail of the RT distributions and most noticeable for the Hard-block items, resulting from differences among the performance groups. In particular, the RT distributions for higher-performing students had a much shorter left tail than did the RT distributions for lower-performing students. These observations are not surprising because, by design, each block at the second stage was taken by students of varying performance levels in the MST and the Control samples. Finally, only about two thirds of the items revealed an apparent bimodal shape as in Figure 1, and the rest had a unimodal RT distribution with a heavy left tail at short RTs. When the bimodal RT pattern appeared, it was primarily for items closer to the end of a block or for items taken by students whose performance level was low relative to the difficulty of the items/block.
Defining RT thresholds
Recall that two approaches were applied to determine the RT threshold for each item: the VITP approach and the normative threshold method. As discussed earlier in this section, the RT distributions for the routing-block items were similar between the MST and Control samples, yet different for items in the second-stage blocks (especially at short RTs). Thus, we combined the two samples and produced one set of results (i.e., RT distributions, conditional P+, and average RTs) for items in the routing blocks, yielding only one set of RT thresholds under each of the two methods for each item in the routing blocks. Conversely, for each of the Easy, Medium, and Hard blocks at Stage 2, the RT distributions, conditional P+, and average RTs were examined separately for the two student samples and then compared. It turned out that for items in the Easy and Medium blocks, the plausible RT thresholds were quite close between the Control and MST samples. For students receiving the Easy block, those in the Control sample tended to provide clearer cutoffs than did those in the MST sample (i.e., MST_Low), a result anticipated because the contrast between the chance level and the conditional P+ above the threshold was more distinct for the Control sample than for MST_Low. The same phenomenon was observed for the Medium block. For all items in the Hard block, the RT distributions for MST_High had a much shorter left tail than did the RT distributions for the Control sample; the extended left tail for the Control sample also had chance-level conditional P+. Thus, for both threshold-identification methods, we defined only one set of RT thresholds for items in the second-stage blocks, mostly based on the RT distribution and P+ information for the Control sample.
In the end, under the VITP approach, we followed a 5-, 10-, and 20-second rule to define thresholds for the multiple-choice items. Table 2 provides a summary. Among the 74 multiple-choice items, 64 had an RT threshold identified, and about 42 (i.e., two thirds of the 64 items) had a bimodal RT distribution as in Figure 1. Ten of the 74 items did not have an RT threshold identified based on the VITP approach, suggesting no clear rapid-guessing behavior observed on those items. A 10-second threshold was found adequate for most of the items. In terms of item context and reading load, the two items with a 5-second threshold had a very short stem and the calculation was straightforward—e.g., solving a linear equation or polynomial given the value of the unknown variable. Compared to items with a 10-second threshold, the items with a 20-second threshold tended to include tables and/or figures, involve more text and more complex stems, or require the students to compare the options with the materials provided in the stems.
For comparison, Table 2 also shows the thresholds identified by the normative threshold method. For some of the RT distributions in the MCBS, the 10-second ceiling appeared too conservative, as the gap that clearly separated the bimodal distributions appeared beyond 10 seconds. In different assessments in which the items have varying time demands, the 10-second ceiling may need to be modified to identify rapid-guessing behavior in an effective manner while maintaining the accurate meaning of rapid guessing. In addition, this method assigned a threshold automatically to every item, even if there was no evident rapid-guessing behavior on some items according to the timing and response data.
Next, for items with an RT threshold identified, the SB index was computed for each student-item pair. Among the student-item pairs with non-missing RTs, 98.9% were identified as exhibiting solution behavior under the VITP approach and 99.7% under the normative threshold method. Both percentages are much higher than the percentages of solution behavior reported for other low-stakes assessments (e.g., 94.2% in Wise and DeMars; 89.7% in Wise et al.). Recall that this study concerns general rapid-guessing behavior, which can result from disengagement with the test/items and from test speededness. Although long RTs are usually connected with high engagement with the items, they may also result from low engagement with the items or from distraction by unrelated activities, a situation that is less likely to be identified by behavior-classification methods based on RTs and responses. Thus, the percentage of low engagement with the test/items with short RTs cannot exceed the percentage of rapid-guessing behavior.
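The classification step can be sketched as follows. This is a minimal illustration that assumes an RT at or above the item threshold counts as solution behavior (SB = 1) and a shorter RT as rapid guessing (SB = 0); pairs with a missing RT, or items without an identified threshold, are left unclassified. All names and data are hypothetical.

```python
def sb_index(rt, threshold):
    """Return 1 (solution behavior), 0 (rapid guessing), or None (unclassified)."""
    if rt is None or threshold is None:
        return None
    return 1 if rt >= threshold else 0

def percent_solution_behavior(student_rts, thresholds):
    """Percentage of classified student-item pairs with SB = 1.

    student_rts: list of {item_id: RT in seconds or None}, one dict per student
    thresholds : {item_id: threshold in seconds, or None if none was identified}
    """
    n_classified = n_solution = 0
    for row in student_rts:
        for item, rt in row.items():
            sb = sb_index(rt, thresholds.get(item))
            if sb is not None:
                n_classified += 1
                n_solution += sb
    return 100.0 * n_solution / n_classified if n_classified else None

# Hypothetical thresholds (item C has none) and RTs for three students.
thresholds = {"A": 10.0, "B": 5.0, "C": None}
student_rts = [
    {"A": 25.0, "B": 3.0, "C": 40.0},
    {"A": 8.0, "B": None, "C": 12.0},
    {"A": 10.0, "B": 7.5},
]
pct_sb = percent_solution_behavior(student_rts, thresholds)
print(pct_sb)
```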
We first reviewed the actual items and related the identified thresholds to the reading load and complexity of individual items with respect to the mode and median of the RT distributions and whether there were additional materials (tables or figures) to process (see the discussion in "defining RT thresholds"). For the second validity check, we evaluated the overall response accuracy rate conditional on the SB index. For the VITP approach, the P+ for SB = 1 and the P+ for SB = 0 were equal to 0.576 and 0.207, respectively; the P+ under solution behavior was almost three times the P+ under rapid-guessing behavior, with the latter close to the chance level. The values for the normative threshold method were 0.568 for SB = 1 and 0.176 for SB = 0; both were smaller than the respective values under the VITP approach, and the value for SB = 0 was farther away from 0.2.
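The second validity check amounts to comparing accuracy between the two behavior classes. The sketch below is illustrative only; the function name and the data are hypothetical, and the comparison against a chance level of 0.2 assumes five-option multiple-choice items.

```python
def p_plus_by_sb(pairs):
    """Proportion correct for SB = 0 and SB = 1.

    pairs: iterable of (sb, score) with sb in {0, 1, None} and score in {0, 1};
           unclassified pairs (sb is None) are ignored.
    """
    totals = {0: [0, 0], 1: [0, 0]}  # sb -> [n, n_correct]
    for sb, score in pairs:
        if sb is None:
            continue
        totals[sb][0] += 1
        totals[sb][1] += score
    return {sb: (k / n if n else None) for sb, (n, k) in totals.items()}

# Hypothetical student-item pairs.
pairs = [(1, 1), (1, 1), (1, 0), (1, 1),
         (0, 0), (0, 0), (0, 1), (0, 0), (0, 0),
         (None, 1)]
p_plus = p_plus_by_sb(pairs)
chance_level = 1 / 5  # assumed: five response options per item
print(p_plus, chance_level)
```

Thresholds look reasonable under this check when the P+ for SB = 0 sits near the chance level and the P+ for SB = 1 is clearly above it.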
Response-time effort (RTE) for students
The RTE index for a student was defined as the average of the SB index across the items taken by the student. Among the 8,401 students, about 85% had an RTE value equal to 1, which means that those students were classified as showing solution behavior consistently throughout the test. Roughly 14% of the students engaged in solution behavior on at least 80% (but less than 100%) of the items they received. Only 0.9% of the students had an RTE below 0.8; the minimum RTE was 0 (this student, who was in the MST group, spent 1.7 minutes on the routing block and 2.3 minutes on the Easy block). The reliability estimate of the RTE index was roughly 0.76, an acceptable level of reliability for the classification of behaviors.
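As a minimal sketch of this definition, the code below averages the SB index over the classified items for each student and groups students into the three RTE bands reported above. Function names and data are hypothetical, and unclassified items (SB = None) are assumed to be excluded from the average.

```python
def rte(sb_values):
    """Mean SB index over a student's classified items (None if there are none)."""
    valid = [sb for sb in sb_values if sb is not None]
    return sum(valid) / len(valid) if valid else None

def rte_bands(rte_values):
    """Count students with RTE = 1, 0.8 <= RTE < 1, and RTE < 0.8."""
    bands = {"rte_1": 0, "rte_0.8_to_1": 0, "rte_below_0.8": 0}
    for value in rte_values:
        if value is None:
            continue
        if value == 1:
            bands["rte_1"] += 1
        elif value >= 0.8:
            bands["rte_0.8_to_1"] += 1
        else:
            bands["rte_below_0.8"] += 1
    return bands

# Hypothetical SB indices: one row per student, one column per item.
students_sb = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 0, 0, None, 1],
    [None, None, None, None, None],
]
rte_values = [rte(row) for row in students_sb]
bands = rte_bands(rte_values)
print(rte_values, bands)
```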
All three measures of association indicated a very weak positive correlation between RTE and the responses to either of the survey questions listed in the Methods section. For Question 1, the estimated Pearson correlation, Spearman's rho, and Goodman and Kruskal's gamma were equal to 0.082, 0.076, and 0.161, respectively. For Question 2, the corresponding results were 0.006, 0.026, and 0.061. Given that the RTE values were so heavily concentrated at 1, a strong positive correlation for either question would require the marginal distribution of the responses to show a clear decreasing trend from 4 (very important/tried much harder) to 1 (not very important/tried not as hard), possibly with a distinct proportion of students choosing 4. However, this is not the case in the MCBS. For Question 1, the percentages were 27.9% for "very important", 38.4% for "important", 26.8% for "somewhat important", and 6.3% for "not very important". For Question 2, the percentages were 6.4% for "tried much harder", 16.8% for "tried harder", 57.7% for "tried about as hard", and 18.6% for "tried not as hard".
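Of the three measures, Goodman and Kruskal's gamma is the least commonly available in general-purpose tools. It is defined as (C - D) / (C + D), where C and D are the numbers of concordant and discordant pairs and tied pairs are ignored. The sketch below is a direct pairwise implementation under that definition; the function name and the data are hypothetical, and the quadratic loop is meant for illustration rather than for a sample of several thousand students.

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D); pairs tied on x or y are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        return None  # every pair is tied on at least one variable
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical RTE values and survey responses coded 1 to 4.
rte_values = [1.0, 0.9, 0.8, 0.7]
survey = [4, 2, 3, 1]
gamma = goodman_kruskal_gamma(rte_values, survey)
print(round(gamma, 3))
```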
Response-time fidelity (RTF) for items
Conditional RTF by Students' Performance Level
Effects of data filtering based on the results of behavior classification
Given that only 1.1% of the responses in the entire data set were identified as rapid guesses, excluding them from IRT modeling should not have a substantial impact on item calibration. This expectation was confirmed. The estimated discrimination and guessing parameters were almost identical with and without the rapid guesses. The difference between the estimated difficulty parameters was less than 0.06 for all items but one. Although the resulting differences in item parameter estimates were not substantial in magnitude, among the five blocks in the MCBS the data filtering appeared to have the greatest impact on the difficulty parameter estimates for items in the Easy block (see Wise and DeMars for similar findings with a different test format).
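The filtering step itself is simple to express. The sketch below recodes responses flagged as rapid guesses (SB = 0) as missing before calibration; the function names and data are hypothetical, and the IRT estimation, which would be run on the filtered matrix with whatever software the program uses, is not shown.

```python
def filter_rapid_guesses(responses, sb):
    """Return a copy of `responses` with SB = 0 entries set to None (missing).

    responses: list of rows of 0/1 scores (None = already missing)
    sb       : matching rows of SB indices in {0, 1, None}
    Unclassified entries (SB = None) are kept unchanged.
    """
    return [
        [None if flag == 0 else score for score, flag in zip(resp_row, sb_row)]
        for resp_row, sb_row in zip(responses, sb)
    ]

def percent_filtered(sb):
    """Percentage of classified entries that are rapid guesses."""
    flags = [flag for row in sb for flag in row if flag is not None]
    return 100.0 * flags.count(0) / len(flags) if flags else None

# Hypothetical scored responses and SB indices for two students, three items.
responses = [[1, 0, 1], [0, 0, 1]]
sb = [[1, 0, 1], [0, 1, None]]
filtered = filter_rapid_guesses(responses, sb)
print(filtered, percent_filtered(sb))
```

Comparing item parameter estimates from the original and filtered matrices then shows how much rapid guessing affects calibration.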
Conclusions and discussion
This paper presents a five-step procedure to examine students' test-taking behaviors using RTs collected from a standard NAEP assessment setting. Instead of relying on students' self-evaluation or incurring additional costs in test design, the paper provides a way to address rapid-guessing behavior and its impact, including the long-standing question about the extent of NAEP students' engagement. Further, establishing the analysis procedure based on NAEP program data ensures that the procedure is readily applicable to future standard operational NAEP assessments when timing data become available. The procedure is also applicable to other tests with detailed timing data.
In the MCBS, the validity checks offered compelling evidence that the VITP approach was effective in separating rapid-guessing behavior from solution behavior and outperformed the normative threshold method. A very low percentage of rapid-guessing behavior was identified in our data, as compared to existing results for different assessments. For this particular dataset, rapid-guessing behavior had minimal impact on parameter estimation in the IRT modeling. However, the students clearly exhibited different behaviors when there were discrepancies between their performance level and block difficulty. In particular, students who performed worse on the routing blocks were less likely to show rapid-guessing behavior when routed to the Easy block than when routed to the Medium or the Hard block. We also found disagreement between RTE and the students' self-reports, but whether the disagreement was related to how the students interpreted the background questions cannot easily be determined from the observed data.
Test-taking behaviors are tied to specific testing conditions. It is known that in low-stakes assessments, students from higher grades tend to exhibit more rapid-guessing behavior than those from lower grades (Ma et al.; Wise et al.), and mathematics items tend to elicit less rapid-guessing behavior than reading items (Wise et al.). We expect to obtain similar findings in other NAEP mathematics assessments delivered to representative 8th graders on computers with the same test settings, although there are limitations in generalizing the findings to NAEP assessments for different grades and subjects and to different assessments. The purpose of classifying behaviors is to identify responses associated with rapid-guessing behavior. Steps 4 and 5 of the procedure examine two possible applications of the classification results. One is to relate the three measures to student and item characteristics (step 4), which can inform test design and item development. For instance, items with low RTF may be reviewed and modified by test developers for future use. The other is to filter out responses associated with SB = 0 to mitigate the impact of rapid-guessing behavior on the estimation of IRT model parameters (step 5), which is important when the parameter estimates with and without filtering differ considerably. Both are feasible applications for testing programs, including NAEP, to consider in practice.
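The item review mentioned above could start from a listing such as the one produced by the sketch below. It assumes that the RTF of an item is the proportion of classified student-item pairs on that item showing solution behavior (the item-level analogue of RTE); the review cutoff, the function names, and the data are hypothetical and are not taken from the study.

```python
def rtf_by_item(sb_by_student):
    """Proportion of SB = 1 per item.

    sb_by_student: list of {item_id: SB} dicts, SB in {0, 1, None};
                   unclassified pairs (None) are ignored.
    """
    totals = {}  # item_id -> (n_classified, n_solution)
    for row in sb_by_student:
        for item, sb in row.items():
            if sb is None:
                continue
            n, k = totals.get(item, (0, 0))
            totals[item] = (n + 1, k + sb)
    return {item: k / n for item, (n, k) in totals.items()}

def flag_low_rtf(rtf, cutoff):
    """Items whose RTF falls below a review cutoff chosen by the analyst."""
    return sorted(item for item, value in rtf.items() if value < cutoff)

# Hypothetical SB indices for four students on two items.
sb_by_student = [
    {"A": 1, "B": 1},
    {"A": 1, "B": 0},
    {"A": 1, "B": None},
    {"A": 0, "B": 1},
]
rtf = rtf_by_item(sb_by_student)
flagged = flag_low_rtf(rtf, cutoff=0.7)
print(rtf, flagged)
```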
In contrast to existing methods on this topic, the VITP approach is likely to show more advantages when more items have unclear bimodal patterns, primarily because incorporating response accuracy into the classification process can be quite informative. For testing programs with large item pools, implementing the VITP approach could be practically challenging, as the RT thresholds need to be identified item by item. One possible solution is to start with an item pool of feasible size with representative items to establish a baseline for threshold identification using the VITP approach, and then scale up to operational item pools using more automated methods.
There are other promising applications beyond the scope of this paper. For example, Setzer et al. (2013) investigated the variation in log-odds of solution behavior using a three-level random intercept hierarchical generalized linear model, and used regression to examine how RTF correlated with factors such as item difficulty and item position. In the field of survey research, Couper and Kreuter (2013) modeled the RT of a question as a function of item characteristics, respondent characteristics, and interviewer characteristics.
Caution is needed in employing the procedure and interpreting the results. First, the results of behavior classification rely on the RT thresholds and are not meaningful unless the identified thresholds are validated. Second, the procedure is employed to identify general rapid-guessing behavior; it does not distinguish rapid guessing due to disengagement with the test from rapid guessing due to test speededness. We did not further differentiate the two types of rapid guessing because of the very low percentage of SB = 0, which implies an even smaller percentage due to either type. In applications in which it is necessary to distinguish one type from the other, one may examine the total section/test time of the students with lower RTE as a rough estimate of time pressure. Last but not least, the line of research we followed assumes that students with rapid-guessing behavior show very short RTs. Short RTs alone do not indicate rapid guesses, because students with preknowledge of items may have short RTs but high accuracy. Conversely, long RTs might result from a lack of engagement with the items or from distraction by unrelated activities, rather than from high engagement. Our analysis may be supplemented with process data gathered from log files to gain a more comprehensive understanding of individual test-taking behaviors.
a We use the term "performance level" rather than "ability level" because the MCBS was low-stakes, and hence the students' performance on the MCBS does not necessarily reflect their true ability (depending on whether they give their best effort).
b The MCBS received OMB approval (OMB# 1850-0790 v.28).
c If a student never clicked on an item, the RT was missing. Our analysis was based on non-missing RTs.
d One may argue that defining RT thresholds based on visual inspection is subjective to some extent. Different raters may arrive at RT thresholds that have low exact agreement but are close (e.g., Ma et al.). Our experience indicates that the patterns of conditional P+ are typically quite different under the two behaviors: staying consistently above the chance level under solution behavior versus fluctuating around the chance level under rapid-guessing behavior. So visually inspecting the RT distributions in conjunction with conditional P+ information may not be as subjective as one might think. Conversely, as noted in the Background section, existing studies have shown little gain in classifying behaviors with operational data using more objective but sophisticated methods such as mixture models and non-parametric models (Kong et al.; Ma et al.). Thus, our approach is reasonable, although not fully objective, as a new application to NAEP data.
e The responses classified as rapid guesses were excluded from the estimation of the IRT item parameters.
YHL and YJ reviewed the literature, designed and conducted the analyses, and prepared the manuscript and revisions. Both authors read and approved the final manuscript.
The authors gratefully acknowledge the significant contributions of the NAEP Design and Analysis Committee, a subgroup of whom reviewed results from the studies and offered helpful recommendations in February 2011, October 2011, and June 2012.
Foremost, we would like to thank Ruopei Sun and Jonathan Guglielmon from the Center for Data Analysis Research at ETS for conducting a large portion of the analyses presented herein. In addition, this research benefited from comments from Andreas Oranje, Shelby Haberman, Matthias von Davier, Tim Moses, Joanna Gorin, and two anonymous referees.
This research was funded by the National Center for Education Statistics (NCES) within the Institute of Education Sciences (IES) of the U.S. Department of Education under Contract Award No. ED-07-CO-0107. The opinions expressed in this paper are solely those of the authors and not of ETS, NCES, or any of their affiliates.
- Bolt DM, Cohen AS, Wollack JA: Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement 2002, 39: 331–348. 10.1111/j.1745-3984.2002.tb01146.x
- Braun H, Kirsch I, Yamamoto K: An experimental study of the effects of monetary incentives on performance on the 12th-grade NAEP Reading assessment. Teachers College Record 2011, 113(11):2309–2344.
- Couper M, Kreuter F: Using paradata to explore item level response times in surveys. Journal of the Royal Statistical Society, A 2013, 176(Part 1):271–286. 10.1111/j.1467-985X.2012.01041.x
- Cronbach LJ: Coefficient alpha and the internal structure of tests. Psychometrika 1951, 16(3):297–334. 10.1007/BF02310555
- DeMars CE: Changes in rapid-guessing behavior over a series of assessments. Educational Assessment 2007, 12(1):23–45. 10.1080/10627190709336946
- Hadadi A, Luecht RM: Some methods for detecting and understanding test speededness on timed multiple-choice tests. Academic Medicine 1998, 73(10):47–50. 10.1097/00001888-199810000-00042
- Hauser C, Kingsbury GG: Individual score validity in a modest-stakes adaptive educational testing setting. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA; 2009.
- Karabatsos G: Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education 2003, 16: 277–298. 10.1207/S15324818AME1604_2
- Kong X, Wise SL, Harmes JC, Yang S: Motivational effects of praise in response-time-based feedback: A follow-up study of the effort-monitoring CBT. Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Francisco, CA; 2006.
- Kong X, Wise SL, Bhola DS: Setting the response time threshold parameter to differentiate solution behavior from rapid-guessing behavior. Educational and Psychological Measurement 2007, 67: 606–619. 10.1177/0013164406294779
- Lee Y-H, Chen H: A review of recent response-time analyses in educational testing. Psychological Test and Assessment Modeling 2011, 53(3):359–379.
- Lee Y-H, Lewis C, von Davier AA: Monitoring the quality and security of multistage tests. In Computerized multistage testing: theory and applications. Edited by: Yan D, von Davier AA, Lewis C. CRC Press, New York; 2014:285–300.
- Ma L, Wise SL, Thum YM, Kingsbury G: Detecting response time threshold under the computer adaptive testing environment. Paper presented at the annual meeting of the National Council of Measurement in Education, New Orleans, LA; 2011.
- Meijer RR, Sijtsma K: Methodology review: evaluating person fit. Applied Psychological Measurement 2001, 25: 107–135. 10.1177/01466210122031957
- Meyer PJ: A mixture Rasch model with response time components. Applied Psychological Measurement 2010, 34: 521–538. 10.1177/0146621609355451
- National Assessment Governing Board: NAEP 12th grade participation and motivation: Preliminary recommendations. 2005. [http://www.nagb.org/content/nagb/assets/documents/policies/NAEP%2012th%20Grade%20Participation%20%20Motivation.pdf] Last accessed 8 July 2014.
- O'Neil HF, Sugrue B, Baker EL: Effects of motivational interventions on the National Assessment of Educational Progress Mathematics performance. Educational Assessment 1995, 3(2):135–157. 10.1207/s15326977ea0302_2
- Schnipke DL, Scrams DJ: Modeling item response times with a two-state mixture model: a new method of measuring speededness. Journal of Educational Measurement 1997, 34: 213–232. 10.1111/j.1745-3984.1997.tb00516.x
- Schnipke DL, Scrams DJ: Exploring issues of examinee behavior: insights gained from response-time analyses. In Computer-based testing: building the foundation for future assessments. Edited by: Mills CN, Potenza M, Fremer JJ, Ward W. Lawrence Erlbaum Associates, Hillsdale, NJ; 2002:237–266.
- Setzer JC, Wise S, van den Heuvel J, Ling G: An investigation of examinee test-taking effort on a large-scale assessment. Applied Measurement in Education 2013, 26(1):34–49. 10.1080/08957347.2013.739453
- van der Linden WJ: A hierarchical framework for modeling speed and accuracy on test items. Psychometrika 2007, 72: 287–308. 10.1007/s11336-006-1478-z
- Wise S: An investigation of the differential effort received by items on a low-stakes computer-based test. Applied Measurement in Education 2006, 19(2):95–114. 10.1207/s15324818ame1902_2
- Wise S, DeMars C: An application of item response time: the effort-moderated IRT model. Journal of Educational Measurement 2006, 43(1):19–38. 10.1111/j.1745-3984.2006.00002.x
- Wise S, Kong X: Response time effort: a new measure of examinee motivation in computer-based tests. Applied Measurement in Education 2005, 18(2):163–183. 10.1207/s15324818ame1802_2
- Wise SL, Ma L: Setting response time thresholds for a CAT item pool: the normative threshold method. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, Canada; 2012.
- Wise S, Bhola DS, Yang S: Taking the time to improve the validity of low-stakes tests: the effort-monitoring CBT. Educational Measurement: Issues and Practice 2006, 25(2):21–30. 10.1111/j.1745-3992.2006.00054.x
- Wise V, Wise S, Bhola D: The generalizability of motivation filtering in improving test score validity. Educational Assessment 2006, 11(1):65–83. 10.1207/s15326977ea1101_3
- Wise S, Pastor DA, Kong X: Correlates of rapid-guessing behavior in low-stakes testing: Implications for test development and measurement practice. Applied Measurement in Education 2009, 22(2):185–205. 10.1080/08957340902754650
- Wise SL, Ma L, Kingsbury GG, Houser C: An investigation of the relationship between time of testing and test-taking effort. Paper presented at the annual meeting of the National Council on Measurement in Education, Denver, CO; 2010.
- Xu X, Oranje A, Mazzeo J, Kulick E: An adaptive approach for group-score assessments. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, British Columbia, Canada; 2012.
- Yamamoto K, Everson H: Modeling the effects of test length and test time on parameter estimation using the hybrid model. In Applications of latent trait and latent class models in the social sciences. Edited by: Rost J. Waxmann, Münster, Germany; 1997:89–98.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.