### Search strategy

To conduct a thorough literature search, four strategies were employed: (a) academic database searching; (b) Internet browsing; (c) expert consultation; and (d) backward and forward citation searching. Data collection was conducted by the second author and completed on July 31, 2020. A full description of each search strategy is provided below, in the order conducted.

#### Academic database search

The first approach employed to locate relevant articles consisted of searching the following academic databases: PsycINFO (via Ovid); ERIC (via EBSCOhost); Education Source (via EBSCOhost); and Academic Search Premier (via EBSCOhost). These databases cover journal articles across multiple fields, such as psychology (PsycINFO), education (ERIC and Education Source), and statistics (Academic Search Premier). The key terms employed across databases were “rapid guess” and “response time”. As the term “guess” takes multiple forms (e.g., guess, guesses, guessing), “rapid guess” was entered with an asterisk (“*”) as a truncation symbol so that all variants of the root word would be retrieved. To narrow the search and improve precision, the key terms “rapid guess*” and “response time” were combined with the Boolean operator “AND”. Additionally, only studies published in English were included. No other initial restrictions were placed on the search.

#### Internet browsing

An Internet search was conducted via Google Scholar to strengthen coverage of grey literature not indexed in the academic databases noted above. The key terms used in the Internet search were identical to those used in the academic database search. Results in Google Scholar were sorted by relevance (little is known about how Google Scholar ranks its results; Haddaway et al., 2015). Although the first 1000 results were accessible, the return rate of relevant articles decreased steadily as we progressed through the results. For example, only 12 items (approximately 3.3%) from the 301st to the 660th result were potentially relevant to the topic of the current review. Due to this low hit rate, our search was limited to the first 660 results.

#### Expert consultation

This search strategy was conducted by directly contacting the following researchers known to have conducted work and/or published extensively on the topic of RG: Steve Wise (Northwest Evaluation Association, USA), Megan Kuhfeld (Northwest Evaluation Association, USA), Sara Finney (James Madison University, USA), Dena Pastor (James Madison University, USA), Jim Soland (University of Virginia, USA), and Brandi Weiss (George Washington University, USA). Each individual was contacted via email to ascertain whether they had conducted unpublished research that met our inclusion criteria and/or knew of such research authored by others. All communication and article retrieval were completed by July 1, 2020.

#### Citation searching

Beyond the search strategies described above, backward citation searching was also employed. This was done by searching the reference lists of two pertinent review articles (Silm et al., 2020; Wise, 2017) identified by the first author, as well as those of all articles found to meet our eligibility criteria (described below) from the academic database, Internet, and expert consultation searches. This search was conducted using both the Social Sciences Citation Index and Google Scholar. All studies found through backward citation searching were evaluated against the eligibility criteria and, if these were met, were included and their reference lists searched in turn.

Upon completing the backward citation search, forward citation searching (i.e., searching for studies that cited the manuscript of interest) was employed for all studies that met the eligibility criteria from the search strategies noted above. This was done by entering the title of each eligible study into Google Scholar and following its *Cited by* link to examine the studies citing it. Any study included through this strategy also underwent backward and forward citation searching. This process was repeated until no new articles met the eligibility criteria. Both citation search strategies were completed by July 31, 2020.

### Eligibility criteria

To be included in this meta-analysis, studies had to meet the eligibility criteria set forth along three dimensions: (a) data type; (b) RG identification methodology; and (c) outcomes.

#### Data type

Only studies that utilized empirical data to study RG threshold procedures were included. Empirical data could be obtained from either primary or secondary data collections, provided they came from an unspeeded, low-stakes, computer-administered, group-administered classroom, formative, or accountability test. Only low-stakes power (i.e., unspeeded) tests were included to ensure that RG largely reflected test disengagement rather than test speededness.^{Footnote 5} No further restrictions were placed on examinee (e.g., age, country of origin, ethnicity, language) or assessment (e.g., content area, length, item types) characteristics. However, data obtained from simulation studies were excluded.

#### RG identification methodology

Although there are multiple proxies for identifying RG (e.g., self-report measures, person-fit statistics), this meta-analysis only included studies that utilized RT threshold procedures. To be included, studies had to investigate RG using two or more threshold approaches. These thresholds could be variants of a single procedure (e.g., different percentiles) or come from two of the following procedures: (a) surface feature; (b) common k-second; (c) percentile; (d) bimodal distribution; and (e) response time and accuracy information.

#### Outcomes

The outcomes of interest were differences between RT thresholds on three categories of variables: (a) descriptives; (b) measurement properties; and (c) performance. For a study to be included, it must have presented quantitative results on one or more of these outcomes. A quantitative result was defined as any test statistic (e.g., χ^{2}, Z, t, F, \(\widehat{p}\)) necessary for computing a Cohen’s *d* (standardized difference), Cohen’s *h* (difference in proportions), or a correlation effect size (more detail on these effect size calculations is provided below).

Within the descriptive category of variables, we coded for differences in the proportion/percentage of RG responses identified and the proportion/percentage of examinees engaging in RG. Concerning the latter, if not explicitly classified in the original article, a cut-score of 0.9 on the response time effort index (RTE; i.e., the proportion of responses not identified as RG) was used to classify motivated and unmotivated test takers. This cut-off (i.e., an unmotivated examinee employed RG on 10% or more of the items administered) was first proposed by Wise and Kong (2005) and has since been used extensively in applied research (e.g., Rios et al., 2014). Consequently, the proportion of RG examinees was calculated by subtracting the proportion of participants with RTE equal to or greater than 0.9 from 1.
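The RTE-based classification described above can be sketched in a few lines of Python (the function and variable names are hypothetical; the original analyses were not conducted with this code):

```python
# Hypothetical sketch of the RTE index and the Wise and Kong (2005) 0.9
# cut-off described above; `rg_flags` is a per-examinee list marking each
# response as a rapid guess (1) or solution behavior (0).

def rte(rg_flags):
    """Response time effort: proportion of responses NOT flagged as RG."""
    return 1 - sum(rg_flags) / len(rg_flags)

def rg_examinee_proportion(all_flags, cut=0.9):
    """Proportion of unmotivated examinees: 1 minus the proportion of
    examinees whose RTE is equal to or greater than the cut-off."""
    motivated = sum(1 for flags in all_flags if rte(flags) >= cut)
    return 1 - motivated / len(all_flags)
```

For example, an examinee who rapidly guesses on 1 of 10 items has an RTE of 0.9 and would still be classified as motivated under this cut-off.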

In regard to measurement properties, we were interested in how the choice of RT threshold procedure was associated with differences in: (a) average item difficulty (measured using proportion correct and/or IRT calibration estimates) after removing RG responses or examinees; and (b) average item discrimination (measured using an item-total correlation and/or IRT calibration estimates) after removing RG responses or examinees.

The last outcome variable of interest was the difference in average sample performance between RT thresholds. This variable could be reported as the mean raw score, scale score, or theta estimate for the total sample after removing RG responses or examinees. This level of aggregation was chosen given that most low-stakes educational tests report group-level performance for monitoring and accountability efforts.

### Variable coding

Five variables were identified as potential factors that could account for differences in the outcomes of interest: (a) examinee age; (b) test subject; (c) test length; (d) threshold typology pairing; and (e) threshold pairing variability. The first three variables were included as they have been found to moderate the extent of RG observed in operational testing (meaning more potential variability in the number of RG responses), while the last two variables were the main independent variables of interest. A detailed description of each variable is presented below.

#### Age

RG has been shown to vary across age groups, with older examinees more likely to engage in unmotivated behaviors (Goldhammer et al., 2016). As a result, the average age of the sample was coded. Operationally, if a primary author did not provide the sample’s average age, examinee grade level was used as a proxy. As an example, age 6 was used for 1st-year primary school students, 13 was coded for 8th-grade students, and 18 was utilized for college freshmen. Additionally, if examinees were reported to come from a range of grades, the midpoint was used for the group. For example, age 20 was imputed for a sample of undergraduate college students. If neither age nor grade was provided, this variable was coded as missing.
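A minimal sketch of these imputation rules, assuming the simple grade-to-age offset implied by the examples above (age ≈ grade + 5); the function name and exact offset are illustrative, not taken from the authors' coding materials:

```python
# Hypothetical sketch of the age-coding rules: reported mean age is used
# when available; otherwise grade level is converted to an age proxy
# (age = grade + 5, consistent with 6 for 1st grade, 13 for 8th grade,
# and 18 for college freshmen); for a grade range, the midpoint is used.

def impute_age(mean_age=None, grades=None):
    if mean_age is not None:
        return mean_age
    if grades:
        mid_grade = (min(grades) + max(grades)) / 2
        return mid_grade + 5
    return None  # neither age nor grade reported: coded as missing
```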

#### Test subject

Prior research has suggested that test subject can moderate participants' test-taking effort. As an example, Kiplinger and Linn (1994) found that more than half of students in both grade 8 and grade 12 reported expending significantly more effort on math tests than on tests in other subjects. To account for this potential moderation, we dichotomously coded a test’s subject as either “STEM or mixed subject” or “non-STEM”. Mixed-subject tests were defined as those that included both STEM and non-STEM content.

#### Test length

Test takers have been shown to exhibit more disengaged responses as a test grows in length, potentially due to issues of cognitive fatigue (e.g., Wise & Kingsbury, 2016). As a result, test length was considered as a moderator that might impact the identification of RG responses, given that longer tests may be associated with greater variability in RG.

#### Threshold typology pairing

As described earlier, there are three main typologies of RT thresholds, based on the utilization of: (a) NED; (b) RT; and (c) RTRA. In the present study, this led to six different comparison pairings, three for variants of RT thresholds found within the same typology (e.g., comparison of variants within the NED typology) and the remaining three for RT thresholds that differed between typologies. Specifically, each RT threshold fell into one of the following pairings: (a) NED–NED; (b) RT–RT; (c) RTRA–RTRA; (d) NED–RT; (e) NED–RTRA; and (f) RT–RTRA.
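The six pairings follow from taking all unordered pairs of the three typologies with repetition allowed, which a short sketch can verify:

```python
# Enumerating the six threshold typology pairings: three within-typology
# (e.g., NED-NED) and three between-typology (e.g., NED-RT) comparisons.
from itertools import combinations_with_replacement

TYPOLOGIES = ("NED", "RT", "RTRA")
PAIRINGS = ["-".join(pair) for pair in combinations_with_replacement(TYPOLOGIES, 2)]
# PAIRINGS contains: NED-NED, NED-RT, NED-RTRA, RT-RT, RT-RTRA, RTRA-RTRA
```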

#### Threshold pairing variability

To account for variance in RT threshold comparisons, a dummy-coded variable was created to signify whether a comparison was within or between threshold typology/typologies (within served as the reference).

### Interrater agreement

Interrater reliability was computed for three distinct stages: title and abstract screening, full-text reviewing, and variable coding. Rayyan (https://rayyan.qcri.org) was employed for interrater reliability coding in the title and abstract and full-text review phases, while Excel was used for variable coding. Prior to coding for each stage, the principal investigator provided training to the second author (a Ph.D. student in educational measurement) that comprised discussing the objectives and evaluation criteria of the stage, reviewing each variable’s operational definition, and engaging in joint coding of a small percentage of articles. Upon completion of training, the second author was responsible for all coding across stages, while the first author coded 20% of articles at each stage to evaluate interrater reliability. An interrater agreement value of 0.80 was set as the criterion for establishing rater consistency. Any inconsistent decisions across raters were resolved through discussion and consensus. For the first stage (i.e., title and abstract screening), the two authors were in high agreement on article inclusion, with an interrater agreement of 0.91. For the other two review stages, no conflicts arose, with interrater agreement equal to 1.0.
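For reference, simple percent agreement of the kind reported above can be computed as follows (a generic sketch with hypothetical names, not the authors' code):

```python
# Generic sketch: percent agreement between two raters' inclusion decisions
# over the same set of articles (e.g., 1 = include, 0 = exclude).

def percent_agreement(rater_a, rater_b):
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)
```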

### Statistical methods

The sections that follow describe the procedures for: (a) calculating effect sizes; (b) evaluating publication bias; (c) identifying outliers; (d) estimating average effect sizes and effect size heterogeneity; and (e) performing moderator analyses.

#### Calculating effect sizes

Effect sizes were calculated separately for continuous, proportional, and correlational data. Concerning the first, continuous data were presented for the following outcome variables: IRT item discrimination parameter estimates, IRT item difficulty parameter estimates, and group test performance. To calculate the effect sizes for these variables, Cohen’s *d* formula was used:

$$d = \left| {\frac{{ \overline{M}_{2} - \overline{M}_{1} }}{{\sqrt {\frac{{\left( {n_{2} - 1} \right)S_{2}^{2} + \left( {n_{1} - 1} \right)S_{1}^{2} }}{{n_{1} + n_{2} - 2}}} }}} \right|,$$

(1)

where \(\overline{M}_{1}\) and \(\overline{M}_{2}\) are the sample means for threshold 1 and threshold 2, \({n}_{1}\) and \({n}_{2}\) are the corresponding sample sizes, and \({S}_{1}\) and \({S}_{2}\) are the corresponding standard deviations of the outcome. As the direction of the difference was not of interest, the absolute value of the effect size was computed. Furthermore, the variance of Cohen’s *d* was calculated as:

$${v}_{d}=\frac{{n}_{1}+{n}_{2}}{{n}_{1}{n}_{2}}+\frac{{d}^{2}}{2\left({n}_{1}+{n}_{2}\right)},$$

(2)

where *d* is the absolute value of Cohen’s *d* calculated from formula (1) above. The computation of Cohen’s *d* was completed in the *R* package *compute.es* (Del Re, 2020).
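Equations (1) and (2) translate directly to code (the authors used the *compute.es* R package; this Python sketch is for illustration only):

```python
# Sketch of Eq. (1): absolute Cohen's d with a pooled standard deviation,
# and Eq. (2): its sampling variance.
import math

def cohens_d(m1, m2, s1, s2, n1, n2):
    pooled_sd = math.sqrt(((n2 - 1) * s2**2 + (n1 - 1) * s1**2) / (n1 + n2 - 2))
    return abs(m2 - m1) / pooled_sd

def var_d(d, n1, n2):
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
```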

Proportional data were reported for the following dependent variables: (a) proportion of examinees engaging in RG; (b) proportion of responses identified as RG; and (c) CTT item difficulty values (i.e., proportion of correct responses) for effortful test takers. As the power to detect differences in proportions is dissimilar across studies due to unequal sample sizes (Cohen, 1988), a nonlinear transformation, defined as \(\varphi\), was applied to provide equal detectability of outcomes. Given the estimated proportions (\(\widehat{p}\)) of two thresholds on an outcome of interest, \(\varphi\) is computed via the formula:

$$\varphi = 2\arcsin\sqrt{\hat{p}}.$$

(3)

This transformation was then used to calculate an effect size for differences between proportions for a given threshold pairing using Cohen’s *h* formula:

$$h = \left| {\varphi_{1} - \varphi_{2} } \right|.$$

(4)

Similar to Cohen’s *d*, the absolute value was computed for Cohen’s *h*, as no directional assumptions were made. However, unlike Cohen’s *d*, a variance estimate is not readily available for Cohen’s *h*. Thus, to obtain some measure of variability, heterogeneity was summarized via the standard deviation of the effect sizes. For interpretation purposes, Cohen’s (1988) guidelines were adopted, in which an *h* value between 0.2 and 0.5 indicates a small effect size, a value between 0.5 and 0.8 a medium effect size, and a value greater than 0.8 a large effect size. Across proportional outcome variables, Cohen’s *h* was calculated using the *pwr* package in *R* (Champely, 2020).
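Equations (3) and (4) likewise reduce to a few lines (the authors used the *pwr* R package; this Python sketch is illustrative):

```python
# Sketch of Eq. (3): the arcsine transformation of a proportion, and
# Eq. (4): absolute Cohen's h for the difference between two proportions.
import math

def arcsine_transform(p):
    return 2 * math.asin(math.sqrt(p))

def cohens_h(p1, p2):
    return abs(arcsine_transform(p1) - arcsine_transform(p2))
```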

Finally, correlation coefficients were reported for the average CTT item discrimination (item-total correlations). Although the correlation coefficient can serve as an effect size on its own, Fisher’s *z* transformation was applied to every correlation to normalize the sampling distribution using the *metafor* package in R (Viechtbauer, 2020). This transformation was applied as:

$$z=0.5*\mathrm{ln}\left(\frac{1+r}{1-r}\right),$$

(5)

and the variance of the transformation was calculated as:

$$v_{z} = \frac{1}{n - 3}.$$

(6)

Then the effect size difference for each threshold pair was calculated using Cohen’s *q* index (Cohen, 1988):

$$q=\left|{z}_{1}-{z}_{2}\right|,$$

(7)

while the variance for this index was calculated as:

$$var\left(q\right)=\frac{1}{{N}_{1}-3}+\frac{1}{{N}_{2}-3},$$

(8)

where \({N}_{1}\) and \({N}_{2}\) are the sample sizes underlying the correlations for threshold 1 and threshold 2, respectively. For this effect size, Cohen (1988) proposed the following categories for interpreting *q* values: no effect: < 0.10; small effect: 0.10–0.29; moderate effect: 0.30–0.50; and large effect: > 0.50.
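Equations (5) through (8) can be sketched as follows (the authors used *metafor* in R; names here are illustrative):

```python
# Sketch of Eq. (5): Fisher's z transformation, Eq. (6): its variance,
# Eq. (7): absolute Cohen's q, and Eq. (8): the variance of q.
import math

def fisher_z(r):
    return 0.5 * math.log((1 + r) / (1 - r))

def var_z(n):
    return 1 / (n - 3)

def cohens_q(r1, r2):
    return abs(fisher_z(r1) - fisher_z(r2))

def var_q(n1, n2):
    return 1 / (n1 - 3) + 1 / (n2 - 3)
```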

#### Estimating average effect sizes and evaluating effect size heterogeneity

Prior to estimating average effect sizes and effect size heterogeneity, the effect sizes of each outcome were diagnosed for potential outliers. Outliers were defined as any estimated effect size greater than three standard deviations (based on the absolute value) from the mean effect size of the given outcome. To avoid the loss of data, any identified outlier was down-weighted to a value equal to three standard deviations from the mean. A sensitivity analysis was then conducted to evaluate the impact of the identified outliers on the estimation of the mean effect sizes for each dependent variable. If any inflation or deflation of the mean effect size under study was observed, the adjusted effect size estimates were used for all subsequent analyses.
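The outlier adjustment described above amounts to winsorizing at three standard deviations; a hypothetical sketch:

```python
# Sketch of the outlier adjustment: any effect size beyond three standard
# deviations from the mean is pulled back to exactly mean +/- 3 SD,
# retaining the observation rather than discarding it.
import statistics

def winsorize_outliers(effect_sizes, n_sd=3):
    mean = statistics.mean(effect_sizes)
    sd = statistics.stdev(effect_sizes)
    lo, hi = mean - n_sd * sd, mean + n_sd * sd
    return [min(max(es, lo), hi) for es in effect_sizes]
```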

For continuous and correlational outcome variables, an intercept-only random-effects meta-regression model was run in the *robumeta* package in *R* (Fisher et al., 2017) to calculate the average effect size and effect size heterogeneity. To avoid artificially reducing variance estimates and inflating Type I error due to effect size dependencies (i.e., multiple effect sizes are produced by comparing various response time thresholds from the same study; Borenstein et al., 2009), the robust variance estimation (RVE) procedure developed by Hedges et al. (2010) was employed. The heterogeneity of effect sizes was investigated using the *I*^{2} statistic:

$${I}^{2}=\left(\frac{Q-k}{Q}\right)\times 100\mathrm{\%},$$

(9)

where \(Q\) is a homogeneity statistic that represents the degree to which the random-effects variance differs from 0, and \(k\) is the number of studies. Higgins and Thompson (2002) proposed guidelines for this statistic, with *I*^{2} values less than 50% indicating small heterogeneity, values between 50% and 75% representing medium heterogeneity, and values greater than 75% reflecting large heterogeneity.
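Eq. (9) translates directly to code; the sketch below follows the formula as written, with \(k\) the number of studies, and truncates negative values at 0%, as is conventional when \(Q\) falls below the subtracted term:

```python
# Sketch of Eq. (9): I-squared as the percentage of total variation in
# effect sizes attributable to heterogeneity rather than sampling error,
# truncated at 0% when Q < k.
def i_squared(q, k):
    return max(0.0, (q - k) / q) * 100
```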

As variance estimates for dependent variables that reported only proportional data were not available, classical approaches to calculating average effect sizes and heterogeneity were taken. These consisted of respectively computing the mean and standard deviation of the effect sizes for the outcome under investigation.

#### Moderator analyses

For continuous and correlational data, moderator analyses were conducted for any outcome that was found to have a large degree of heterogeneity.^{Footnote 6} This was done by estimating the following random-effects meta-regression model:

$$\widehat{y}={b}_{0}+{b}_{1}\left(age\right)+{b}_{2}\left(test\,subject\right)+{b}_{3}\left(test\,length\right)+{b}_{4}\left(RT{-}RT\right)+{b}_{5}\left(RTRA{-}RTRA\right)+{b}_{6}\left(NED{-}RT\right)+{b}_{7}\left(NED{-}RTRA\right)+{b}_{8}\left(RT{-}RTRA\right)+{b}_{9}\left(threshold\,pairing\,variability\right)+e,$$

(10)

where \(\widehat{y}\) was equal to one of the continuous or correlational outcome variables of interest (test performance, IRT item discrimination parameter, IRT item difficulty parameter, item-total correlation), \({b}_{0}\) was equal to the average effect size for the outcome variable holding all included variables constant, *age* and *test length* were continuous variables, *test subject* was coded dichotomously as “non-STEM subject” or “STEM or mixed subject” (reference group), \({b}_{4}\) through \({b}_{8}\) were dummy-coded variables for RT–RT, RTRA–RTRA, NED–RT, NED–RTRA and RT–RTRA (NED–NED served as the reference group), *threshold pairing variability* was coded as between threshold typologies or within a threshold typology (reference group), and *e* was the residual term. The moderator analyses were conducted in the *R* package *metafor* (Viechtbauer, 2020).
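As a simplified illustration of the model in Eq. (10), an inverse-variance-weighted least-squares fit can be written in a few lines of Python. Note that the authors estimated a random-effects model with robust variance estimation in *robumeta*/*metafor*; this sketch omits the random-effects variance component and RVE, and all names are hypothetical:

```python
# Simplified meta-regression sketch: inverse-variance weighted least squares.
# y: vector of effect sizes; X: design matrix whose first column of ones
# carries the intercept b0, followed by moderator columns (age, test subject,
# test length, five pairing dummies, pairing variability); v: sampling variances.
import numpy as np

def meta_regression(y, X, v):
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    w = 1.0 / np.asarray(v, dtype=float)   # inverse-variance weights
    # Solve the weighted normal equations (X' W X) b = X' W y
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
```

With an intercept-only design matrix, this reduces to the inverse-variance-weighted mean effect size.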