15 Reliability of ROAR-Word
15.1 Background: Published studies
The first published version of ROAR-Word achieved exceptional alternate form reliability (r=0.95) using fixed forms that were equated based on item response theory (Yeatman et al. 2021). To improve the efficiency of ROAR-Word, Ma et al. (2023) built the first open-source computer adaptive testing (CAT) algorithm in JavaScript (jsCAT), and then ran a series of experiments to study how the reliability and efficiency of ROAR-Word could be improved with CAT. Figure 15.1 reproduces a figure from Ma et al. (2023) showing an experiment comparing ROAR-CAT to a standard, non-adaptive testing approach. In this experiment, participants were randomly assigned to complete ROAR-Word with the trial order controlled by either (a) jsCAT (solid line) or (b) random item sampling (dotted line). ROAR-CAT achieved the same reliability in roughly 40% fewer trials.
This innovation has now been incorporated into all the ROAR measures to create quick and efficient, adaptive assessments that span broad age ranges.
15.2 Criteria for identifying disengaged participants and flagging unreliable scores
ROAR-Word is designed to be fully automated: instructions and practice trials are narrated by characters, words are read silently, responses are non-verbal, and scoring is done in real time after each response. This makes it child-friendly, eliminates issues related to inter-rater reliability, and makes it possible to efficiently assess a whole school district simultaneously. However, a concern about automated assessments is that without a teacher to individually administer items, monitor, and score responses, some students might disengage and provide data that is not representative of their true ability. One benefit of a lexical decision task is that there is an extensive literature on the expected response time distribution (Balota, Yap, and Cortese 2006; Keuleers, Lacey, and Rastle 2012; Balota et al. 2007). Given the amount of time it takes for signals from the eye to reach the brain, for the visual features to be processed, for the word to be recognized, and for a motor response to be initiated, extremely fast response times are most likely due to rapid guessing behavior indicative of disengagement from the assessment (Ratcliff, McKoon, and Gomez 2004; Balota, Yap, and Cortese 2006). Our previous publications have validated fast response time as an indicator of participant disengagement (Ma et al. 2023; Yeatman et al. 2021). This effect can be seen in Figure 15.2, which shows a plot of median response time (RT) versus proportion correct for each participant. None of the participants with a median response time less than 450 ms (horizontal black line in Figure 15.2) are accurate on ROAR-Word. Because ROAR-Word runs as a computer adaptive test (CAT), item difficulty changes adaptively based on participant responses, so all engaged participants should be around 75% correct. Participants who respond very quickly and inaccurately are disengaged and are not providing data that is representative of their true ability.
Participants with low accuracy (<65% correct) and a median response time below 450 ms are flagged in ROAR-Score reports, and their data is excluded from analyses. Teachers can choose whether to re-administer ROAR or to interpret the data cautiously in relation to other data sources and contextual factors.
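The flagging rule described above can be illustrated with a minimal Python sketch. The two thresholds (65% correct and a 450 ms median response time) come from the text; the function name `is_flagged`, the constant names, and the input format (one boolean and one response time in milliseconds per trial) are assumptions made for illustration and do not describe the ROAR scoring code itself.

```python
from statistics import median

# Thresholds taken from the text; the constant names are hypothetical.
ACCURACY_THRESHOLD = 0.65       # flag only if proportion correct is below this
MEDIAN_RT_THRESHOLD_MS = 450    # flag only if median response time is below this


def is_flagged(correct, rt_ms):
    """Return True when a run shows both low accuracy and a fast median RT.

    correct: list of booleans, one per trial (True = correct response)
    rt_ms:   list of response times in milliseconds, one per trial
    """
    if not correct or len(correct) != len(rt_ms):
        raise ValueError("correct and rt_ms must be non-empty and equal length")
    accuracy = sum(correct) / len(correct)
    # Both conditions must hold: fast responding alone is not enough.
    return accuracy < ACCURACY_THRESHOLD and median(rt_ms) < MEDIAN_RT_THRESHOLD_MS


# A fast, inaccurate run is flagged; a slower, accurate run is not.
disengaged = is_flagged([True, False, False, True], [300, 320, 280, 350])
engaged = is_flagged([True, True, True, False], [900, 1100, 800, 950])
```

Requiring both conditions means that a fast but accurate reader, or a slow reader who struggles with the items, is not flagged by this sketch.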
15.3 Reliability of computer adaptive ROAR-Word
ROAR-Word runs as a computer adaptive test based on a Rasch model. The current default version of ROAR-Word takes about 4 minutes (84 items). More items can be administered for a more precise measure, or fewer items can be administered as a quick screener. Table 15.1 reports marginal reliability computed based on data from 10294 students under the IRT model for the standard 84-item version of ROAR-Word. Reliability (\(\rho_{xx^\prime}\)) is computed based on the estimated variance of \(\hat{\theta}\) relative to the estimated error variance, that is, the squared standard error (\(\widehat{SE}(\hat{\theta})^2\)), using Equation 15.1:
\[ \hat{\rho}_{xx^\prime} = \frac{\widehat{VAR}(\hat{\theta})}{\widehat{VAR}(\hat{\theta}) + \widehat{SE}(\hat{\theta})^2}. \tag{15.1}\]
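As a rough illustration of the reliability formula above, the following Python sketch computes the ratio from per-student ability estimates and standard errors. The text does not say how the single \(\widehat{SE}(\hat{\theta})^2\) term is aggregated across students, so the sketch assumes it is the mean of the squared per-student standard errors, and it assumes the sample variance (n - 1 denominator) for \(\widehat{VAR}(\hat{\theta})\). The function name `empirical_reliability` and the toy inputs are hypothetical.

```python
from statistics import mean, variance


def empirical_reliability(theta_hat, se):
    """Reliability as VAR(theta_hat) / (VAR(theta_hat) + mean(SE^2)).

    theta_hat: list of per-student ability estimates
    se:        list of per-student standard errors of those estimates

    Assumptions (not stated in the text): the error term is the mean of the
    squared standard errors, and VAR is the sample variance.
    """
    if len(theta_hat) != len(se) or len(theta_hat) < 2:
        raise ValueError("need at least two students and matching lengths")
    var_theta = variance(theta_hat)             # sample variance of the estimates
    mean_se_sq = mean(s ** 2 for s in se)       # average error variance
    return var_theta / (var_theta + mean_se_sq)


# Toy data: variance of the estimates is 1.0 and mean SE^2 is 0.25,
# so the ratio is 1.0 / 1.25 = 0.8.
example = empirical_reliability([-1.0, 0.0, 1.0], [0.5, 0.5, 0.5])
```

Under this formula, reliability approaches 1 as the standard errors shrink relative to the spread of ability estimates, which is why administering more items (and thus lowering the standard errors) yields a more precise measure.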
Grade | Empirical Reliability | N |
---|---|---|
All | 0.94 | 10294 |
K | 0.87 | 131 |
1 | 0.92 | 1050 |
2 | 0.93 | 1123 |
3 | 0.94 | 572 |
4 | 0.94 | 320 |
5 | 0.94 | 315 |
6 | 0.92 | 1000 |
7 | 0.91 | 846 |
8 | 0.92 | 716 |
9 | 0.91 | 1347 |
10 | 0.91 | 1243 |
11 | 0.91 | 932 |
12 | 0.91 | 699 |
To ensure that ROAR-Word is a fair and equitable assessment across different demographic groups, we also report reliability separately by gender (Table 15.2), eligibility for free or reduced-price lunch (Table 15.3), English learner status as designated by the school district (Table 15.4), primary language (Table 15.5), special education status (Table 15.6), ethnicity (Table 15.7), and race (Table 15.8).
Gender | Empirical Reliability | N |
---|---|---|
All | 0.94 | 3733 |
F | 0.94 | 1800 |
M | 0.94 | 1933 |
Free/Reduced Lunch Status | Empirical Reliability | N |
---|---|---|
All | 0.94 | 1949 |
Free | 0.93 | 409 |
Paid | 0.94 | 1390 |
Reduced | 0.93 | 150 |
English Learner Status | Empirical Reliability | N |
---|---|---|
All | 0.94 | 2368 |
English Learner | 0.95 | 897 |
English Only | 0.94 | 1180 |
Initial Fluent English Proficient | 0.94 | 213 |
Reclassified Fluent English Proficient | 0.93 | 76 |
Primary Language | Empirical Reliability | N |
---|---|---|
All | 0.94 | 1916 |
English | 0.94 | 1396 |
Other | 0.92 | 188 |
Spanish | 0.92 | 332 |
Special Education Status | Empirical Reliability | N |
---|---|---|
All | 0.94 | 2046 |
No | 0.95 | 1874 |
Yes | 0.95 | 172 |
Hispanic Ethnicity | Empirical Reliability | N |
---|---|---|
All | 0.94 | 4246 |
No | 0.94 | 2580 |
Yes | 0.94 | 1666 |
Race | Empirical Reliability | N |
---|---|---|
All | 0.94 | 3187 |
Asian | 0.94 | 439 |
Black or African American | 0.93 | 37 |
Hispanic | 0.94 | 1666 |
Multiracial | 0.94 | 249 |
White | 0.95 | 776 |