17  Reliability of ROAR-Sentence

ROAR-Sentence is a timed measure, and the score is computed as the number of correct trials minus the number of incorrect trials completed within the allotted time window. Originally, ROAR-Sentence was 3 minutes long, but Yeatman et al. (2024) demonstrated that cutting the time in half, to 90 seconds, had very little impact on the reliability and validity of the measure. ROAR-Sentence consists of a collection of equated test forms in which sentences are presented in a fixed order. We first report our methodology for equating test forms (Section 17.1), then our criteria for flagging unreliable scores (Section 17.2), and finally alternate form reliability (Section 17.3).
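Written out, the scoring rule described above is simply:

```latex
\text{score} = N_{\text{correct}} - N_{\text{incorrect}}
```

where \(N_{\text{correct}}\) and \(N_{\text{incorrect}}\) are the numbers of correct and incorrect trials completed within the 90-second window.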

17.1 Equating ROAR-Sentence test forms

ROAR-Sentence consists of multiple parallel forms co-developed by human researchers and generative AI. Generative AI greatly reduces the time and resources required to create multiple test forms, and Zelikman et al. (2023) showed that the quality of AI-generated forms is highly comparable to that of forms created by humans. The test forms were equated using equipercentile equating, implemented with the equate package in R.
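To illustrate the idea behind equipercentile equating, the sketch below matches percentile ranks across the two score distributions. This is a conceptual illustration only, not the analysis code: the actual equating used the equate package, which also handles smoothing and continuization, and the score distributions here are made up.

```python
import numpy as np

def equipercentile_equate(scores_x, scores_y, x_values):
    """Map raw scores on form X to the form-Y scale by matching percentile ranks.

    scores_x, scores_y : observed raw scores on the two forms (illustrative data).
    x_values           : form-X score points to convert.
    """
    scores_x = np.sort(np.asarray(scores_x, dtype=float))
    scores_y = np.sort(np.asarray(scores_y, dtype=float))

    # Percentile rank of each x value within the form-X distribution.
    ranks = np.searchsorted(scores_x, x_values, side="right") / len(scores_x)

    # Find the form-Y score at the same percentile rank.
    return np.quantile(scores_y, np.clip(ranks, 0.0, 1.0))

# Example with hypothetical score distributions for two 90-second forms.
rng = np.random.default_rng(0)
form_a = rng.normal(30, 10, 500).round()   # hypothetical lab form scores
form_b = rng.normal(28, 11, 500).round()   # hypothetical AI form scores
print(equipercentile_equate(form_a, form_b, x_values=[10, 25, 40]))
```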

Figure 17.1 shows the equated scores across the test forms. Equating enables ROAR-Sentence to randomly select from multiple available test forms, thereby minimizing the potential for practice effects when students encounter the same test form across different testing windows. Figure 17.2 shows the standard error of the equipercentile equating.

Figure 17.1: Equated scores for lab and parallel AI forms
Figure 17.2: Standard error of equating

17.2 Criteria for identifying disengaged participants and flagging unreliable scores

ROAR-Sentence is designed to be fully automated: reading is done silently, responses are non-verbal, instructions and practice trials are narrated by characters, and scoring is done automatically in real time. This makes it possible to efficiently assess a whole district simultaneously. A concern about automated assessments is that, without a teacher to individually administer items, monitor, and score responses, some students might disengage and provide data that are not representative of their true ability. For a measure like ROAR-Sentence, where items are designed and validated to have an unambiguous and clear answer, disengaged participants can be detected based on fast and inaccurate responses. Our approach to identifying and flagging disengaged participants with unreliable scores was published in Yeatman et al. (2024). Figure 17.3 shows a plot of median response time (RT) versus proportion correct for each participant. Most participants were very accurate (>90% correct responses). However, there was a bimodal distribution indicating a small group of participants who performed around chance; these participants also had extremely fast response times.

Criteria for flagging unreliable scores

Participants with a median response time <1,000 ms AND low accuracy (<65% correct) are flagged as having unreliable scores in ROAR score reports and are excluded from analyses, since their scores do not accurately represent their ability. Teachers can choose whether to re-administer ROAR or to interpret the data cautiously in relation to other data sources and contextual factors.
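A minimal sketch of how such a rule could be applied to per-trial data is shown below. The data frame and column names are hypothetical, not the actual ROAR pipeline; only the two thresholds (median RT < 1,000 ms and accuracy < 65%) come from the criteria above.

```python
import pandas as pd

# Hypothetical trial-level data; column names are illustrative only.
trials = pd.DataFrame({
    "participant_id": ["a", "a", "a", "b", "b", "b"],
    "rt_ms":          [1450, 1600, 1380, 420, 390, 450],
    "correct":        [1, 1, 0, 0, 1, 0],
})

# Summarize each participant's median RT and proportion correct.
summary = trials.groupby("participant_id").agg(
    median_rt_ms=("rt_ms", "median"),
    prop_correct=("correct", "mean"),
)

# Flag participants who respond both very quickly and near chance.
summary["unreliable"] = (summary["median_rt_ms"] < 1000) & (summary["prop_correct"] < 0.65)
print(summary)
```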

Figure 17.3: Criteria for identifying disengaged participants and flagging unreliable scores on ROAR-Sentence. Participants displaying extremely rapid responses performed near chance on ROAR-Sentence. These criteria are consistent across multiple studies (Yeatman et al. 2024). Black lines indicate the cutoffs for flagging disengaged participants with unreliable scores.

17.3 Alternate form reliability

Alternate form reliability is computed as the Pearson correlation between scores on equated test forms administered during the same testing session. Figure 17.4 (a) shows student scores on alternate test forms with grades combined, and Figure 17.4 (b) shows separate plots for each grade. Table 17.1 reports alternate form reliability for the full sample and separately by grade.
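As a sketch of this computation, the snippet below computes the Pearson correlation between two equated form scores, overall and by grade, in the spirit of Table 17.1. The data frame and column names are hypothetical placeholders, not the actual ROAR data.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical data: one row per student, scores on two equated forms.
df = pd.DataFrame({
    "grade":  [3, 3, 3, 3, 4, 4, 4, 4],
    "form_a": [22, 35, 41, 18, 30, 44, 27, 38],
    "form_b": [25, 33, 45, 20, 28, 47, 25, 40],
})

# Overall alternate form reliability.
r_all, _ = pearsonr(df["form_a"], df["form_b"])
print(f"All grades: r = {r_all:.2f}, N = {len(df)}")

# Reliability separately by grade.
for grade, g in df.groupby("grade"):
    r, _ = pearsonr(g["form_a"], g["form_b"])
    print(f"Grade {grade}: r = {r:.2f}, N = {len(g)}")
```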

(a) Alternate form reliability across grades
(b) Alternate form reliability separately by grade
Figure 17.4: Alternate form reliability for ROAR-Sentence
Grade Alternate Form Reliability N
All 0.82 7261
1 0.80 133
2 0.90 310
3 0.90 357
4 0.85 84
5 0.77 85
6 0.82 540
7 0.80 413
8 0.80 543
9 0.75 1534
10 0.79 1300
11 0.77 1079
12 0.74 883
Table 17.1: Alternate form reliability for ROAR-Sentence by Grade
Gender Alternate Form Reliability N
All 0.82 23976
Female 0.82 12164
Male 0.78 11812
Table 17.2: Alternate form reliability for ROAR-Sentence by Gender
Free/Reduced Lunch Status Alternate Form Reliability N
All 0.82 2590
Free/Reduced 0.90 512
Paid 0.89 2078
Table 17.3: Alternate form reliability for ROAR-Sentence by FRL Status
English Language Learner Status Alternate Form Reliability N
All 0.82 3882
English Learner 0.83 492
English Only 0.88 2794
Initial Fluent English Proficient 0.86 468
Reclassified Fluent English Proficient 0.89 128
Table 17.4: Alternate form reliability for ROAR-Sentence by EL Status
Home Language Alternate Form Reliability N
All 0.82 1370
English 0.88 1006
Spanish 0.78 364
Table 17.5: Alternate form reliability for ROAR-Sentence by Home Language
Special Education Alternate Form Reliability N
All 0.81 3130
No 0.90 2924
Yes 0.67 206
Table 17.6: Alternate form reliability for ROAR-Sentence by Special Education Status
Hispanic/Latinx Alternate Form Reliability N
All 0.81 24656
No 0.80 17932
Yes 0.90 6724
Table 17.7: Alternate form reliability for ROAR-Sentence by Hispanic Ethnicity
Race Alternate Form Reliability N
All 0.82 19070
Asian 0.90 2048
Black/African American 0.70 3164
Multiracial 0.79 8036
White 0.87 5822
Table 17.8: Alternate form reliability for ROAR-Sentence by Race

References

Yeatman, Jason D, Jasmine E Tran, Amy K Burkhardt, Wanjing A Ma, Jamie L Mitchell, Maya Yablonski, Liesbeth Gijbels, Carrie Townley-Flores, and Adam Richie-Halford. 2024. “Development and Validation of a Rapid and Precise Online Sentence Reading Efficiency Assessment.” Frontiers in Education 9: 1494431. https://doi.org/10.3389/feduc.2024.1494431.
Zelikman, Eric, Wanjing A Ma, and Jasmine E Tran. 2023. “Generating and Evaluating Tests for K-12 Students with Language Model Simulations: A Case Study on Sentence Reading Efficiency.” In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, edited by Houda Bouamor, Juan Pino, and Kalika Bali, 2190–2205. Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.135.