16  Reliability of ROAR-Sentence

ROAR-Sentence is a timed measure, and the score is computed as the number of correct trials minus the number of incorrect trials within the allotted time window. Originally, ROAR-Sentence was 3 minutes long, but Yeatman et al. (2024) demonstrated that cutting the administration time in half, to 90 seconds, had very little impact on the reliability and validity of the measure. ROAR-Sentence consists of a collection of equated test forms in which sentences are presented in a fixed order. We first report our methodology for equating test forms (Section 16.1), then describe criteria for flagging unreliable scores (Section 16.2), and finally report alternate form reliability (Section 16.3).
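To make the scoring rule concrete, here is a minimal sketch in Python. This is not the production ROAR implementation; the Trial type and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    correct: bool       # whether the response matched the validated answer
    timestamp_s: float  # seconds elapsed since the timed block began

def roar_sentence_score(trials: list[Trial], window_s: float = 90.0) -> int:
    """Score = correct trials minus incorrect trials within the timed window."""
    in_window = [t for t in trials if t.timestamp_s <= window_s]
    n_correct = sum(t.correct for t in in_window)
    return n_correct - (len(in_window) - n_correct)
```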

16.1 Equating ROAR-Sentence test forms

ROAR-Sentence consists of multiple parallel forms co-developed by human researchers and generative AI. Generative AI greatly reduces the time and resources required to create multiple test forms, and Zelikman et al. (2023) showed that the quality of AI-generated forms is highly comparable to that of forms created by humans. These test forms were equated using equipercentile equating through the equate package.
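The published analyses used the equate package; as a rough illustration of the underlying idea, the sketch below implements a basic equipercentile mapping in Python. This is a simplified sketch, not the equate package's algorithm, which also supports presmoothing and other refinements.

```python
import numpy as np

def equipercentile_equate(scores_x, scores_y, raw_points):
    """Map raw scores on form X onto the form Y scale by matching percentile ranks."""
    scores_x = np.asarray(scores_x)
    # Mid-percentile rank of each raw score point in the form X distribution.
    pr = np.array([(np.sum(scores_x < x) + 0.5 * np.sum(scores_x == x)) / scores_x.size
                   for x in raw_points])
    # The form Y score sitting at each of those percentile ranks.
    return np.quantile(np.asarray(scores_y), pr)
```

For example, with a hypothetical 0-60 raw-score range, equipercentile_equate(form_a_scores, form_b_scores, np.arange(61)) would return the form B equivalent of each raw score on form A.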

Figure 16.1 shows the equated scores across the test forms. Equating enables ROAR-Sentence to randomly select from multiple available test forms, thereby minimizing the potential for practice effects that would arise if students encountered the same test form across different testing windows. Figure 16.2 provides a separate plot showing the standard error of the equipercentile equating.

Figure 16.1: Equated scores for lab and parallel AI forms
Figure 16.2: Standard error of equating

16.2 Criteria for identifying disengaged participants and flagging unreliable scores

ROAR-Sentence is designed to be fully automated: reading is done silently, responses are non-verbal, instructions and practice trials are narrated by characters, and scoring is done automatically in real time. This makes it possible to efficiently assess an entire district simultaneously. A concern with automated assessments is that, without a teacher to individually administer items, monitor, and score responses, some students might disengage and provide data that does not represent their true ability. For a measure like ROAR-Sentence, where items are designed and validated to have an unambiguous and clear answer, disengaged participants can be detected based on fast and inaccurate responses. Our approach to identifying and flagging disengaged participants with unreliable scores was published in Yeatman et al. (2024). Figure 16.3 shows median response time (RT) versus proportion correct for each participant. Most participants were highly accurate (>90% correct responses). However, the distribution was bimodal, indicating a small group of participants who performed around chance. These participants also had extremely fast response times.

Criteria for flagging unreliable scores

Participants with a median response time <1,000 ms AND low accuracy (<65% correct) are flagged as unreliable in ROAR score reports and are excluded from analyses, since these scores do not accurately represent the participant’s ability. Teachers can choose whether to re-administer ROAR or to interpret the data cautiously in relation to other data sources and contextual factors.
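A minimal sketch of this flagging rule, assuming per-participant trial-level response times and correctness are available (function and variable names are illustrative):

```python
import statistics

RT_CUTOFF_MS = 1000   # median response time cutoff from the criteria above
ACC_CUTOFF = 0.65     # accuracy cutoff from the criteria above

def flag_unreliable(rts_ms: list[float], correct: list[bool]) -> bool:
    """True when a participant's responses are both extremely fast AND near chance."""
    median_rt = statistics.median(rts_ms)
    accuracy = sum(correct) / len(correct)
    return median_rt < RT_CUTOFF_MS and accuracy < ACC_CUTOFF
```

Note that both conditions must hold: fast but accurate responders are efficient readers, and slow, inaccurate responders may simply be struggling readers, so neither group is flagged.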

Figure 16.3: Criteria for identifying disengaged participants and flagging unreliable scores on ROAR-Sentence. Participants displaying extremely rapid responses performed near chance on ROAR-Sentence. These criteria are consistent across multiple studies (Yeatman et al. 2024). Black lines indicate the cutoffs for flagging disengaged participants with unreliable scores.

16.3 Alternate form reliability

Alternate form reliability is computed as the Pearson correlation between scores on equated test forms administered during the same testing session. Figure 16.4 (a) shows student scores on alternate test forms combined across grades, and Figure 16.4 (b) shows separate plots for each grade. Table 16.1 reports alternate form reliability for the full sample and separately by grade, and Table 16.2 reports it separately by race/ethnicity.
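The computation itself is straightforward; a minimal sketch, assuming paired equated scores and grade labels (names are illustrative):

```python
import numpy as np

def alternate_form_reliability(form_a, form_b) -> float:
    """Pearson correlation between equated scores on two forms from one session."""
    return float(np.corrcoef(form_a, form_b)[0, 1])

def reliability_by_grade(grades, form_a, form_b) -> dict:
    """Alternate form reliability computed separately within each grade."""
    grades, form_a, form_b = map(np.asarray, (grades, form_a, form_b))
    return {g: alternate_form_reliability(form_a[grades == g], form_b[grades == g])
            for g in np.unique(grades)}
```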

(a) Alternate form reliability across grades
(b) Alternate form reliability separately by grade
Figure 16.4: Alternate form reliability for ROAR-Sentence
Grade   Alternate Form Reliability      N
All     0.82                         7186
1       0.80                          133
2       0.89                          316
3       0.90                          364
4       0.85                           84
5       0.78                           83
6       0.82                          540
7       0.80                          413
8       0.80                          543
9       0.75                         1468
10      0.79                         1280
11      0.77                         1079
12      0.74                          883
Table 16.1: Alternate form reliability for ROAR-Sentence
Race/Ethnicity   N     Correlation
Asian            664   0.889
Hispanic          64   0.946
Multiracial      398   0.890
White            730   0.905
Table 16.2: Alternate form reliability for ROAR-Sentence by Race/Ethnicity

References

Yeatman, Jason D., Jasmine E. Tran, Amy K. Burkhardt, Wanjing A. Ma, Jamie L. Mitchell, Maya Yablonski, Liesbeth Gijbels, Carrie Townley-Flores, and Adam Richie-Halford. 2024. “Development and Validation of a Rapid and Precise Online Sentence Reading Efficiency Assessment.” Frontiers in Education 9: 1494431. https://doi.org/10.3389/feduc.2024.1494431.
Zelikman, Eric, Wanjing Anya Ma, Jasmine E. Tran, Diyi Yang, and Jason D. Yeatman. 2023. “Generating and Evaluating Tests for K-12 Students with Language Model Simulations: A Case Study on Sentence Reading Efficiency.” In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, edited by Houda Bouamor, Juan Pino, and Kalika Bali, 2190–2205. Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.135.