16 Reliability of ROAR-Sentence
ROAR-Sentence is a timed measure: the score is computed as the number of correct trials minus the number of incorrect trials completed within the allotted time window. Originally, ROAR-Sentence was 3 minutes long, but Yeatman et al. (2024) demonstrated that cutting the time in half, to 90 seconds, had very little impact on the reliability and validity of the measure. ROAR-Sentence consists of a collection of equated test forms in which sentences are presented in a fixed order. We first report our methodology for equating test forms (Section 16.1), then describe criteria for flagging unreliable scores (Section 16.2), and finally report alternate form reliability (Section 16.3).
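As a concrete illustration of the scoring rule, the sketch below computes the guessing-corrected score from trial-level data. The data frame and column names (`correct`, `time_elapsed`) are hypothetical, not the production ROAR implementation.

```r
# Guessing-corrected ROAR-Sentence score: number of correct trials
# minus number of incorrect trials completed within the time window.
# `responses` is a hypothetical data frame with a logical `correct`
# column and a `time_elapsed` column in seconds.
roar_sentence_score <- function(responses, time_limit = 90) {
  completed <- responses[responses$time_elapsed <= time_limit, ]
  sum(completed$correct) - sum(!completed$correct)
}
```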
16.1 Equating ROAR-Sentence test forms
ROAR-Sentence consists of multiple parallel forms co-developed by human researchers and generative AI. Generative AI greatly reduces the time and resources required to create multiple test forms, and Zelikman (2023) has shown that the quality of AI-generated forms is highly comparable to that of forms created by humans. These test forms were equated using equipercentile equating with the equate package in R.
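As a minimal sketch of the equating step, assuming placeholder score distributions rather than actual ROAR-Sentence data, the equate package can be used as follows:

```r
library(equate)

# Placeholder raw scores on two parallel forms (illustrative only;
# the actual ROAR-Sentence score scale and distributions differ).
set.seed(1)
form_x <- freqtab(sample(0:40, 500, replace = TRUE), scales = 0:40)
form_y <- freqtab(sample(0:40, 500, replace = TRUE), scales = 0:40)

# Equate scores on form Y to the scale of form X using
# equipercentile equating.
eq <- equate(form_y, form_x, type = "equipercentile")
summary(eq)
```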
Figure 16.1 (a) shows the equated scores across the test forms. Equating enables ROAR-Sentence to randomly select from multiple available test forms, thereby minimizing the practice effects that arise when students encounter the same test form across different testing windows. Figure 16.2 shows the standard error of the equipercentile equating.
16.2 Criteria for identifying disengaged participants and flagging unreliable scores
ROAR-Sentence is designed to be fully automated: reading is done silently, responses are non-verbal, instructions and practice trials are narrated by characters, and scoring is done automatically in real time. This makes it possible to efficiently assess a whole district simultaneously. A concern about automated assessments is that, without a teacher to individually administer items, monitor, and score responses, some students might disengage and provide data that is not representative of their true ability. For a measure like ROAR-Sentence, where items are designed and validated to have a clear and unambiguous answer, disengaged participants can be detected based on fast and inaccurate responses. Our approach to identifying and flagging disengaged participants with unreliable scores was published in Yeatman et al. (2024). Figure 16.3 shows median response time (RT) versus proportion correct for each participant. Most participants were highly accurate (>90% correct responses). However, the distribution was bimodal, indicating a small group of participants who were performing around chance; these participants also had extremely fast response times.
Participants with a median response time <1,000 ms AND low accuracy (<65% correct) are flagged as unreliable in ROAR score reports and are excluded from analyses, since their scores do not accurately represent their ability. Teachers can choose whether to re-administer ROAR or to interpret the data cautiously in relation to other data sources and contextual factors.
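A minimal sketch of this flagging rule, assuming a trial-level data frame with hypothetical columns `pid` (participant ID), `rt_ms` (response time in milliseconds), and `correct` (logical):

```r
library(dplyr)

# Flag participants whose response pattern suggests disengagement:
# median response time under 1,000 ms AND accuracy under 65%.
flag_unreliable <- function(trials) {
  trials |>
    group_by(pid) |>
    summarize(median_rt = median(rt_ms),
              prop_correct = mean(correct)) |>
    mutate(unreliable = median_rt < 1000 & prop_correct < 0.65)
}
```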
16.3 Alternate form reliability
Alternate form reliability is computed as the Pearson correlation between scores on equated test forms administered during the same testing session. Figure 16.4 (a) shows student scores on alternate test forms for all grades combined, and Figure 16.4 (b) shows separate plots for each grade. Table 16.1 reports alternate form reliability for the full sample and separately by grade.
| Grade | Alternate Form Reliability | N |
|---|---|---|
| All | 0.82 | 7186 |
| 1 | 0.80 | 133 |
| 2 | 0.89 | 316 |
| 3 | 0.90 | 364 |
| 4 | 0.85 | 84 |
| 5 | 0.78 | 83 |
| 6 | 0.82 | 540 |
| 7 | 0.80 | 413 |
| 8 | 0.80 | 543 |
| 9 | 0.75 | 1468 |
| 10 | 0.79 | 1280 |
| 11 | 0.77 | 1079 |
| 12 | 0.74 | 883 |
Alternate form reliability was also computed separately by race/ethnicity:

| Race/Ethnicity | N | Alternate Form Reliability |
|---|---|---|
| Asian | 664 | 0.889 |
| Hispanic | 64 | 0.946 |
| Multiracial | 398 | 0.890 |
| White | 730 | 0.905 |
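A minimal sketch of how these reliability estimates could be computed, assuming a data frame `scores` with hypothetical columns `grade`, `form_a`, and `form_b` holding equated scores from the same testing session:

```r
library(dplyr)

# Alternate form reliability: Pearson correlation between equated
# scores on two forms administered in the same testing session,
# computed within each grade.
reliability_by_grade <- scores |>
  group_by(grade) |>
  summarize(r = cor(form_a, form_b, use = "complete.obs"),
            n = n())
```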