18  Reliability of ROAR-Letter

18.1 Reliability of calibration sample

A Rasch model was fit to the ROAR-Letter calibration sample (see Table 18.1 and Table 8.1). All ROAR-Letter items fit the model well (see Chapter 4 for fit criteria). Calibration data were obtained from 1986 students who took ROAR-Letter. Two versions of ROAR-Letter were administered: 1217 students took the full, 88-item version (with unlimited time) and 769 students took a shorter version consisting of 10 letter-name items and up to 36 letter-sound items, with testing time limited to 5 minutes. On average, the latter group completed 45 items (10 letter-name and 35 letter-sound items). Based on this IRT model, overall marginal reliability for ROAR-Letter was calculated to be 0.88 using 25 trials from each participant. Figure 18.2 shows the upper and lower bounds on reliability as a function of the number of items that a participant completes.
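For readers less familiar with the Rasch model, the following minimal sketch shows the response function it assumes: the probability of a correct response depends only on the difference between a participant's ability (theta) and the item's difficulty. The function name `rasch_probability` is illustrative and is not taken from the ROAR codebase.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# When ability equals item difficulty, the chance of success is 50%.
print(rasch_probability(0.0, 0.0))

# For the same participant, an easier item (lower difficulty) is more
# likely to be answered correctly than a harder one.
print(rasch_probability(0.0, -1.0) > rasch_probability(0.0, 1.0))
```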

Figure 18.1: ROAR-Letter Reliability (25 trials; item selection: Fisher information; marginal reliability = 0.88)

Figure 18.2: Letter-CAT simulation based on item-response data from 4,041 kindergarten and first-grade students. Items were sampled in 3 different ways, and marginal reliability was calculated as a function of the number of items that each participant completed. The simulation shows that the choice of items has a major impact on the reliability of the measure. For Optimal sampling (green), the N items with difficulty closest to the participant’s theta estimate were used. For Random sampling (orange), a random sample of N items was taken for each participant. For Worst sampling (purple), the N items with difficulty furthest from the participant’s theta estimate were used. This simulation highlights the massive efficiency gain that would be possible from an optimized CAT.
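The three sampling strategies in Figure 18.2 can be sketched as follows. This is an illustrative reading of the caption, not the simulation code behind the figure: the function name `sample_items`, its arguments, and the use of a fixed theta estimate per participant are assumptions.

```python
import random


def sample_items(difficulties, theta, n, strategy, rng=None):
    """Return the indices of n items chosen by the named strategy.

    difficulties: list of Rasch item difficulties (hypothetical input)
    theta:        the participant's theta estimate
    strategy:     "optimal", "random", or "worst"
    """
    indices = list(range(len(difficulties)))
    if strategy == "random":
        rng = rng or random.Random()
        return rng.sample(indices, n)

    # Rank items by how far their difficulty is from the theta estimate.
    by_distance = sorted(indices, key=lambda i: abs(difficulties[i] - theta))
    if strategy == "optimal":
        return by_distance[:n]   # the n closest items
    if strategy == "worst":
        return by_distance[-n:]  # the n furthest items
    raise ValueError(f"unknown strategy: {strategy!r}")


difficulties = [-2.0, -0.5, 0.1, 0.4, 3.0]
print(sample_items(difficulties, theta=0.0, n=2, strategy="optimal"))
print(sample_items(difficulties, theta=0.0, n=2, strategy="worst"))
```

Under the Rasch model, an item is most informative when its difficulty matches the participant's ability, which is why the closest items serve as the upper bound and the furthest items as the lower bound.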

18.2 Reliability of computer-adaptive ROAR-Letter

A computer-adaptive version of ROAR-Letter is planned for release in fall 2024 and will be more efficient and reliable with fewer items. Using data from the calibration sample, we ran a CAT simulation as described in Ma et al. (2023) to determine the final item selection criteria that would maximize reliability in the smallest number of trials (Figure 18.2). After selecting 25 trials as the most efficient number, we simulated a 25-trial computer-adaptive test using participant responses.
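To make the idea of simulating a CAT from recorded responses concrete, here is a minimal sketch. It is not the procedure from Ma et al. (2023): it assumes a Rasch model, selects the unused item with maximum Fisher information at the current theta estimate, and re-estimates theta by EAP on a fixed grid with a standard normal prior. All function names and the grid are illustrative assumptions.

```python
import math

# Assumed theta grid from -4 to 4 in steps of 0.1 (illustrative choice).
GRID = [-4.0 + 0.1 * k for k in range(81)]


def p_correct(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def fisher_information(theta, b):
    """Fisher information of a Rasch item at theta: p * (1 - p)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


def eap_estimate(administered, difficulties):
    """Posterior mean and SD of theta given (item, response) pairs."""
    weights = []
    for t in GRID:
        log_w = -0.5 * t * t  # standard normal prior, up to a constant
        for item, resp in administered:
            p = p_correct(t, difficulties[item])
            log_w += math.log(p if resp == 1 else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, GRID)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, GRID)) / total
    return mean, math.sqrt(var)


def simulate_cat(responses, difficulties, n_trials=25):
    """Replay one participant's recorded responses as an adaptive test.

    responses: dict mapping item index -> 0/1 for items actually answered
    Returns (theta estimate, standard error, order of administered items).
    """
    theta, se = 0.0, 1.0
    administered = []
    remaining = set(responses)
    while remaining and len(administered) < n_trials:
        # Pick the unused item that is most informative at the current theta.
        item = max(remaining,
                   key=lambda i: fisher_information(theta, difficulties[i]))
        remaining.remove(item)
        administered.append((item, responses[item]))
        theta, se = eap_estimate(administered, difficulties)
    return theta, se, [item for item, _ in administered]


difficulties = [-2.0, -1.0, 0.2, 1.0, 2.0]
responses = {0: 1, 1: 1, 2: 1, 3: 0, 4: 0}
theta, se, order = simulate_cat(responses, difficulties, n_trials=3)
print(order, round(theta, 2), round(se, 2))
```

Because only items the participant actually answered can be "administered", a simulation of this kind is limited by the recorded data, which is one reason participants who took the shorter version contribute fewer candidate items.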

Reliability (\(\rho_{xx^\prime}\)) is computed from the estimated variance of \(\hat{\theta}\) relative to the estimated error variance (\(\widehat{SE}(\hat{\theta})^2\)), using Equation 22.1.
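Equation 22.1 is not reproduced in this chapter. As a rough illustration only, the sketch below uses one common form of empirical reliability, \(\mathrm{Var}(\hat{\theta}) / (\mathrm{Var}(\hat{\theta}) + \overline{SE^2})\), where \(\overline{SE^2}\) is the mean squared standard error across participants. Whether this matches Equation 22.1 exactly is an assumption, and the function name is hypothetical.

```python
from statistics import mean, variance


def empirical_reliability(theta_hats, standard_errors):
    """Assumed form: Var(theta_hat) / (Var(theta_hat) + mean(SE^2))."""
    var_theta = variance(theta_hats)  # sample variance of ability estimates
    mean_error_var = mean([se ** 2 for se in standard_errors])
    return var_theta / (var_theta + mean_error_var)


# Toy example: ability estimates with variance 1.0 and SE = 0.5 for everyone.
theta_hats = [-1.0, 0.0, 1.0]
standard_errors = [0.5, 0.5, 0.5]
print(empirical_reliability(theta_hats, standard_errors))
```

Under this form, reliability falls when the spread of ability estimates in a group shrinks, even if the standard errors stay the same. This is worth keeping in mind when comparing subgroups with different ability ranges.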

Table 18.2 reports marginal reliability by grade, based on a 25-item simulation of an optimal item selection algorithm using participant data. To ensure that ROAR-Letter is fair and equitable for different demographic groups, we also report reliability by gender (Table 18.3), eligibility for free and reduced-price lunch (Table 18.4), English learner status based on State of California designations (Table 18.5), primary language spoken (Table 18.6), special education status (Table 18.7), ethnicity (Table 18.8), and race (Table 18.9).

Characteristic | N | % | % Missing
Female | 451 | 22.71 | 53.02
Free or Reduced Lunch | 335 | 16.87 | 60.88
Race/Ethnicity:
Hispanic Ethnicity | 426 | 21.45 | 50.60
White | 326 | 16.41 | 50.60
Black or African American | 62 | 3.12 | 50.60
Asian | 84 | 4.23 | 50.60
American Indian or Alaska Native | 1 | 0.05 | 50.60
Hawaiian or Other Pacific Islander | 11 | 0.55 | 50.60
Multiracial | 19 | 0.96 | 50.60
Total | 1986
Table 18.1: Demographics of ROAR-Letter calibration sample.

Grade | Empirical Reliability | N
All | 0.88 | 1986
K | 0.90 | 894
1 | 0.80 | 678
2 | 0.71 | 414
Table 18.2: Reliability of computer-adaptive ROAR-Letter by Grade (simulated 25-item CAT)

Gender | Empirical Reliability | N
F | 0.82 | 451
M | 0.81 | 482
Table 18.3: Reliability of computer-adaptive ROAR-Letter by Gender (simulated 25-item CAT) (F=female, M=male)

Free/Reduced Lunch Status | Empirical Reliability | N
F | 0.80 | 244
P | 0.73 | 442
R | 0.79 | 91
Table 18.4: Reliability of computer-adaptive ROAR-Letter by FRL (simulated 25-item CAT) (F=Free, P=Paid, R=Reduced)

English Learner Status | Empirical Reliability | N
EL | 0.82 | 318
EO | 0.71 | 438
IFEP | NA | 48
RFEP | NA | 12
Table 18.5: Reliability of computer-adaptive ROAR-Letter by EL Status (simulated 25-item CAT) (EL=English Learner, EO=English Only, IFEP=Initial Fluent English Proficient, RFEP=Reclassified Fluent English Proficient) (not reported for N<50)

Primary Language | Empirical Reliability | N
English | 0.73 | 532
Other | NA | 3
Spanish | 0.81 | 219
Table 18.6: Reliability of computer-adaptive ROAR-Letter by Primary Language (simulated 25-item CAT) (not reported for N<50)

Special Education Status | Empirical Reliability | N
0 | 0.78 | 728
1 | 0.82 | 88
Table 18.7: Reliability of computer-adaptive ROAR-Letter by Special Education Status (simulated 25-item CAT)

Hispanic Ethnicity | Empirical Reliability | N
0 | 0.82 | 555
1 | 0.80 | 426
Table 18.8: Reliability of computer-adaptive ROAR-Letter by Hispanic Ethnicity (simulated 25-item CAT)

Race | Empirical Reliability | N
American Indian or Alaska Native | NA | 1
Asian | 0.70 | 84
Black or African American | NA | 11
Filipino | NA | 7
Hawaiian or Other Pacific Islander | NA | 11
Hispanic | 0.80 | 426
White | 0.67 | 240
Table 18.9: Reliability of computer-adaptive ROAR-Letter by Race (simulated 25-item CAT) (not reported for N<50)

References

Ma, Wanjing A, Adam Richie-Halford, Amy Burkhardt, Klint Kanopka, Clementine Chou, Benjamin Domingue, and Jason D Yeatman. 2023. “ROAR-CAT: Rapid Online Assessment of Reading Ability with Computerized Adaptive Testing.”