20  Reliability of ROAR-Composite

20.1 Reliability

Score reliability for the IRT-based composite can be computed using the standard formula for marginal reliability, whereas reliability for the overall weighted composite is computed using a special case of the Spearman-Brown formula (citation). Under classical test theory (CTT), score reliability is defined as the ratio of true-score variance to observed-score variance:

\[ \rho_{XX} = \frac{\mathrm{Var}(T_X)}{\mathrm{Var}(X)} \]

where \(T_X\) is the true score and \(X\) is the observed score. Using this definition, IRT marginal reliability, based on expected a posteriori (EAP) theta estimates, can be computed as

\[ \rho_{\text{marginal}} = \frac{\mathrm{Var}(\theta)} {\mathrm{Var}(\theta) + \mathrm{Var}(e)} \]

where the error variance, \(\mathrm{Var}(e)\), is defined as the expected posterior variance of the latent trait across examinees:

\[ \mathrm{Var}(e) = \mathrm{E}\!\left[\mathrm{Var}(\theta \mid \mathbf{u})\right] = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Var}(\theta \mid \mathbf{u}_i) \]

\(\theta\) = Latent trait being measured.

\(\mathbf{u}_i\) = Response vector for examinee \(i\).

\(\mathrm{Var}(\theta)\) = Population variance of the latent trait.

\(\mathrm{Var}(\theta \mid \mathbf{u}_i)\) = Posterior variance of \(\theta\) for examinee \(i\).

\(N\) = Number of examinees.
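The marginal reliability formula above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used to produce the reported estimates: the function name `marginal_reliability`, the default of 1.0 for \(\mathrm{Var}(\theta)\) (which would hold under a standard normal population model), and the example posterior variances are all assumptions introduced here.

```python
import numpy as np


def marginal_reliability(posterior_variances, theta_variance=1.0):
    """Var(theta) / (Var(theta) + Var(e)), with Var(e) the mean posterior variance.

    posterior_variances: one Var(theta | u_i) per examinee.
    theta_variance: population variance of theta (1.0 is an assumed default).
    """
    posterior_variances = np.asarray(posterior_variances, dtype=float)
    if posterior_variances.ndim != 1 or posterior_variances.size == 0:
        raise ValueError("posterior_variances must be a non-empty 1-D sequence")
    if np.any(posterior_variances < 0) or theta_variance <= 0:
        raise ValueError("variances must be non-negative and Var(theta) positive")

    # Var(e) = (1/N) * sum_i Var(theta | u_i)
    error_variance = posterior_variances.mean()
    return float(theta_variance / (theta_variance + error_variance))


# Made-up posterior variances for four examinees (illustrative only).
example_posterior_variances = [0.04, 0.05, 0.06, 0.05]
example_reliability = marginal_reliability(example_posterior_variances)
print(round(example_reliability, 3))
```

With a mean posterior variance of 0.05 and \(\mathrm{Var}(\theta) = 1\), the ratio is \(1 / 1.05\), so smaller posterior variances push the estimate toward 1.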


The CTT definition of reliability can be readily extended to composite scores:

\[ \rho_{CC} = \frac{\mathrm{Var}(T_C)}{\mathrm{Var}(C)} \]

where the composite true score is the weighted sum of the component true scores \(\mathbf{T}\), with weight vector \(\mathbf{w}\):

\[ T_C = \mathbf{w}^\top \mathbf{T}. \]

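With \(\boldsymbol{\Sigma}_T\) and \(\boldsymbol{\Sigma}_X\) denoting the covariance matrices of the component true scores and observed scores, respectively, the general formula for composite reliability in matrix form becomes (Mosier 1943):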

\[ \rho_{CC} = \frac{\mathbf{w}^\top \boldsymbol{\Sigma}_T \mathbf{w}} {\mathbf{w}^\top \boldsymbol{\Sigma}_X \mathbf{w}} \]
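
One way to read this ratio, sketched under the usual CTT assumption that errors are uncorrelated with true scores: the observed covariance matrix splits into a true-score part and an error part. The symbol \(\boldsymbol{\Sigma}_E\) for the error covariance matrix is introduced here for illustration only and is not part of the original notation.

\[ \boldsymbol{\Sigma}_X = \boldsymbol{\Sigma}_T + \boldsymbol{\Sigma}_E \quad\Longrightarrow\quad \rho_{CC} = \frac{\mathbf{w}^\top \boldsymbol{\Sigma}_T \mathbf{w}} {\mathbf{w}^\top \boldsymbol{\Sigma}_T \mathbf{w} + \mathbf{w}^\top \boldsymbol{\Sigma}_E \mathbf{w}} = 1 - \frac{\mathbf{w}^\top \boldsymbol{\Sigma}_E \mathbf{w}} {\mathbf{w}^\top \boldsymbol{\Sigma}_X \mathbf{w}} \]

Nonzero off-diagonal entries of \(\boldsymbol{\Sigma}_E\) correspond to errors that are correlated across components.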

This formulation allows for correlated errors among the component scores; however, if we assume that the errors are uncorrelated, the formula reduces to

\[ \rho_{CC} = \frac{ \sum_{i=1}^{k} w_i^2 \, \rho_{ii} \, \sigma_i^2 \;+\; 2 \sum_{i<j} w_i w_j \, \sigma_{ij} }{ \sum_{i=1}^{k} w_i^2 \, \sigma_i^2 \;+\; 2 \sum_{i<j} w_i w_j \, \sigma_{ij} } \]

\(k\) = Number of component scores in the composite.

\(X_i\) = Observed score on component \(i\).

\(w_i\) = Weight assigned to component \(i\).

\(w_j\) = Weight assigned to component \(j\).

\(\rho_{ii}\) = Reliability of component \(i\).

\(\sigma_i^2 = \mathrm{Var}(X_i)\) = Variance of component \(i\).

\(\sigma_{ij} = \mathrm{Cov}(X_i, X_j)\) = Covariance between components \(i\) and \(j\).
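
The two composite formulas above can be compared numerically. The sketch below is illustrative only: the function names, weights, reliabilities, and covariance values are made up for the example and are not ROAR estimates. It builds \(\boldsymbol{\Sigma}_T\) under the uncorrelated-errors assumption (same off-diagonals as \(\boldsymbol{\Sigma}_X\), diagonal entries \(\rho_{ii}\sigma_i^2\)) and checks that the matrix form and the summation form agree. As a further sanity check, it evaluates the summation form for equally weighted parallel components and compares the result with the Spearman-Brown value \(k\rho / (1 + (k-1)\rho)\).

```python
import numpy as np


def composite_reliability_matrix(weights, true_cov, observed_cov):
    """Matrix form: (w' Sigma_T w) / (w' Sigma_X w)."""
    w = np.asarray(weights, dtype=float)
    sigma_t = np.asarray(true_cov, dtype=float)
    sigma_x = np.asarray(observed_cov, dtype=float)
    return float(w @ sigma_t @ w) / float(w @ sigma_x @ w)


def composite_reliability_sum(weights, reliabilities, observed_cov):
    """Summation form; valid only when component errors are uncorrelated."""
    w = np.asarray(weights, dtype=float)
    rho = np.asarray(reliabilities, dtype=float)
    sigma_x = np.asarray(observed_cov, dtype=float)
    variances = np.diag(sigma_x)

    # 2 * sum_{i<j} w_i w_j sigma_ij, taken over the strict upper triangle.
    i_idx, j_idx = np.triu_indices(len(w), k=1)
    cross_term = 2.0 * np.sum(w[i_idx] * w[j_idx] * sigma_x[i_idx, j_idx])

    true_part = np.sum(w**2 * rho * variances)
    observed_part = np.sum(w**2 * variances)
    return float((true_part + cross_term) / (observed_part + cross_term))


# Illustrative inputs only (hypothetical values, not ROAR estimates).
example_weights = [0.5, 0.3, 0.2]
example_reliabilities = [0.90, 0.85, 0.80]
example_observed_cov = np.array([
    [1.00, 0.50, 0.40],
    [0.50, 1.00, 0.45],
    [0.40, 0.45, 1.00],
])

# Under uncorrelated errors, Sigma_T differs from Sigma_X only on the diagonal.
example_true_cov = example_observed_cov.copy()
np.fill_diagonal(
    example_true_cov,
    np.asarray(example_reliabilities) * np.diag(example_observed_cov),
)

rho_sum = composite_reliability_sum(
    example_weights, example_reliabilities, example_observed_cov
)
rho_matrix = composite_reliability_matrix(
    example_weights, example_true_cov, example_observed_cov
)

# Parallel components: equal weights, unit variances, covariance equal to rho.
k, rho_parallel = 4, 0.7
parallel_cov = np.full((k, k), rho_parallel)
np.fill_diagonal(parallel_cov, 1.0)
rho_parallel_composite = composite_reliability_sum(
    np.ones(k), np.full(k, rho_parallel), parallel_cov
)
spearman_brown = k * rho_parallel / (1 + (k - 1) * rho_parallel)

print(np.isclose(rho_sum, rho_matrix))
print(np.isclose(rho_parallel_composite, spearman_brown))
```

Keeping the two functions separate makes the assumption explicit: the matrix version accepts any \(\boldsymbol{\Sigma}_T\), while the summation version hard-codes uncorrelated errors.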

20.2 Composite score reliability estimates

The empirical reliability of the foundational skills composite in the calibration sample was 0.93 (95% CI: 0.93 to 0.93), and the marginal reliability was 0.96. Students who answered every administered item correctly were excluded from the reliability calculation (1223 of 34472 excluded; 33249 retained).

| Grade | N | Empirical Reliability | 95% CI |
|---|---|---|---|
| All | 11103 | 0.91 | 0.91 to 0.91 |
| 1 | 4238 | 0.88 | 0.88 to 0.89 |
| 2 | 3233 | 0.87 | 0.86 to 0.88 |
| Kindergarten | 3632 | 0.89 | 0.89 to 0.90 |

Table 20.1: Empirical reliability of ROAR-Composite by grade.

| Gender | N | Empirical Reliability | 95% CI |
|---|---|---|---|
| All | 19457 | 0.94 | 0.94 to 0.94 |
| Female | 9518 | 0.94 | 0.94 to 0.94 |
| Male | 9939 | 0.94 | 0.94 to 0.94 |

Table 20.2: Empirical reliability of ROAR-Composite by gender.

| Free or Reduced Lunch | N | Empirical Reliability | 95% CI |
|---|---|---|---|
| All | 4852 | 0.94 | 0.94 to 0.94 |
| Free/Reduced | 1301 | 0.93 | 0.92 to 0.93 |
| Paid | 3551 | 0.94 | 0.93 to 0.94 |

Table 20.3: Empirical reliability of ROAR-Composite by eligibility for free or reduced-price lunch.

| English learner status | N | Empirical Reliability | 95% CI |
|---|---|---|---|
| All | 6238 | 0.94 | 0.94 to 0.94 |
| English Learner | 1796 | 0.94 | 0.94 to 0.94 |
| English Only | 3567 | 0.94 | 0.93 to 0.94 |
| Initial Fluent English Proficient | 643 | 0.93 | 0.92 to 0.94 |
| Reclassified Fluent English Proficient | 232 | 0.91 | 0.88 to 0.92 |

Table 20.4: Empirical reliability of ROAR-Composite by English learner status.

| Primary language | N | Empirical Reliability | 95% CI |
|---|---|---|---|
| All | 4584 | 0.94 | 0.94 to 0.94 |
| English | 2968 | 0.94 | 0.93 to 0.94 |
| Other | 774 | 0.92 | 0.92 to 0.93 |
| Spanish | 842 | 0.93 | 0.92 to 0.94 |

Table 20.5: Empirical reliability of ROAR-Composite by primary language.

| IEP / Special Education | N | Empirical Reliability | 95% CI |
|---|---|---|---|
| All | 5503 | 0.94 | 0.94 to 0.94 |
| No | 5123 | 0.94 | 0.94 to 0.94 |
| Yes | 380 | 0.94 | 0.92 to 0.94 |

Table 20.6: Empirical reliability of ROAR-Composite by IEP / special education status.

| Hispanic ethnicity | N | Empirical Reliability | 95% CI |
|---|---|---|---|
| All | 19754 | 0.94 | 0.94 to 0.94 |
| No | 14368 | 0.94 | 0.93 to 0.94 |
| Yes | 5386 | 0.95 | 0.95 to 0.95 |

Table 20.7: Empirical reliability of ROAR-Composite by Hispanic ethnicity.

| Race | N | Empirical Reliability | 95% CI |
|---|---|---|---|
| All | 15645 | 0.94 | 0.94 to 0.95 |
| American Indian/Alaska Native | 81 | 0.96 | 0.94 to 0.97 |
| Asian | 1533 | 0.94 | 0.93 to 0.94 |
| Black/African American | 2014 | 0.92 | 0.92 to 0.93 |
| Hispanic/Latinx | 3054 | 0.94 | 0.93 to 0.94 |
| Multiracial | 4647 | 0.92 | 0.92 to 0.93 |
| Native Hawaiian/Other Pacific Islander | 51 | 0.93 | 0.89 to 0.95 |
| White | 4265 | 0.95 | 0.95 to 0.95 |

Table 20.8: Empirical reliability of ROAR-Composite by race.

References

Mosier, Charles I. 1943. “On the Reliability of a Weighted Composite.” Psychometrika 8 (3): 161–68. https://doi.org/10.1007/BF02288700.