21 Reliability of ROAR-Composite

21.1 Reliability

Score reliability for the IRT-based composite can be computed using the standard formula for marginal reliability, whereas reliability for the overall weighted composite is computed using a special case of the Spearman-Brown formula (citation). Under classical test theory (CTT), score reliability is defined as the ratio of true-score variance to observed-score variance:

\[ \rho_{XX} = \frac{\mathrm{Var}(T_X)}{\mathrm{Var}(X)} \]

where \(T_X\) is the true score and \(X\) is the observed score. Using this definition, IRT marginal reliability, based on expected a posteriori (EAP) theta estimates, can be computed as

\[ \rho_{\text{marginal}} = \frac{\mathrm{Var}(\theta)} {\mathrm{Var}(\theta) + \mathrm{Var}(e)} \]

where the error variance, \(\mathrm{Var}(e)\), is defined as the expected posterior variance of the latent trait across examinees:

\[ \mathrm{Var}(e) = \mathrm{E}\!\left[\mathrm{Var}(\theta \mid \mathbf{u})\right] = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Var}(\theta \mid \mathbf{u}_i) \]

\(\theta\) = Latent trait being measured.

\(\mathbf{u}_i\) = Response vector for examinee \(i\).

\(\mathrm{Var}(\theta)\) = Population variance of the latent trait.

\(\mathrm{Var}(\theta \mid \mathbf{u}_i)\) = Posterior variance of \(\theta\) for examinee \(i\).

\(N\) = Number of examinees.
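As a rough illustration of the marginal reliability formula above, the following sketch averages per-examinee posterior variances to estimate \(\mathrm{Var}(e)\) and then forms the ratio. The function name, the example values, and the default \(\mathrm{Var}(\theta) = 1\) (a standard-normal latent scale) are assumptions made for illustration; they are not taken from the ROAR scoring pipeline.

```python
from statistics import fmean


def marginal_reliability(posterior_variances, trait_variance=1.0):
    """Marginal reliability: Var(theta) / (Var(theta) + Var(e)).

    posterior_variances: one Var(theta | u_i) per examinee.
    trait_variance: population variance of theta (assumed 1.0 here).
    """
    if not posterior_variances:
        raise ValueError("need at least one examinee")
    # Var(e) is estimated as the mean posterior variance across examinees.
    error_variance = fmean(posterior_variances)
    return trait_variance / (trait_variance + error_variance)


# Hypothetical posterior variances (squared posterior SDs) for five examinees.
example_variances = [0.04, 0.05, 0.06, 0.05, 0.05]
rho_marginal = marginal_reliability(example_variances)
print(f"marginal reliability: {rho_marginal:.3f}")
```

With an average posterior variance of 0.05 on a unit-variance scale, the ratio is 1 / 1.05, or about 0.95. Larger posterior variances (less precise theta estimates) pull the value down.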

The CTT definition of reliability can be readily extended to composite scores:

\[ \rho_{CC} = \frac{\mathrm{Var}(T_C)}{\mathrm{Var}(C)} \]

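where the composite true score is the weighted sum of the component true scores, with \(\mathbf{w}\) the vector of component weights and \(\mathbf{T}\) the vector of component true scores: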

\[ T_C = \mathbf{w}^\top \mathbf{T}. \]

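In matrix form, with \(\boldsymbol{\Sigma}_T\) and \(\boldsymbol{\Sigma}_X\) denoting the covariance matrices of the component true scores and observed scores, the general formula for composite reliability becomes (Mosier 1943):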

\[ \rho_{CC} = \frac{\mathbf{w}^\top \boldsymbol{\Sigma}_T \mathbf{w}} {\mathbf{w}^\top \boldsymbol{\Sigma}_X \mathbf{w}} \]
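The matrix form is a ratio of two quadratic forms, which can be sketched in a few lines. The helper names and the two-component covariance matrices below are hypothetical; in practice the true-score covariance matrix is not observed and has to be estimated.

```python
def quadratic_form(w, matrix):
    """Return w' M w for a weight list w and a square matrix (list of rows)."""
    k = len(w)
    if len(matrix) != k or any(len(row) != k for row in matrix):
        raise ValueError("matrix must be k x k, matching the weight vector")
    return sum(w[i] * matrix[i][j] * w[j] for i in range(k) for j in range(k))


def composite_reliability_matrix(w, sigma_true, sigma_obs):
    """Composite reliability: (w' Sigma_T w) / (w' Sigma_X w)."""
    return quadratic_form(w, sigma_true) / quadratic_form(w, sigma_obs)


# Hypothetical two-component example with equal weights.
weights = [0.5, 0.5]
sigma_obs = [[1.0, 0.6],
             [0.6, 1.0]]   # observed-score covariance matrix
sigma_true = [[0.9, 0.6],
              [0.6, 0.8]]  # assumed true-score covariance matrix

rho_cc = composite_reliability_matrix(weights, sigma_true, sigma_obs)
print(f"composite reliability: {rho_cc:.5f}")
```

Here the numerator is 0.725 and the denominator is 0.8, so the ratio is 0.90625.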

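The matrix formulation allows for correlated errors among the component scores. If we instead assume that the errors are uncorrelated with one another and with the true scores, the true-score covariances equal the observed covariances \(\sigma_{ij}\) and the true-score variances equal \(\rho_{ii}\sigma_i^2\), so the formula reduces to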

\[ \rho_{CC} = \frac{ \sum_{i=1}^{k} w_i^2 \, \rho_{ii} \, \sigma_i^2 \;+\; 2 \sum_{i<j} w_i w_j \, \sigma_{ij} }{ \sum_{i=1}^{k} w_i^2 \, \sigma_i^2 \;+\; 2 \sum_{i<j} w_i w_j \, \sigma_{ij} } \]

\(k\) = Number of component scores in the composite.

\(X_i\) = Observed score on component \(i\).

\(w_i\) = Weight assigned to component \(i\).

\(w_j\) = Weight assigned to component \(j\).

\(\rho_{ii}\) = Reliability of component \(i\).

\(\sigma_i^2 = \mathrm{Var}(X_i)\) = Variance of component \(i\).

\(\sigma_{ij} = \mathrm{Cov}(X_i, X_j)\) = Covariance between components \(i\) and \(j\).
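A minimal sketch of the uncorrelated-errors formula follows, assuming the component weights, component reliabilities, and observed covariance matrix are already available. The function name and all numeric values are made up for illustration and do not correspond to the ROAR components.

```python
def composite_reliability(weights, reliabilities, cov):
    """Composite reliability assuming uncorrelated errors across components.

    weights:       w_i, one per component
    reliabilities: rho_ii, one per component
    cov:           k x k observed covariance matrix (list of rows);
                   cov[i][i] is sigma_i^2 and cov[i][j] is sigma_ij
    """
    k = len(weights)
    if len(reliabilities) != k or len(cov) != k or any(len(r) != k for r in cov):
        raise ValueError("weights, reliabilities, and cov must agree in size")

    # Covariance terms are shared by the numerator and the denominator.
    shared = 2.0 * sum(
        weights[i] * weights[j] * cov[i][j]
        for i in range(k)
        for j in range(i + 1, k)
    )
    # True-score variance terms: w_i^2 * rho_ii * sigma_i^2
    true_part = sum(
        weights[i] ** 2 * reliabilities[i] * cov[i][i] for i in range(k)
    )
    # Observed variance terms: w_i^2 * sigma_i^2
    observed_part = sum(weights[i] ** 2 * cov[i][i] for i in range(k))
    return (true_part + shared) / (observed_part + shared)


# Hypothetical three-component composite.
w = [0.5, 0.3, 0.2]
rho = [0.90, 0.85, 0.80]
cov = [[1.0, 0.5, 0.4],
       [0.5, 1.0, 0.3],
       [0.4, 0.3, 1.0]]

rho_composite = composite_reliability(w, rho, cov)
print(f"composite reliability: {rho_composite:.3f}")
```

In this made-up example the composite reliability comes out higher than the reliability of any single component, because the positive covariances among components count entirely as true-score variance under the uncorrelated-errors assumption.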

21.2 Composite score reliability estimates

The empirical reliability of the foundational skills composite in the calibration sample was 0.93 (95% CI: 0.93 to 0.93), and the marginal reliability was 0.96. Students who answered every administered item correctly were excluded from the reliability calculation (1223 of 34472 excluded; 33249 retained).

Grade	N	Empirical Reliability	95% CI
All	11103	0.91	0.91 to 0.91
Kindergarten	3632	0.89	0.89 to 0.90
1	4238	0.88	0.88 to 0.89
2	3233	0.87	0.86 to 0.88

Table 21.1: Empirical reliability of ROAR-Composite by grade.

Gender	N	Empirical Reliability	95% CI
All	19457	0.94	0.94 to 0.94
Female	9518	0.94	0.94 to 0.94
Male	9939	0.94	0.94 to 0.94

Table 21.2: Empirical reliability of ROAR-Composite by gender.

Free or Reduced Lunch	N	Empirical Reliability	95% CI
All	4852	0.94	0.94 to 0.94
Free/Reduced	1301	0.93	0.92 to 0.93
Paid	3551	0.94	0.93 to 0.94

Table 21.3: Empirical reliability of ROAR-Composite by eligibility for free or reduced-price lunch.

English learner status	N	Empirical Reliability	95% CI
All	6238	0.94	0.94 to 0.94
English Learner	1796	0.94	0.94 to 0.94
English Only	3567	0.94	0.93 to 0.94
Initial Fluent English Proficient	643	0.93	0.92 to 0.94
Reclassified Fluent English Proficient	232	0.91	0.88 to 0.92

Table 21.4: Empirical reliability of ROAR-Composite by English learner status.

Primary language	N	Empirical Reliability	95% CI
All	4584	0.94	0.94 to 0.94
English	2968	0.94	0.93 to 0.94
Other	774	0.92	0.92 to 0.93
Spanish	842	0.93	0.92 to 0.94

Table 21.5: Empirical reliability of ROAR-Composite by primary language.

IEP / Special Education	N	Empirical Reliability	95% CI
All	5503	0.94	0.94 to 0.94
No	5123	0.94	0.94 to 0.94
Yes	380	0.94	0.92 to 0.94

Table 21.6: Empirical reliability of ROAR-Composite by IEP / special education status.

Hispanic ethnicity	N	Empirical Reliability	95% CI
All	19754	0.94	0.94 to 0.94
No	14368	0.94	0.93 to 0.94
Yes	5386	0.95	0.95 to 0.95

Table 21.7: Empirical reliability of ROAR-Composite by Hispanic ethnicity.

Race	N	Empirical Reliability	95% CI
All	15645	0.94	0.94 to 0.95
American Indian/Alaska Native	81	0.96	0.94 to 0.97
Asian	1533	0.94	0.93 to 0.94
Black/African American	2014	0.92	0.92 to 0.93
Hispanic/Latinx	3054	0.94	0.93 to 0.94
Multiracial	4647	0.92	0.92 to 0.93
Native Hawaiian/Other Pacific Islander	51	0.93	0.89 to 0.95
White	4265	0.95	0.95 to 0.95

Table 21.8: Empirical reliability of ROAR-Composite by race.

References

Mosier, Charles I. 1943. “On the Reliability of a Weighted Composite.” Psychometrika 8 (3): 161–68. https://doi.org/10.1007/BF02288700.