9 Foundational Skills Composite Score (ROAR-Composite)

The ROAR assessment suite provides scores for each of the individual measures (see Section 2.1 for a description of the Foundational Reading Skills Suite). These scores provide information about different facets of the underlying construct. Patterns of performance across the measures can be useful for understanding students’ relative strengths and weaknesses in addition to informing instructional decisions. However, there are instances where an omnibus indicator of student performance is desirable.

9.1 Simple Weighted Composites

There are a variety of approaches that can be used to create composite scores from a set of subtest results, although composite scores are generally conceptualized as a linear combination of two or more component scores (Lord and Novick 1968; Nunnally and Bernstein 1994). That is, the composite score is a weighted sum of the individual component scores.

\[ C = \mathbf{w}^\top \mathbf{X} = \sum_{i=1}^{k} w_i X_i \]

where \(C\) is the composite score, \(X_i\) is the \(i\)-th component score, \(w_i\) is the weight assigned to the \(i\)-th component, and \(k\) is the number of components. Depending on the intended use of the composite scores, different approaches can be used to identify the weights. For instance, one could choose weights that emphasize components in terms of relevance or importance, maximize the overall reliability, or maximize the correlation with an external measure or criterion, among others.
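As a minimal sketch of the weighted sum above, the following Python snippet computes \(C\) for two students. The function name `composite_score`, the weights, and the scores are all hypothetical values chosen for illustration; they are not ROAR weights or data.

```python
def composite_score(weights, scores):
    """Return the weighted sum C = sum_i w_i * X_i for one student."""
    if len(weights) != len(scores):
        raise ValueError("weights and scores must have the same length")
    return sum(w * x for w, x in zip(weights, scores))


# Hypothetical weights and standardized component scores (not ROAR values).
weights = [0.5, 0.3, 0.2]
student_a = [1.0, 0.0, -1.0]  # strong on component 1, weak on component 3
student_b = [0.0, 1.0, 0.0]   # strong on component 2 only

c_a = composite_score(weights, student_a)  # 0.5*1.0 + 0.3*0.0 + 0.2*(-1.0)
c_b = composite_score(weights, student_b)  # 0.5*0.0 + 0.3*1.0 + 0.2*0.0
print(c_a, c_b)
```

Both students end up with approximately the same composite despite different component profiles, which is one reason the component scores remain informative alongside the composite.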

Another important consideration in the identification of weights is the relationship between the component scores. If the components are mostly independent, the effect of the weights is unlikely to inflate or suppress the score contribution of individual components. As such, the resulting composite will be compensatory; low scores on one measure can be compensated for by high scores on another measure to provide a midrange composite score. By extension, low/high composite scores will generally reflect low/high scores across all measures. Conversely, when the component scores are correlated, the selected weights will interact with the underlying covariance structure. This can potentially distort inferences about individuals’ overall performance. For instance, higher performance on a more heavily weighted skill could mask lower performance on a less heavily weighted skill; the individual might receive a high composite score rather than, say, a midrange score. For this reason, it is important to consider both the component scores and the composite score when drawing inferences about student performance.
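One way to see how the weights interact with the covariance structure is to expand the variance of the composite defined above. This is the standard result for the variance of a linear combination, written here with \(\sigma_i^2\) for the variance of component \(i\) and \(\sigma_{ij}\) for the covariance between components \(i\) and \(j\):

\[ \mathrm{Var}(C) = \sum_{i=1}^{k} w_i^2 \, \sigma_i^2 \;+\; 2 \sum_{i<j} w_i w_j \, \sigma_{ij} \]

When the components are uncorrelated (\(\sigma_{ij} = 0\)), each component contributes \(w_i^2 \sigma_i^2\) to the composite variance on its own. When they are correlated, the cross terms make a component's effective contribution depend on its covariance with the other components as well as on its own weight.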

9.2 IRT-Based Composites

As an alternative to establishing composite scores via a linear combination of component scores, composite scales can be established within a latent variable framework. If the underlying construct is essentially unidimensional (Stout 1987; Nandakumar 1991), the item responses from all the component measures can be calibrated concurrently to create a composite scale. This is the approach most commonly used for large-scale educational assessments. On the other hand, if the components measure different but related facets of the construct, component scores and scales can be modeled using a factor analytic method like multidimensional item response theory (MIRT; Reckase 2009) and then projected onto a composite scale (Strachan et al. 2021; Reise et al. 2025). Note that the projection method based on MIRT parameters is not entirely dissimilar to what happens in the unidimensional case. The key distinction is that if the data are fit using a unidimensional model, the underlying factors (or components) are projected onto a single scale as a weighted linear composite (Wang 1986; Reckase 2009) where the weights essentially maximize the variance explained by the composite scores. The projective IRT approach, on the other hand, provides a separate projection for each item rather than a common weight for each component.

9.3 ROAR Composite

The overall ROAR composite uses a combination of IRT and a weighted sum approach. The individual scales for ROAR-Letter, ROAR-Word, and ROAR-Phoneme are based on a modified Rasch model, whereas the scale for ROAR-Sentence is a simple sum score. An examination of the multidimensional structure of these four measures suggests that they measure separate but related facets of the construct (link to tech report section).

As a first step in developing an overall ROAR composite, we created an IRT-based composite using the Letter, Word, and Phoneme measures. Several candidate composites were considered: projective IRT composites based on a simple structure (confirmatory factor analysis) model and a bifactor model, and unidimensional composites derived from concurrently calibrated items using a modified Rasch model and a 2PL model. The scores for each composite were compared in terms of reliability, predictive validity, and classification accuracy (overall and at each grade level). Based on these comparisons, we determined that the modified Rasch model provided the best indicator of student performance for Letter, Word, and Phoneme. Note that this approach provides composite scores even if students do not take all three measures.

With the IRT-based composite in place, we moved to the integration of ROAR-Sentence scores. Since ROAR-Sentence is not often administered in earlier grades, composite scores that include ROAR-Sentence as a component will likely include missing data. Using all available data, Sentence scores were imputed using K-nearest neighbor (KNN; Troyanskaya et al. 2001) to provide a complete set of results to include in a simple weighted composite. The actual and imputed Sentence scores were evaluated in combination with the modified Rasch composite scores using a principal components analysis (PCA). This provided the weights (the eigenvector for the first principal component) for the IRT-based composite and Sentence scores. Note that these weights are not re-estimated each time overall composite scores are computed; rather, they are applied to the IRT-based composite scores and Sentence scores (or imputed Sentence scores) as a simple weighted sum.
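The impute-then-weight workflow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ROAR implementation: the KNN step uses a single predictor (the IRT-based composite) and an unweighted neighbor mean, the PCA is run on the unstandardized sample covariance matrix, and the data, `knn_impute`, and `first_pc_weights` are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill missing second-column values (None) with the mean of the k
    complete rows whose first-column value is closest."""
    complete = [r for r in rows if r[1] is not None]
    filled = []
    for x, y in rows:
        if y is None:
            nearest = sorted(complete, key=lambda r: abs(r[0] - x))[:k]
            y = sum(r[1] for r in nearest) / len(nearest)
        filled.append((x, y))
    return filled


def first_pc_weights(rows):
    """Unit-length eigenvector for the largest eigenvalue of the 2x2
    sample covariance matrix of the two columns."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    a = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)        # Var(column 1)
    c = sum((y - my) ** 2 for _, y in rows) / (n - 1)        # Var(column 2)
    b = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)  # covariance
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    v = (b, lam - a)  # solves (Sigma - lam * I) v = 0
    norm = math.hypot(*v)
    if norm == 0:  # uncorrelated columns with column 1 dominant
        return (1.0, 0.0)
    return (v[0] / norm, v[1] / norm)


# Hypothetical (IRT composite, Sentence) pairs; None marks a missing Sentence score.
rows = [(-1.0, -1.2), (0.1, None), (0.5, 0.4), (1.0, 1.1), (-0.6, None), (1.5, 1.8)]

filled = knn_impute(rows, k=2)
w1, w2 = first_pc_weights(filled)  # estimated once, then held fixed
composites = [w1 * x + w2 * y for x, y in filled]
print(filled)
print(w1, w2)
```

In this sketch the weights are estimated a single time from the completed data and then reused, mirroring the point that they are not re-estimated each time composite scores are computed.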

9.4 Reliability

Score reliability for the IRT-based composite can be computed using the standard formula for marginal reliability, whereas reliability for the overall weighted composite is computed using a special case of the Spearman-Brown formula (citation). Under classical test theory (CTT), score reliability is defined as the ratio of true-score variance to observed variance:

\[ \rho_{XX} = \frac{\mathrm{Var}(T_X)}{\mathrm{Var}(X)} \]

where \(T_X\) is the true score. Using this definition, IRT marginal reliability, based on expected a posteriori (EAP) theta estimates, can be computed as

\[ \rho_{\text{marginal}} = \frac{\mathrm{Var}(\theta)} {\mathrm{Var}(\theta) + \mathrm{Var}(e)} \]

where the error variance, \(\mathrm{Var}(e)\), is defined as the expected posterior variance of the latent trait across examinees:

\[ \mathrm{Var}(e) = \mathrm{E}\!\left[\mathrm{Var}(\theta \mid \mathbf{u})\right] = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Var}(\theta \mid \mathbf{u}_i) \]

\(\theta\) = Latent trait being measured.

\(\mathbf{u}_i\) = Response vector for examinee \(i\).

\(\mathrm{Var}(\theta)\) = Population variance of the latent trait.

\(\mathrm{Var}(\theta \mid \mathbf{u}_i)\) = Posterior variance of \(\theta\) for examinee \(i\).

\(N\) = Number of examinees.
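The marginal reliability formula above translates directly into code. In this minimal sketch, the function name `marginal_reliability` and the input values are hypothetical; each posterior variance is assumed to be the squared posterior standard deviation of an examinee's EAP estimate.

```python
def marginal_reliability(trait_variance, posterior_variances):
    """rho = Var(theta) / (Var(theta) + Var(e)), where Var(e) is the
    mean posterior variance across examinees."""
    if not posterior_variances:
        raise ValueError("need at least one examinee")
    error_variance = sum(posterior_variances) / len(posterior_variances)
    return trait_variance / (trait_variance + error_variance)


# Hypothetical values: unit trait variance and four examinees.
posterior_variances = [0.04, 0.05, 0.06, 0.05]
rho = marginal_reliability(1.0, posterior_variances)
print(round(rho, 3))  # 0.952
```

Here the mean posterior variance is 0.05, so the estimate is 1 / 1.05.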

The CTT definition of reliability can be readily extended to composite scores:

\[ \rho_{CC} = \frac{\mathrm{Var}(T_C)}{\mathrm{Var}(C)} \]

where the composite true score is the same weighted combination applied to the vector of component true scores, \(\mathbf{T}\):

\[ T_C = \mathbf{w}^\top \mathbf{T}. \]

The general formula for composite reliability (in matrix form) becomes (Mosier 1943):

\[ \rho_{CC} = \frac{\mathbf{w}^\top \boldsymbol{\Sigma}_T \mathbf{w}} {\mathbf{w}^\top \boldsymbol{\Sigma}_X \mathbf{w}} \]

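where \(\boldsymbol{\Sigma}_T\) and \(\boldsymbol{\Sigma}_X\) are the covariance matrices of the true and observed component scores, respectively. This formulation allows for correlated errors among the component scores. If we assume that the errors are uncorrelated, the off-diagonal elements of \(\boldsymbol{\Sigma}_T\) equal the observed covariances \(\sigma_{ij}\) and the diagonal elements equal \(\rho_{ii} \sigma_i^2\), so the formula reduces to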

\[ \rho_{CC} = \frac{ \sum_{i=1}^{k} w_i^2 \, \rho_{ii} \, \sigma_i^2 \;+\; 2 \sum_{i<j} w_i w_j \, \sigma_{ij} }{ \sum_{i=1}^{k} w_i^2 \, \sigma_i^2 \;+\; 2 \sum_{i<j} w_i w_j \, \sigma_{ij} } \]

\(k\) = Number of component scores in the composite.

\(X_i\) = Observed score on component \(i\).

\(w_i\) = Weight assigned to component \(i\).

\(w_j\) = Weight assigned to component \(j\).

\(\rho_{ii}\) = Reliability of component \(i\).

\(\sigma_i^2 = \mathrm{Var}(X_i)\) = Variance of component \(i\).

\(\sigma_{ij} = \mathrm{Cov}(X_i, X_j)\) = Covariance between components \(i\) and \(j\).
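A minimal Python sketch of the uncorrelated-errors formula follows. The function name `composite_reliability` and all numeric inputs are hypothetical and chosen so the arithmetic is easy to follow; they are not ROAR estimates.

```python
def composite_reliability(weights, reliabilities, cov):
    """Composite reliability assuming uncorrelated errors.

    weights:       list of k component weights w_i
    reliabilities: list of k component reliabilities rho_ii
    cov:           k x k observed covariance matrix (list of lists)
    """
    k = len(weights)
    # Cross terms are shared by numerator and denominator because, with
    # uncorrelated errors, true and observed covariances coincide.
    cross = 2 * sum(
        weights[i] * weights[j] * cov[i][j]
        for i in range(k)
        for j in range(i + 1, k)
    )
    true_part = sum(weights[i] ** 2 * reliabilities[i] * cov[i][i] for i in range(k))
    observed_part = sum(weights[i] ** 2 * cov[i][i] for i in range(k))
    return (true_part + cross) / (observed_part + cross)


# Hypothetical two-component example with unit variances and covariance 0.5.
weights = [1.0, 1.0]
reliabilities = [0.9, 0.8]
cov = [[1.0, 0.5], [0.5, 1.0]]

rho_cc = composite_reliability(weights, reliabilities, cov)
print(rho_cc)  # (0.9 + 0.8 + 1.0) / (1.0 + 1.0 + 1.0), approximately 0.9
```

In this example the composite is more reliable than the weaker component because the positive covariance adds true-score variance without adding error variance.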

9.5 Composite Score Reliability Estimates

Table 1 shows the marginal reliabilities for the IRT composite based on Letter, Word, and Phoneme items.

Grade	Overall	K	1	2	3
Reliability	0.96	0.94	0.93	0.94	0.94
Number of Students	32419	2658	3115	3539	2037

Table 2 shows the overall composite score reliabilities for the combined IRT composite and Sentence scores. Note that no overall composite reliability is reported for kindergarteners. This is because the number of kindergarten students taking the Sentence test is very small, making imputation of Sentence scores for these students untenable. The composite weights for the IRT and Sentence components, respectively, are \(w_1\) = 0.6925362 and \(w_2\) = 0.7213831.

Grade	Overall	1	2	3
Reliability	0.95	0.95	0.96	0.96
Number of Students	32642	3121	3559	2043
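As a quick check on the reported weights, the sketch below confirms that they have approximately unit length, as expected for a principal component eigenvector, and applies them as a fixed weighted sum. The input scores and the helper name `overall_composite` are hypothetical; whether the components are centered or rescaled before weighting is not stated here, so the inputs are placeholders.

```python
import math

# Weights reported in the text for the IRT composite and Sentence components.
w_irt, w_sentence = 0.6925362, 0.7213831

# An eigenvector from PCA is conventionally scaled to unit length.
norm = math.hypot(w_irt, w_sentence)
print(round(norm, 6))


def overall_composite(irt_score, sentence_score):
    """Fixed weighted sum of the two component scores."""
    return w_irt * irt_score + w_sentence * sentence_score


# Hypothetical component scores for one student.
example = overall_composite(0.5, -0.25)
print(example)
```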

References

Lord, Frederic M., and Melvin R. Novick. 1968. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.

Mosier, Charles I. 1943. “On the Reliability of a Weighted Composite.” Psychometrika 8 (3): 161–68. https://doi.org/10.1007/BF02288700.

Nandakumar, Ratna. 1991. “Traditional Dimensionality Versus Essential Dimensionality.” Journal of Educational Measurement 28 (2): 99–117.

Nunnally, Jum C., and Ira H. Bernstein. 1994. Psychometric Theory. 3rd ed. New York: McGraw-Hill.

Reckase, Mark D. 2009. Multidimensional Item Response Theory. New York: Springer. https://doi.org/10.1007/978-0-387-89976-3.

Reise, Steven P., Jared M. Block, Maxwell Mansolf, Mark G. Haviland, Benjamin D. Schalet, and Rachel Kimerling. 2025. “Using Projective IRT to Evaluate the Effects of Multidimensionality on Unidimensional IRT Model Parameters.” Multivariate Behavioral Research 60 (2): 345–61.

Stout, William. 1987. “A Nonparametric Approach for Assessing Latent Trait Unidimensionality.” Psychometrika 52 (4): 589–617.

Strachan, Tyler, Uk Hyun Cho, Kyung Yong Kim, John T. Willse, Shyh-Huei Chen, Edward H. Ip, Terry A. Ackerman, and Jonathan P. Weeks. 2021. “Using a Projection IRT Method for Vertical Scaling When Construct Shift Is Present.” Journal of Educational Measurement 58 (2): 211–35.

Troyanskaya, Olga, Michael Cantor, Gavin Sherlock, Patrick Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. 2001. “Missing Value Estimation Methods for DNA Microarrays.” Bioinformatics 17 (6): 520–25. https://doi.org/10.1093/bioinformatics/17.6.520.

Wang, Margaret M. 1986. “Fitting a Unidimensional Model to Multidimensional Item Response Data: The Effects of Latent Space Misspecification on the Application of IRT.”