9  Foundational Skills Composite Score (ROAR-Composite)

The ROAR assessment suite provides scores for each of the individual measures (see Section 2.1 for a description of the Foundational Reading Skills Suite). These scores provide information about different facets of the underlying construct. Patterns of performance across the measures can be useful for understanding students’ relative strengths and weaknesses in addition to informing instructional decisions. However, there are instances where an omnibus indicator of student performance is desirable.

9.1 Simple Weighted Composites

There are a variety of approaches that can be used to create composite scores from a set of subtest results, although composite scores are generally conceptualized as a linear combination of two or more component scores (Lord and Novick 1968; Nunnally and Bernstein 1994). That is, the composite score is a weighted sum of the individual component scores.

\[ C = \mathbf{w}^\top \mathbf{X} = \sum_{i=1}^{k} w_i X_i \]

where \(C\) is the composite score, \(X_i\) is the \(i\)-th component score, \(w_i\) is the weight assigned to the \(i\)-th component, and \(k\) is the number of components. Depending on the intended use of the composite scores, different approaches can be used to identify the weights. For instance, one could choose weights that emphasize components in terms of relevance or importance, maximize the overall reliability, or maximize the correlation with an external measure or criterion, among others.
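The weighted sum above can be sketched in a few lines of Python. This is a minimal illustration only: the scores, the weights, and the decision to standardize each component before weighting are all assumptions made for the example, not values or steps taken from ROAR.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Convert raw scores to z-scores so the components share a metric."""
    mu, sd = mean(scores), pstdev(scores)
    return [(x - mu) / sd for x in scores]


def weighted_composite(components, weights):
    """Return C = sum_i w_i * X_i for each student.

    components: list of k score lists (one list per component, equal length)
    weights:    list of k weights, one per component
    """
    if len(components) != len(weights):
        raise ValueError("need exactly one weight per component")
    return [
        sum(w * x for w, x in zip(weights, student))
        for student in zip(*components)
    ]


# Hypothetical scores for four students on three components.
letter = [10.0, 12.0, 14.0, 16.0]
word = [20.0, 30.0, 40.0, 50.0]
phoneme = [5.0, 7.0, 6.0, 8.0]

z_scores = [standardize(x) for x in (letter, word, phoneme)]
weights = [0.2, 0.5, 0.3]  # illustrative weights, not ROAR's
composite = weighted_composite(z_scores, weights)
print(composite)
```

Because every component is centered before weighting, the composite is also centered at zero in this sketch; a different scaling choice would change the metric of \(C\) but not the form of the sum.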

Another important consideration in the identification of weights is the relationship between the component scores. If the components are mostly independent, the effect of the weights is unlikely to inflate or suppress the score contribution of individual components. As such, the resulting composite will be compensatory; low scores on one measure can be compensated for by high scores on another measure to provide a midrange composite score. By extension, low/high composite scores will generally reflect low/high scores across all measures. Conversely, when the component scores are correlated, the selected weights will interact with the underlying covariance structure. This can potentially distort inferences about individuals’ overall performance. For instance, higher performance on a more heavily weighted skill could mask lower performance on a less heavily weighted skill; the individual might receive a high composite score rather than, say, a midrange score. For this reason, it is important to consider both the component scores and the composite score when drawing inferences about student performance.
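One way to see the interaction between weights and covariance structure is to compare the nominal weights with each component's share of the composite variance. The sketch below uses the standard identities \(\mathrm{Var}(C) = \mathbf{w}^\top \Sigma \mathbf{w}\) and \(\mathrm{Cov}(X_i, C) = \sum_j w_j \sigma_{ij}\); the weights and covariance matrices are hypothetical, and the function names are introduced only for this illustration.

```python
def composite_variance(weights, cov):
    """Var(C) = w' Sigma w for a weighted sum of components."""
    k = len(weights)
    return sum(
        weights[i] * weights[j] * cov[i][j]
        for i in range(k)
        for j in range(k)
    )


def effective_weights(weights, cov):
    """Share of composite variance attributable to each component.

    Component i contributes w_i * Cov(X_i, C) / Var(C), where
    Cov(X_i, C) = sum_j w_j * cov[i][j]. The shares sum to 1.
    """
    k = len(weights)
    var_c = composite_variance(weights, cov)
    return [
        weights[i] * sum(weights[j] * cov[i][j] for j in range(k)) / var_c
        for i in range(k)
    ]


weights = [0.6, 0.4]  # nominal weights (illustrative only)
independent = [[1.0, 0.0], [0.0, 1.0]]  # uncorrelated components
correlated = [[1.0, 0.7], [0.7, 1.0]]   # components correlated at 0.7

print(effective_weights(weights, independent))
print(effective_weights(weights, correlated))
```

The same nominal weights yield different variance shares under the two covariance matrices, which is why the weights alone do not describe how much each component drives the composite.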

9.2 IRT-Based Composites

As an alternative to establishing composite scores via a linear combination of component scores, composite scales can be established within a latent variable framework. If the underlying construct is essentially unidimensional (Stout 1987; Nandakumar 1991), the item responses from all the component measures can be calibrated concurrently to create a composite scale. This is the approach most commonly used for large-scale educational assessments. On the other hand, if the components measure different but related facets of the construct, component scores and scales can be modeled using a factor analytic method like multidimensional item response theory (MIRT; Reckase 2009) and then projected onto a composite scale (Strachan et al. 2021; Reise et al. 2025). Note that the projection method based on MIRT parameters is not entirely dissimilar to what happens in the unidimensional case. The key distinction is that if the data are fit using a unidimensional model, the underlying factors (or components) are projected onto a single scale as a weighted linear composite (Wang 1986; Reckase 2009), where the weights essentially maximize the variance explained by the composite scores. The projective IRT approach, on the other hand, provides a separate projection for each item rather than a common weight for each component.

9.3 ROAR Composite

The overall ROAR composite uses a combination of IRT and a weighted sum approach. The individual scales for ROAR-Letter, ROAR-Word, and ROAR-Phoneme are based on a modified Rasch model, whereas the scale for ROAR-Sentence is a simple sum score. An examination of the multidimensional structure of these four measures suggests that they measure separate but related facets of the construct (link to tech report section).

As a first step in developing an overall ROAR composite, we created an IRT-based composite using the Letter, Word, and Phoneme measures. Several potential composites were considered: projective IRT composites based on a simple structure (confirmatory factor analysis) model and a bifactor model, and unidimensional composites derived from concurrently calibrated items using a modified Rasch model and a 2PL model. The scores for each composite were compared in terms of reliability, predictive validity, and classification accuracy (overall and at each grade level). Based on these comparisons, we determined that the modified Rasch model provided the best indicator of student performance for Letter, Word, and Phoneme. Note that this approach provides composite scores even if students do not take all three measures.

With the IRT-based composite in place, we moved to the integration of ROAR-Sentence scores. Since ROAR-Sentence is not often administered in earlier grades, composite scores that include ROAR-Sentence as a component will likely include missing data. Using all available data, Sentence scores were imputed using K-nearest neighbor (KNN; Troyanskaya et al. 2001) to provide a complete set of results to include in a simple weighted composite. The actual and imputed Sentence scores were evaluated in combination with the modified Rasch composite scores using a principal components analysis (PCA). This provided the weights (the eigenvector for the first principal component) for the IRT-based composite and Sentence scores. Note that these weights are not re-estimated each time overall composite scores are computed; rather, they are used with the IRT-based composite scores and Sentence scores (or imputed Sentence scores) as a simple weighted sum.
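The sequence described above (impute missing Sentence scores, derive weights from the first principal component, then reuse those weights as a fixed weighted sum) can be sketched as follows. Several details are assumptions made for illustration: the data are invented, the only predictor used for the nearest-neighbor search is the IRT-based composite, \(k = 2\) neighbors are used, the PCA is run on the covariance matrix of the scores as given, and the eigenvector is found by power iteration. The source does not specify these choices.

```python
import math
from statistics import mean


def knn_impute(predictor, target, k=2):
    """Fill None entries in target with the mean target value of the k
    students (with observed targets) whose predictor value is closest."""
    observed = [(p, t) for p, t in zip(predictor, target) if t is not None]
    filled = []
    for p, t in zip(predictor, target):
        if t is None:
            nearest = sorted(observed, key=lambda pt: abs(pt[0] - p))[:k]
            t = mean(v for _, v in nearest)
        filled.append(t)
    return filled


def covariance_matrix(columns):
    """Sample covariance matrix for a list of equal-length score lists."""
    n = len(columns[0])
    means = [mean(c) for c in columns]
    return [
        [sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
         for cb, mb in zip(columns, means)]
        for ca, ma in zip(columns, means)
    ]


def first_principal_component(cov, iterations=500):
    """Leading unit eigenvector of cov via power iteration.

    Starts from a vector of ones, which is adequate when the components
    are positively related, as assumed here.
    """
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        nxt = [sum(row[j] * vec[j] for j in range(len(vec))) for row in cov]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec


# Hypothetical data: IRT-based composite and Sentence sum scores.
irt = [-1.2, -0.5, 0.0, 0.4, 0.9, 1.5]
sentence = [None, 12.0, 20.0, None, 31.0, 40.0]

filled = knn_impute(irt, sentence, k=2)

# Estimate the weights once, then treat them as fixed constants.
weights = first_principal_component(covariance_matrix([irt, filled]))

# Later scoring runs reuse the stored weights as a simple weighted sum.
overall = [weights[0] * t + weights[1] * s for t, s in zip(irt, filled)]
print(filled, weights, overall)
```

Storing `weights` and reusing them mirrors the point that the weights are not re-estimated at scoring time; only the weighted sum is recomputed for new scores.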

9.4 Operational IRT Calibration

The operational composite calibration in this section is the foundational skills IRT composite based on pooled ROAR-Letter, ROAR-Phoneme, and ROAR-Word item responses. This section fits the pooled IRT model, saves the calibrated item parameters, and writes the model object used later by the reliability section.

9.5 IRT Model

The composite calibration sample included 34,472 participants and 930 pooled items from the Letter, Phoneme, and Word measures.

The fitted composite model is then used to estimate the participant’s foundational skills composite Scaled Score (\(\theta\)), placing pooled performance from the component measures onto a common latent continuum.

9.6 Scoring

For the foundational skills portion of ROAR-Composite, pooled item responses from ROAR-Letter, ROAR-Phoneme, and ROAR-Word are placed on a common latent scale using a modified Rasch model. The model fixes the guessing parameter (lower asymptote) \(g\) by measure: Letter items use \(g = 0.25\), Phoneme items use \(g = 0.33\), and Word items use \(g = 0.50\). For the composite calibration, all item parameters (Letter, Phoneme, and Word) are fixed to the values from the operational composite calibration so that this run inherits the established item calibration rather than re-estimating any items.
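A minimal sketch of scoring under fixed item parameters follows. It assumes the modified Rasch model has the form \(P(X = 1 \mid \theta) = g + (1 - g) / (1 + e^{-(\theta - b)})\), with \(b\) the item difficulty and \(g\) fixed by measure as stated above. The expected a posteriori (EAP) estimator, the standard normal prior, the grid, and the item difficulties are assumptions chosen for illustration; the source does not state which estimator is used operationally.

```python
import math

# Fixed lower asymptotes by measure, as given in the text.
GUESSING = {"letter": 0.25, "phoneme": 0.33, "word": 0.50}


def p_correct(theta, b, g):
    """Rasch model with a fixed lower asymptote g (assumed functional form)."""
    return g + (1.0 - g) / (1.0 + math.exp(-(theta - b)))


def eap_theta(responses, lo=-6.0, hi=6.0, n_points=121):
    """Expected a posteriori theta under a standard normal prior.

    responses: list of (measure, difficulty, score) tuples, score in {0, 1}.
    Items a student did not take are simply absent from the list, so a
    score is available even when a measure was not administered.
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    posterior = []
    for theta in grid:
        log_post = -0.5 * theta * theta  # log N(0, 1) prior, up to a constant
        for measure, b, score in responses:
            p = p_correct(theta, b, GUESSING[measure])
            log_post += math.log(p if score == 1 else 1.0 - p)
        posterior.append(math.exp(log_post))
    total = sum(posterior)
    return sum(t * w for t, w in zip(grid, posterior)) / total


# Hypothetical student who took Letter and Word items but no Phoneme items.
# Difficulties are made up; in practice they would be read from the stored
# calibration rather than re-estimated.
student = [
    ("letter", -1.0, 1),
    ("letter", 0.0, 1),
    ("word", 0.5, 1),
    ("word", 1.5, 0),
]
print(eap_theta(student))
```

Because the item parameters enter only as fixed inputs, the same function scores any subset of the pooled items onto the common \(\theta\) scale.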

Reliability estimates and subgroup reliability tables for the composite score are reported in Chapter 20.

References

Lord, Frederic M., and Melvin R. Novick. 1968. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Nandakumar, Ratna. 1991. “Traditional Dimensionality Versus Essential Dimensionality.” Journal of Educational Measurement 28 (2): 99–117.
Nunnally, Jum C., and Ira H. Bernstein. 1994. Psychometric Theory. 3rd ed. New York: McGraw-Hill.
Reckase, Mark D. 2009. Multidimensional Item Response Theory. New York: Springer. https://doi.org/10.1007/978-0-387-89976-3.
Reise, Steven P., Jared M. Block, Maxwell Mansolf, Mark G. Haviland, Benjamin D. Schalet, and Rachel Kimerling. 2025. “Using Projective IRT to Evaluate the Effects of Multidimensionality on Unidimensional IRT Model Parameters.” Multivariate Behavioral Research 60 (2): 345–61.
Stout, William. 1987. “A Nonparametric Approach for Assessing Latent Trait Unidimensionality.” Psychometrika 52 (4): 589–617.
Strachan, Tyler, Uk Hyun Cho, Kyung Yong Kim, John T. Willse, Shyh-Huei Chen, Edward H. Ip, Terry A. Ackerman, and Jonathan P. Weeks. 2021. “Using a Projection IRT Method for Vertical Scaling When Construct Shift Is Present.” Journal of Educational Measurement 58 (2): 211–35.
Troyanskaya, Olga, Michael Cantor, Gavin Sherlock, Patrick Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. 2001. “Missing Value Estimation Methods for DNA Microarrays.” Bioinformatics 17 (6): 520–25. https://doi.org/10.1093/bioinformatics/17.6.520.
Wang, Margaret M. 1986. “Fitting a Unidimensional Model to Multidimensional Item Response Data: The Effects of Latent Space Misspecification on the Application of IRT.”