RTCOM Briefs

Brief 4: The Basics of Psychometrically Sound HCBS Outcome Measurement

Introduction

Within each state, Home and Community-Based Services (HCBS) operate within allocated budgets. After being declared eligible to receive HCBS, many individuals with disabilities, including those with the most extensive support needs (e.g., people with intellectual and developmental disabilities, or IDD), find themselves on waitlists due to insufficient resources to cover services (Larson et al., 2018). This problem extends to persons with physical, age-related, and psychiatric disabilities as well as traumatic brain injury (KFF Medicaid HCBS Program Surveys, FY 2018). In some states (e.g., New Mexico, Oklahoma), there are more eligible people with IDD on current waitlists than there are receiving services. Further, people who receive services to increase their quality of life often experience less than satisfactory outcomes due to a system that receives inadequate funding, resulting in a lack of stability in the direct care workforce (i.e., high turnover of direct support professionals and many unfilled positions; Houseworth et al., 2020). Within such a system, providing high-quality evidence of the outcomes associated with services and supports to policymakers, provider organizations, people with disabilities, and their families is critical. Such information helps us understand the extent to which HCBS are fulfilling their intended functions. These include determination of: (1) the extent to which states are in compliance with the HCBS final settings rule; (2) the effectiveness of the system in making equitable resource allocation decisions in terms of services and supports; (3) areas in which steps need to be taken by providers to support improvements in the quality of services they provide to HCBS beneficiaries; and (4) the specific service providers that people with disabilities choose to provide them with services. This information at local, state, and national levels is crucial to the lives of millions of persons with disabilities, supporting not only their self-determination but also their overall quality of life.

It is paramount that evidence about the extent to which HCBS are having their intended impact is based on information/data that are psychometrically sound. By this we mean that there is high-quality research evidence available to support the argument that the data derived from the HCBS outcome measurement tools in use are both reliable and valid (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014). Without sufficient evidence supporting the reliability and validity of measures, decision-making fails to be adequately informed. Responsible federal and state agencies cannot make accurate judgments about the degree to which states and provider organizations are in compliance with existing regulations. States cannot make informed decisions about appropriate funding allocations and areas in need of program development. Provider agencies are left in the dark about the areas in which they need to innovate to improve services. At the individual level, HCBS beneficiaries may be left with information that is inaccurate and potentially misleading when it comes to making decisions about which provider to select.

Figure 1. Reliability and Validity

Within the context of Home and Community-Based Services, psychometrically-sound measurement provides consistent and accurate information about the quality of supports and the outcomes people with disabilities experience. Figure 1 illustrates the differences that occur when measures vary with respect to their reliability and validity. When measures are both consistent and accurate (i.e., reliable and valid), high-quality decisions can be made.

HCBS Measurement and Psychometrics

HCBS measurement refers to a variety of efforts at federal, state, provider, and individual levels to evaluate the extent to which services and supports are of high quality and effective in providing people with disabilities with a high quality of life. This includes opportunities to engage in meaningful activities, develop meaningful social relationships, exercise choice and control over things of importance (including their services), and have access to meaningful employment (NQF, 2016). When considering HCBS outcome measurement, people with disabilities need to be the primary “informed respondents”. This means that they should be the most important group considered when undertaking measure development (to ensure we are measuring things important to them). By directly involving service recipients throughout the measurement process, developers have the potential to validate measure format and content during the development process, rather than retrospectively.

In addition to ensuring that the recipients of services themselves are involved in all aspects of measure development, the discipline of psychometrics provides clear, specific guidelines for the development and testing of measures to determine their quality. Psychometrics is a discipline focused on the development of assessment tools and measures that are designed to connect observable phenomena (e.g., responses to items in a measure of social connectedness) to theoretical constructs (e.g., social inclusion; Borsboom, 2006). In HCBS, psychometrically-sound measurement provides consistent information from people with disabilities and accurate information about people with disabilities and their outcomes, as outlined by the most recent National Quality Forum (NQF) Framework for HCBS Outcome Measurement (NQF, 2016; RTC/OM, 2016). Reliability and validity (i.e., consistency and accuracy) represent two critical, inter-related components of psychometrics (see Figure 1).

Reliability

Reliability refers to the degree of consistency of information gathered using an assessment tool or a measure. One goal of reliability is for respondents to answer the same questions in a similar manner within a relatively short window of time (i.e., 7-14 days; Weir, 2005). If a person responds differently within that period (during which there is unlikely to have been a significant change in the person’s experiences), the measure may not be reliable, or the construct in question may possess an extremely low level of stability (i.e., it is liable to change on a momentary basis). This form of consistency is referred to as test-retest reliability. Measure developers strive to establish a balance between test-retest reliability and a measure being sufficiently sensitive to change that it is able to detect real changes in outcomes as they take place over time.

A second goal for a measure is that it should produce the same or similar data when it is administered to the same individual by more than one administrator. If two interviewers administer the same measure to the same individual within a short time period and the person’s responses are different, the measure would be considered to have low inter-rater or inter-observer reliability.

Internal consistency reliability is a measure based on the correlations that exist between different items on the same measure (or the same subscale on a larger measure). It assesses the extent to which multiple items that propose to measure the same general construct (e.g., social connectedness) produce similar scores. For example, if a respondent expressed agreement with the statements "I like to spend time with my family" and "I've enjoyed spending time with my family members in the past", and disagreement with the item "I do not like members of my family," this would be indicative of good internal consistency of the measure. Although high test-retest and inter-rater reliabilities are typically viewed as a positive characteristic of a measure, extremely high internal consistency reliability (0.95 or higher) indicates that the items of which a measure is composed are redundant or are measuring the same thing and therefore are not all needed. The goal in designing a reliable instrument is for scores on similar items to be related (internally consistent), but for each to contribute some unique information to understanding the construct under consideration.

There can be multiple threats to the reliability of a measure. Within HCBS measurement, reliability can be reduced by disability-specific factors (e.g., mood or memory fluctuations). Reliability can also be compromised by the failure of developers to account for the cultural and linguistic backgrounds and differences of the individuals with disabilities with whom the measures are intended for use. Interviewers may also fail to ask questions in the manner intended by the measure developers. A fourth threat to reliability comes from poorly designed items. Items that are difficult to understand because of their complexity or the language that is used, that are “double-barreled” in asking the respondent to answer two questions within one item (e.g., “Support staff listen to you and act on your requests”), or that provide poor response options can all lead to decreased reliability. If inadequate reliability is detected, individual items and administration procedures need to be examined to reduce threats to measure reliability. In general, unreliable information leads to decisions that are not fully informed and may lack empirical evidence.

Establishing the Reliability of a Measure

When new measures are developed, it is critical to determine various forms of their reliability. Since language and the social/cultural context change over time, periodic reassessment of reliability should also be undertaken as part of measure maintenance.

To estimate the reliability of newly developed measures of HCBS outcomes, three approaches are typically used. The first type, test-retest reliability, is established by asking participants to respond to the same measure(s) administered in the same way within a short period of time, typically 7-14 days. Retesting participants within this timeframe minimizes the potential occurrence of major changes in their daily life and attitudes that may affect an individual’s responses. The answers of respondents for each administration of the measure are assessed for consistency (reliability) using a variety of statistical techniques. If major changes in participant responses occur, they may be due to issues with the items and how they are worded, inadequate response options, items that are too difficult for participants to understand, fatigue, or constructs being measured that have been defined in such a way that they are highly unstable. If items or whole measures are unreliable, items and/or response options need to be modified or replaced.
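
To make this concrete, the sketch below (in Python, with purely illustrative data and variable names) shows one simple way consistency across two administrations could be quantified, using a Pearson correlation between time-1 and time-2 scores. Weir (2005) describes the intraclass correlation coefficient (ICC), which is often preferred because it also reflects systematic shifts between administrations; this is only a minimal illustration, not a prescribed procedure.

```python
import numpy as np
from scipy import stats

# Hypothetical total scores for the same ten respondents on two
# administrations of the same measure, 7-14 days apart (illustrative only).
time1 = np.array([18, 22, 25, 30, 14, 27, 21, 19, 24, 28])
time2 = np.array([19, 21, 26, 29, 15, 26, 22, 18, 25, 27])

# A simple index of test-retest consistency: the Pearson correlation between
# the two administrations (values near 1.0 suggest stable responding).
r, p_value = stats.pearsonr(time1, time2)
print(f"Test-retest correlation: r = {r:.2f}")
```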

The second type of reliability, inter-rater/observer reliability, also needs to be examined when a new measure is developed. This form of reliability involves two raters/interviewers independently recording the responses of the respondent. In the case of interviews or surveys, this typically entails one individual asking the questions/items and both that individual and a second observer recording the responses of the person being interviewed (Saal, Downey, & Lahey, 1980). The two sets of recorded responses are compared to examine whether observers/interviewers differ in the responses they recorded to the questions asked of the respondent. If differences occur, especially if this is consistently the case across different pairs of interviewers, it indicates that the items/measure may not have been well constructed. Similar to test-retest reliability, low agreement may be due to questions that are not worded clearly, complex language, unclear response options, and so on. There can also be issues with administration procedures, including unclear or complex directions or the length of the measure and administration time, which can consequently be modified to improve reliability.
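
As an illustration of how such agreement might be summarized, the sketch below (using hypothetical ratings) computes simple percent agreement between two observers and Cohen's kappa, which corrects that agreement for chance. The data and the choice of kappa here are illustrative assumptions, not an RTC/OM procedure.

```python
import numpy as np

# Hypothetical responses recorded by two observers for the same interview,
# coded on a 4-point scale (1-4); illustrative data only.
rater_a = np.array([3, 4, 2, 4, 1, 3, 3, 2, 4, 1])
rater_b = np.array([3, 4, 2, 3, 1, 3, 2, 2, 4, 1])

# Percent agreement: the proportion of items on which the two records match.
observed = np.mean(rater_a == rater_b)

# Cohen's kappa: agreement corrected for the agreement expected by chance,
# based on how often each rater used each response category.
categories = np.union1d(rater_a, rater_b)
p_a = np.array([np.mean(rater_a == c) for c in categories])
p_b = np.array([np.mean(rater_b == c) for c in categories])
expected = np.sum(p_a * p_b)
kappa = (observed - expected) / (1 - expected)

print(f"Percent agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```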

The third form of reliability that needs to be examined is the internal consistency of a measure. Internal consistency refers to the degree to which a group of items are responded to in a consistent manner (Streiner, 2003). For example, for a measure of transportation quality to have high internal consistency, respondents need to respond consistently across items designed to address the same or similar construct. If respondents consistently answered positively to both the items “I am happy with my transportation options” and “The transportation I have meets my needs”, the two items would be viewed as demonstrating high internal consistency. At the measure level, internal consistency increases as items relate more strongly to one another, or as more items targeting a similar topic are added to the measure. The RTC/OM used Cronbach’s alpha, the standard statistic for estimating this type of reliability (Revelle & Condon, 2019). If Cronbach’s alpha is low, this indicates that scores on the measure may not be reliable and that scores across items are not closely related to one another.
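
For readers who want to see the calculation, the sketch below implements the standard Cronbach's alpha formula on a small matrix of hypothetical item responses; the data are illustrative only and are not drawn from any RTC/OM measure.

```python
import numpy as np

# Hypothetical item responses: rows are respondents, columns are the items
# of a single measure or subscale (e.g., 5-point agreement ratings).
items = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])

k = items.shape[1]                              # number of items
item_variances = items.var(axis=0, ddof=1)      # variance of each item
total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance).
# Low values suggest the items do not hang together; values above ~.95 may
# signal redundant items.
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")
```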

An alternative way of thinking about internal consistency is that it is the extent to which all of the items of a test measure the same latent variable. Latent variables (e.g., an HCBS outcome) are variables that are not directly observed but rather inferred from other variables that are directly measured. The ideal of measurement is for all items of a measure to assess the same latent variable. A second statistic, coefficient omega, may therefore be a more appropriate index of the extent to which items in an assessment measure the same latent variable (Revelle & Zinbarg, 2009).
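
Coefficient omega requires estimates of each item's loading on the latent variable. The sketch below is one rough, assumption-laden way to obtain such estimates, fitting a one-factor model with scikit-learn on simulated item responses; dedicated psychometric software (e.g., the psych package in R) computes omega with more care, so treat this only as an illustration of the formula.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate item responses driven by a single latent variable (illustrative).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
true_loadings = np.array([[0.8, 0.7, 0.6, 0.75]])
items = latent @ true_loadings + rng.normal(scale=0.5, size=(200, 4))

# Standardize the items and fit a one-factor model.
z = (items - items.mean(axis=0)) / items.std(axis=0, ddof=1)
fa = FactorAnalysis(n_components=1).fit(z)
loadings = fa.components_[0]       # estimated loading of each item on the factor
uniquenesses = fa.noise_variance_  # item variance not explained by the factor

# Coefficient omega for a unidimensional measure:
# (sum of loadings)^2 / [(sum of loadings)^2 + sum of unique variances]
omega = loadings.sum() ** 2 / (loadings.sum() ** 2 + uniquenesses.sum())
print(f"Coefficient omega (one-factor approximation) = {omega:.2f}")
```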

Validity

As Sireci and Padilla (2014) state, validation refers to the “process of gathering, evaluating, and summarizing evidence to support the use of an assessment instrument for its intended purposes” (p. 97). A measure that is valid possesses evidence that the information it provides can be used for its intended purpose. A measure or test, in and of itself, is not valid or invalid. We do not validate measures, tests, or assessments. Rather, we validate the proposed interpretations and uses of measure results. Validation is therefore the collection of evidence to support interpretations and uses of measure results (Haladyna & Rodriguez, 2013; Kane, 2013). A measure that has been shown to provide accurate and useful information for states to use in making HCBS policy decisions cannot be assumed to be accurate and useful when employed by provider agencies to make decisions about programming. In a similar manner, a measure that was quite accurate in detailing the changes that occurred in the lives of people with intellectual and developmental disabilities when they moved from institutional to community settings cannot be assumed to be valid if used to make decisions about which community-based residential service agencies provide the best services and supports. Validity is a matter of degree and is established for a specific context. A measure with more high-quality evidence of its validity is more accurate and applicable within the context where the evidence was gathered and for the purposes for which it is intended (AERA, APA, & NCME, 2014).

There are numerous types of validity, and evidence for each form tends to come from different sources. Content validity refers to evidence that a measure is representative of, or covers all aspects of, the construct or domain one is attempting to assess. This is sometimes referred to as construct saturation. In order to be able to use the results of a measure of community inclusion in a valid manner, for example, the measure for this construct would need to include most, if not all, aspects of how people experience community inclusion. This would include, but is not limited to, the extent to which the person is able to access the community with the frequency that they desire, the degree to which they are able to engage in preferred activities that they view as meaningful, their sense of belonging while in the community, the social support they receive and are able to give to others, etc. A measure of community inclusion that focused solely on the frequency of various community activities would have lower content validity than one that encompasses a greater number of aspects of inclusion.

A second type of validity is referred to as internal structure validity. When establishing this argument for validity, measure developers look for the extent to which the pattern of empirical relationships among a set of items matches the structure proposed by the measure developer’s theory. There are three basic aspects of internal structure: dimensionality, measurement invariance, and reliability. When assessing dimensionality, measure developers attempt to establish the extent to which the inter-relationships that exist among the items in the measure support the measure scores that are intended to be used to draw inferences. Measures that are designed to result in a single composite score should be unidimensional. Measurement invariance refers to a statistical property of a measure indicating that it measures the same construct across different groups of interest or across time (Vandenberg & Lance, 2000). For example, measurement invariance can be used to study whether a given measure is interpreted in a conceptually similar manner by respondents representing different disabilities, genders, or cultural backgrounds. When the relationships that exist between scale items and subscales and the underlying construct are the same across groups and/or time, a measure can be said to have adequate measurement invariance (Rios & Wells, 2014).

Rios and Wells (2014) describe a variety of statistical approaches that can be used to provide such evidence, focusing on the usefulness of confirmatory factor analysis (CFA) for assessing the extent to which the structure of an assessment (i.e., how the items “fit” with each other) matches the theory behind the measure. Internal structure validity is likely to be higher when measure developers generate their assessment items based upon existing, research-supported conceptual frameworks related to the construct in question. Unfortunately, this has rarely been done in the development of HCBS outcome measures, with the majority having little to no theoretical underpinnings.

A third form of validity is referred to as criterion or criterion-related validity. This aspect of validity provides an estimate of the degree to which one measure predicts an outcome for another measure or performance in another situation. Criterion validity is often established through an evaluation of how closely the results of the measure in question correspond to the results of a different measure that serves as the criterion. This criterion needs to be a well-established and/or widely used measure that possesses a considerable body of evidence supporting interpretations and uses of its results (Haladyna & Rodriguez, 2013). The criterion in question can be in the past (postdictive validity), present (concurrent validity), or future (predictive validity). For example, an assessment of caregiver competencies strongly related to current satisfaction with services and supports on the part of HCBS recipients would provide evidence of good concurrent validity. If that same measure was found to also be closely associated with future outcomes with respect to social connectedness, employment success, and/or community inclusion, this would be considered evidence of good predictive validity.
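
Continuing the caregiver-competency example, the sketch below (with hypothetical scores and variable names) shows how concurrent and predictive validity evidence are commonly summarized as correlations between the new measure and criterion measures collected now or later; it is an illustration of the idea, not a specific validation design.

```python
import numpy as np
from scipy import stats

# Hypothetical scores: a new caregiver-competency measure, a concurrent
# criterion (current satisfaction with services), and a future outcome
# (community inclusion a year later). All data are illustrative.
competency = np.array([10, 14, 9, 18, 12, 16, 11, 15, 13, 17])
satisfaction_now = np.array([3, 4, 3, 5, 3, 4, 3, 4, 4, 5])
inclusion_later = np.array([20, 26, 18, 33, 22, 29, 21, 27, 24, 31])

# Concurrent validity: association with a criterion measured at the same time.
r_concurrent, _ = stats.pearsonr(competency, satisfaction_now)
# Predictive validity: association with an outcome measured at a later date.
r_predictive, _ = stats.pearsonr(competency, inclusion_later)

print(f"Concurrent r = {r_concurrent:.2f}, predictive r = {r_predictive:.2f}")
```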

The difficulty faced in establishing the criterion-related validity of HCBS outcome measures is that there are currently few measures, if any, that have sufficient evidence to support their interpretation and use. In the absence of such a “criterion,” measure developers are often left to validate interpretations and uses of new measures against those that have little evidence to support their validity other than that they have been used for long periods of time.  Unfortunately, there is no such construct as “longevity validity” in measurement theory or psychometrics. As a result, there currently exists a need for measure developers to look at alternative ways to establish this form of validity through research efforts.  It should be noted that the validity of a measure is not ascertained on the basis of a single study or research effort, but rather, an accumulation of evidence over time related to a variety of sources of validity.

A fourth type of validity is referred to as construct validity. Construct validity reflects the appropriateness of inferences made on the basis of measurements, specifically whether an assessment measures the intended construct(s). A good measure would have been developed based on a theory or a body of research findings that support the existence of a certain construct (e.g., employment). In addition, when other forms of validity (i.e., content, internal structure, and criterion) converge to support the cohesiveness and theoretical soundness of the measure, we can describe the measure and its results as possessing construct validity. At the same time, the results of a good measure need to be able to discriminate between the construct intended to be measured and other unrelated constructs. Achieving construct validity for a measure can take years, if not decades, as it requires extensive data from a representative pool of participants, existing peer-reviewed literature supporting the construct, and the availability of a high-quality criterion within the same measurement domain.

A number of strategies can be used to support validity during the measure development process. Content validity can be enhanced by ensuring that the framework one is using (i.e., what the most important things to measure are) has been vetted by representative groups of stakeholders. When developing RTC/OM measures, the first step taken was to accumulate evidence with respect to the content validity of the National Quality Forum’s HCBS Outcome Measurement Framework (2016) by conducting a series of participatory planning and decision-making (PPDM) groups (Abery & Anderson, 2009; RTC/OM, 2016) with a nationally representative group of stakeholders. These groups included HCBS recipients with a variety of disabilities, family members, policy makers, program administrators, and service providers. Through these groups, stakeholders were able to express their opinions with respect to the importance of measuring different NQF domains and subdomains. They also addressed the operational definitions of each domain and subdomain and had input into additional areas to add to or remove from the framework. This process was a first step in leading to the development of measures that reflected not only the NQF framework but also the perspectives of stakeholders as to the most important outcomes to measure.

A second step that can be taken to support content validity involves developing an in-depth understanding of current theory and research related to the constructs that will be measured. This step in the measure development process is critical to ensure that the measures developed will include all relevant, critical aspects of the constructs of interest. As part of the RTC/OM measure development process, systematic reviews of the existing theory and research within each of the domains and subdomains for which measures were to be developed were undertaken. This step in the development process increased the likelihood that the measures being developed adequately saturated the constructs of interest. Based on an examination of the existing literature, for example, the measure of social connectedness that was developed included not only items related to a person’s friendship network but also questions pertinent to the development of social capital and opportunities for both receiving and providing social support to others, as both have been demonstrated to contribute to the degree to which people feel socially connected to their families, friends, and community.

The use of technical expert panels (TEPs) is a third strategy that can be used to support the content validity of measures under development. These panels need to include not only “experts” in the constructs for which measures are being developed, but also individuals (e.g., persons with disabilities, family, providers) for whom the measures are intended and whose lives they may potentially impact. Following the development of the initial set of items for RTC/OM measures, technical expert panels were utilized to provide feedback on measure concepts and item content. A group of HCBS content and measurement experts, people from five different disability groups, HCBS providers, and family members examined and rated the relevancy, clarity, and importance of each item associated with the concept being measured as well as the understandability of each item for both interviewers and respondents. By involving a broad group of stakeholders in the field of HCBS measurement in this process, we increased the likelihood that the measures will be high in feasibility and usability and provide valid information about the quality of HCBS and the outcomes associated with it.

The measures currently under development by the RTC/OM are designed to assess multiple factors (or constructs) as outlined in the NQF’s HCBS Outcome Measurement Framework (2016). Factor analysis will determine whether items from a single measure are closely related to one another while, as a whole, remaining separate from the items that compose other measures. For example, internal structure will be established for the RTC/OM Transportation measure if the items that compose the measure can be explained by one or a few factors. This implies that the items of which that measure is composed will be more closely related to each other and less closely related to the items of which other measures are composed. Internal structure supports the claim that each measure assesses a unique factor (or set of factors) separate from other measures. This is important for validity purposes because it is evidence that each measure provides unique information about an HCBS domain or subdomain.

The establishment of criterion-related validity requires an evaluation of how closely the results of a measure under development correspond to the results of different, well-validated measures that serve as the criterion (Rodriguez, 2020). These comparisons can be made either at the same time or at a later date. When the Wechsler Intelligence Scales were revised (WISC-IV, WAIS-IV), for example, their criterion-related validity was supported by correlations with other global measures of intelligence and achievement, including earlier, well-validated versions of the intelligence scales (WISC-III, WAIS-III; Donders & Janke, 2008) as well as the Wechsler Individual Achievement Test (WIAT-II) with a variety of different subgroups, including persons with TBI, IDD, and ASD (Canivez, 2013; Konold & Canivez, 2010; Mayes & Calhoun, 2008).

Establishing adequate criterion-related validity in such a direct fashion is not possible at the present time when it comes to HCBS outcome measures, as there are no measures currently available that have sufficient evidence to serve as a criterion. Supporting the validity of new HCBS measures will therefore need to be more indirect and use a variety of methods in order to determine performance with respect to this psychometric property. Studies comparing new measures to existing HCBS outcome surveys (e.g., the CQL’s Personal Outcome Measures, Pennsylvania’s IM4Q) that have some validity evidence to support their use are a start. Additional research, however, will also be required to establish validity, such as comparing the outcomes of people with and without disabilities and of different disability populations to each other, and determining the degree to which a new measure is sensitive to change over time when planned (e.g., policy changes) and unplanned (e.g., during and after a pandemic) alterations occur in the environment. Comparisons across persons with disabilities who experience varying levels of support needs, as well as varying degrees of person-centeredness in their services and supports, would also provide indirect evidence of validity.

Measuring construct validity in HCBS can also be challenging. One way to approach the issue is through ascertaining the internal structure of the measures. This evaluation is typically undertaken using confirmatory factor analysis. Factor analysis is a statistical method that tests whether responses to a set of items can be explained by a few common causes referred to as “factors.” It is used to evaluate the extent to which the measures of a construct are consistent with the developer’s understanding of the nature of that construct (or factor). As such, the objective of confirmatory factor analysis is to test whether the data fit a hypothesized measurement model. In addition, other validity evidence related to the measure in question (e.g., content and criterion) needs to be gathered in support of the cohesiveness and theoretical soundness of the measure.
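
A full confirmatory factor analysis is usually run in structural equation modeling software (e.g., lavaan in R or semopy in Python), which tests a prespecified loading pattern and reports fit indices. As a rough, exploratory stand-in, the sketch below uses scikit-learn's maximum-likelihood factor analysis on simulated item responses to ask whether a one-factor model accounts for the item relationships about as well as a two-factor model; the data and the model-comparison shortcut are illustrative assumptions, not the RTC/OM analysis plan.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate responses to five items intended to measure a single construct
# (respondents x items); illustrative data only.
rng = np.random.default_rng(1)
factor = rng.normal(size=(300, 1))
items = (factor @ np.array([[0.7, 0.8, 0.6, 0.75, 0.65]])
         + rng.normal(scale=0.6, size=(300, 5)))
z = (items - items.mean(axis=0)) / items.std(axis=0, ddof=1)

# Compare the average log-likelihood of one- and two-factor models. If the
# one-factor model fits nearly as well, the data are consistent with the claim
# that the items reflect a single underlying factor.
for k in (1, 2):
    fa = FactorAnalysis(n_components=k).fit(z)
    print(f"{k}-factor model: mean log-likelihood = {fa.score(z):.3f}")
```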

Figure 2. Impact of Poor Measurement and HCBS System Entry

Figure 2 describes the potential impact that measurement of Home and Community-Based Services System eligibility and support needs can have on the Home and Community-Based Services System. When valid and reliable measurement tools and practices are used to assess eligibility and support needs, the result enhances the potential that the Home and Community-Based Services System at the federal, state, or provider level provides appropriate services and supports to recipients. Ultimately, this should lead to improvement in both the quality of services received and the personal outcomes experienced by individuals. Poor measurement of eligibility and support needs may lead to components of the Home and Community-Based Services System providing inappropriate services, including the provision of too much or too little support to a recipient. This can lead to declining outcomes for the individual and have a similar impact on HCBS system outcomes.

The other type of measure HCBS recipients encounter comprises those designed to obtain information about the quality of services they receive and the personal outcomes they experience. In this case, using a measure with poor (i.e., one that is not predictive of how a person is experiencing their life or that evidences insufficient variability) or unknown psychometric characteristics can result in information that may be inconsistent (unreliable) and not accurate or meaningful (invalid). Policy makers and program administrators who use the results from such measures have an increased probability of drawing erroneous conclusions/interpretations and making less than fully informed decisions about policies and programs (see Figure 3). An outcome domain that needs improvement may never be identified as needing change, or a policy/program that actually improved outcomes may no longer be funded because the “data” erroneously indicate that it has no positive impact.

The impacts of using measures that do not meet the technical psychometric requirements stipulated by the AERA, APA, & NCME (2014), whether for eligibility and services/support determination or with respect to HCBS outcome measurement, are reciprocal. Inadequately identifying the support needs of someone will likely affect their service quality and outcomes. Moreover, if an individual inappropriately qualifies for HCBS based on a measure with poor psychometrics, limited resources are compromised and services and supports needed by others may be reduced. Even further, if poor measurement practices are employed, the Centers for Medicare & Medicaid Services (CMS), state Departments of Human Services, and provider agencies will not have the quality of data needed in order to make informed decisions about compliance, policy, and programming. Decisions regarding quality improvement initiatives will be difficult given that those engaged in evaluation will not have the high-quality information they need to target areas in which services and supports, as well as the personal outcomes experienced by HCBS recipients, are less than satisfactory.

Figure 3. Potential Impact of Measure Psychometrics on HCBS Outcomes

Figure 3 shows the potential impact of quality and outcome measurement on the Home and Community-Based Services System. Valid and reliable measurement results in usable information for policy and decision makers. This information can be used to make improvements that are more likely to result in higher quality services and improved personal outcomes. When measures with low reliability and validity are used, inaccurate and inconsistent information results. Policy and decision makers who use this information are more likely to make poorly informed choices that may not result in improvements to the system or lead to positive, person-centered outcomes.

Conclusion

Two critically important types of psychometric evidence that should be carefully examined prior to selecting measures designed to evaluate outcomes associated with HCBS are reliability and validity. Reliability pertains to the degree of consistency of information gathered using an assessment tool or a measure. Validity refers to the degree to which a measure possesses evidence that the information it provides can be effectively used for its intended purpose(s). A measure or test, in and of itself, is not valid or invalid. Rather, we validate the proposed interpretations and uses of measure results.

Ultimately, whether a measure of HCBS outcomes is of “high quality” psychometrically is a matter of degree. There is neither a perfect measure nor a perfect interpretation of the information gathered. Crucially, flawed information derived from poorly validated measures can be worse than none at all. Developing high-quality measures of HCBS outcomes takes time, effort, and funding, but can ultimately result in savings in each of these areas in a resource-deficient system.

Recommendations

Recommendation 1: Sufficient training and resources need to be available to states and providers to educate their leadership with respect to the potential issues that exist with current measurement tools and the potential impact of the lack of psychometric evidence when data from these measures are used in decision-making.

Recommendation 2: HCBS policymakers, program administrators, and measure endorsement organizations should carefully review the psychometric properties of measurement tools and confirm there is sufficient evidence of reliability and validity to warrant their use. The quality of existing psychometric evidence supporting the use of measures should be considered first and foremost in endorsement and recommendation of existing tools.

Recommendation 3: The psychometric properties of newly developed HCBS outcomes measures should be thoroughly investigated by experts in the field and results published in both technical report formats and peer-reviewed journals prior to supporting their implementation as part of HCBS quality improvement and decision-making processes.

References

  • Abery, B. H., & Anderson, L. L. (2009). Identifying the critical factors associated with high quality healthcare coordination for adults with physical disabilities. (No. 1 Technical Report). Minneapolis, MN: University of Minnesota – Institute on Community Integration.

  • American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing (2014 ed.). Washington, DC: American Educational Research Association.

  • Bialosiewicz, S., Murphy, K., & Berry, T. (2013). Do our measures measure up? The critical role of measurement invariance. An Introduction to Measurement Invariance Testing: Resource Packet for Participants. Presented at the American Evaluation Association.

  • Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425–440.

  • Canivez, G. L. (2013). Incremental criterion validity of WAIS–IV factor index scores: Relationships with WIAT–II and WIAT–III subtest and composite scores. Psychological Assessment, 25(2), 484.

  • Donders, J., & Janke, K. (2008). Criterion validity of the Wechsler Intelligence Scale for Children–Fourth Edition after pediatric traumatic brain injury. Journal of the International Neuropsychological Society, 14(4), 651–655.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Houseworth, J., Pettingell, S., Kramme, J., Tichá, R., & Hewitt, A. (2020). Predictors of annual and early turnover among direct support professionals: National Core Indicators Staff Stability Survey. Journal of Intellectual and Developmental Disabilities. In press.

  • Kane, M. (2013). Articulating a validity argument. In The Routledge handbook of language testing (pp. 48–61). Routledge.

  • Konold, T. R., & Canivez, G. L. (2010). Differential relationships between WISC-IV and WIAT-II scales: An evaluation of potentially moderating child demographics. Educational and Psychological Measurement, 70(4), 613–627.

  • Larson, S. A., Eschenbacher, H. J., Anderson, L. L., Pettingell, S., Hewitt, A., Sowers, M., & Bourne, M. L. (2018). In-home and residential long-term supports and services for persons with intellectual or developmental disabilities: Status and trends through 2016. Minneapolis: University of Minnesota, Research and Training Center on Community Living, Institute on Community Integration.

  • Mayes, S. D., & Calhoun, S. L. (2008). WISC-IV and WIAT-II profiles in children with high-functioning autism. Journal of Autism and Developmental Disorders, 38(3), 428–439.

  • National Quality Forum (NQF). (2016). Quality in Home and Community-Based Services to Support Community Living: Addressing Gaps in Performance Measurement. Retrieved from http://www.qualityforum.org/Publications/2016/09/Quality_in_Home_and_CommunityBased_Services_to_Support_Community_Living__Addressing_Gaps_in_Performance_Measurement.aspx

  • Rehabilitation Research and Training Center on Outcome Measurement (RTC/OM). (2016). Stakeholder Input: Identifying Critical Domains and Subdomains of HCBS Outcomes. Retrieved from https://rtcom.umn.edu/phases/phase-1-stakeholder-input

  • Research and Training Center on HCBS Outcome Development. (2016). Establishing the content validity of the National Quality Forum’s framework for HCBS outcome measurement: A participatory decision-making approach (Technical Report #1). Minneapolis, MN: University of Minnesota – Institute on Community Integration, RTC/OM.

  • Revelle, W., & Condon, D. M. (2019). Reliability from α to ω: A tutorial. Psychological Assessment, 31(12), 1395–1411. https://doi.org/10.1037/pas0000754

  • Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74(1), 145–154.

  • Rios, J., & Wells, C. (2014). Validity evidence based on internal structure. Psicothema, 26(1), 108–116.

  • Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88(2), 413.

  • Sireci, S., & Padilla, J. L. (2014). Validating assessments: introduction to the special section. Psicothema, 26(1), 97–99.

  • Streiner, D. L. (2003). Starting at the beginning: an introduction to coefficient alpha and internal consistency. Journal of Personality Assessment, 80(1), 99–103.

  • Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70.

  • Weir, J. P. (2005). Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. The Journal of Strength & Conditioning Research, 19(1), 231–240.