How do we assess reliability and validity?

We can assess reliability in four ways:

  1. Test-retest reliability measures the consistency of a test by giving the same test twice to the same people and checking whether the scores agree.

Certain conditions need to hold when the measurement is repeated: the same location, a short interval between the two administrations, and the same administration procedures. In educational testing, however, this design raises a concern about the practice effect: examinees may score higher the second time simply because they have already seen the items.

The correlation between two sets of scores is used as the reliability index:

  • Pearson correlation can be used if assumptions are met.
  • Spearman’s rho (non-parametric; when data are not normal)
  • Kendall’s tau (non-parametric; when variables are at least ordinal)
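The three indices above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the scores in `time1` and `time2` are hypothetical, the helper names are made up for this example, and the Kendall function computes the simple tau-a variant, which assumes there are no tied scores.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) / 2)


# Hypothetical scores of eight examinees on two administrations of one test.
time1 = [12, 15, 9, 20, 17, 11, 14, 18]
time2 = [13, 14, 10, 19, 18, 9, 15, 17]

print("Pearson r     :", round(pearson(time1, time2), 3))
print("Spearman rho  :", round(spearman_rho(time1, time2), 3))
print("Kendall tau-a :", round(kendall_tau_a(time1, time2), 3))
```

All three coefficients are high for these made-up data, which is what a stable test would be expected to show; the rank-based indices are lower than Pearson's r here because a few adjacent examinees swap order between administrations.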

2. Parallel forms reliability

Parallel (also called alternate or equivalent) forms should be built from the same test specifications but contain different items. For instance, the forms can cover the same set of domains, use the same types of questions (e.g., multiple-choice or essay), or include corresponding questions at the same difficulty level.

The administration of the tests can be counterbalanced to minimize the variation due to environmental factors.

  • One subgroup takes Form A first, then Form B.
  • The other subgroup takes Form B first, then Form A.

The correlation between scores on the two forms is used as the reliability index.

3. Split-half reliability

The correlation between two separate half-length tests is used to estimate the reliability.

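  • For example, suppose the correlation between the two halves is .60. We can then compute the reliability of scores on the total test:

$$r_{XX'} = \frac{2\,r_{hh'}}{1 + r_{hh'}} = \frac{2(.60)}{1 + .60} = .75$$

  • This is the Spearman-Brown prophecy formula, where $r_{hh'}$ is the correlation between the two halves and $r_{XX'}$ is the estimated reliability of the full-length test.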
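The split-half procedure can be sketched as follows. The text does not say how the test is divided, so the odd/even item split used here is an assumption, and the item responses and function names are hypothetical. The correction step is the standard Spearman-Brown formula for doubling test length, 2r / (1 + r).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to full-test reliability."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """Return (half-test correlation, corrected full-test reliability).

    item_scores holds one row per person and one column per item.
    Assumption: the halves are the odd- and even-numbered items.
    """
    odd_half = [sum(row[0::2]) for row in item_scores]
    even_half = [sum(row[1::2]) for row in item_scores]
    r_half = pearson(odd_half, even_half)
    return r_half, spearman_brown(r_half)


# The example from the text: a half-test correlation of .60.
print(round(spearman_brown(0.60), 2))

# Hypothetical responses of six people to four right/wrong items.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
r_half, r_full = split_half_reliability(responses)
print(round(r_half, 3), round(r_full, 3))
```

Because the full test is twice as long as each half, the corrected estimate is always larger than the half-test correlation whenever that correlation lies between 0 and 1.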

4. Internal consistency reliability

The idea is that each item in a test can be considered a one-item test, so the total test of n items is seen as a set of n parallel tests. We then estimate the reliability from the consistency of each person’s performance from item to item. The estimate depends on:

  • Variance of the total test scores ($\sigma_X^2$)
  • Variance of the individual item scores ($\sigma_i^2$)
  • Number of items (n)

$$\alpha = \frac{n}{n-1}\left(1 - \frac{\sum_{i=1}^{n}\sigma_i^2}{\sigma_X^2}\right)$$

This is called coefficient alpha, also known as Cronbach’s alpha. Coefficient alpha is interpreted as the degree to which all of the items measure a common construct.
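As a rough sketch, coefficient alpha can be computed directly from the three ingredients listed above, using the standard formula alpha = n/(n-1) × (1 − sum of item variances / total-score variance). The rating data and the function name `cronbach_alpha` are hypothetical; the sample variance is used for both items and totals, and the result would be the same with the population variance as long as one choice is used throughout.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Coefficient alpha for rows of persons and columns of items."""
    n_items = len(item_scores[0])
    # Variance of each individual item across persons.
    item_variances = [
        variance([row[i] for row in item_scores]) for i in range(n_items)
    ]
    # Variance of the total test scores across persons.
    total_variance = variance([sum(row) for row in item_scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings of five people on four 1-5 items.
ratings = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [2, 2, 1, 2],
    [5, 4, 5, 5],
    [1, 2, 2, 1],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

When people who score high on one item also score high on the others, the total-score variance is much larger than the sum of the item variances, and alpha approaches 1.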

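The Kuder-Richardson Formula 20 (KR-20) is used when each item is scored dichotomously (either 0 or 1). In that case the item variance (that of a Bernoulli distribution) can be expressed as

$$\sigma_i^2 = p_i q_i$$

where $p_i$ is the proportion of correct responses to item i and $q_i = 1 - p_i$ is the proportion of incorrect responses.

Substituting this into coefficient alpha, the equation becomes:

$$\text{KR-20} = \frac{n}{n-1}\left(1 - \frac{\sum_{i=1}^{n} p_i q_i}{\sigma_X^2}\right)$$

where n is the number of items and $\sigma_X^2$ is the variance of the total test scores.

  • This is called Kuder-Richardson Formula 20 (KR-20).
  • Both coefficient alpha and KR-20 measure internal consistency.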
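The relationship between the two coefficients can be checked numerically. In this sketch the responses and function names are hypothetical. One detail is an assumption worth stating: p × q is the population variance of a 0/1 item (dividing by the number of people), so the total-score variance is computed with the population formula as well, which keeps the two coefficients exactly equal.

```python
from statistics import pvariance


def kr20(item_scores):
    """KR-20 for dichotomous (0/1) items; rows are persons, columns are items."""
    n_people = len(item_scores)
    n_items = len(item_scores[0])
    # p_i: proportion of correct responses to item i; q_i = 1 - p_i.
    p = [sum(row[i] for row in item_scores) / n_people for i in range(n_items)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)
    total_variance = pvariance([sum(row) for row in item_scores])
    return n_items / (n_items - 1) * (1 - sum_pq / total_variance)


def coefficient_alpha(item_scores):
    """Coefficient alpha using population variances throughout."""
    n_items = len(item_scores[0])
    item_variances = [
        pvariance([row[i] for row in item_scores]) for i in range(n_items)
    ]
    total_variance = pvariance([sum(row) for row in item_scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of six people to four right/wrong items.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
print(round(kr20(responses), 4))
print(round(coefficient_alpha(responses), 4))
```

The two printed values agree, which illustrates that KR-20 is coefficient alpha specialized to items scored 0 or 1.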


  • Standards (2014): Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests.

Validity takes different forms, so the intended uses of a test need to be justified from different aspects.

The three major ones:

  1. Content-related validity: Test content
  2. Criterion-related validity: Relations to other variables
  3. Construct-related validity: Internal structure

Content-Related Validity

  • It refers to an assessment of whether a test contains appropriate content and requires that appropriate processes be applied to that content. To either assess the content validity of an existing test or construct a test that measures a particular set of contents, we need a specific, explicit statement of what the test is intended to measure (a test blueprint).
  • A test blueprint (also called a table of specifications for the test) is an explicit plan that guides test construction, e.g., for an English literacy test. It includes:

➢ Description of content to be covered by the test.
➢ Specifications of cognitive processes in each content area.

Criterion-Related Validity

Criterion-related validity focuses on the degree to which a test correlates with some chosen criterion measure of the same construct (relations to other variables). There are two broad classes of this form of validity.

  • Predictive validity: whether the test information can be used to forecast future criterion performance.

Examples: using spelling test scores to predict reading test scores; the validity of SAT scores for predicting first-year grades, given high-school GPA.

  • Concurrent validity: whether the scores on the test correlate highly with scores obtained concurrently on another criterion measure.

Example: A new test vs. an old test measuring the same construct. Usually the scores on both tests are obtained at essentially the same time.

Construct-Related Validity

Construct validation requires collecting multiple types of evidence. Commonly used approaches to construct validation include:

  1. Provide correlational evidence showing that a construct has a strong relationship with certain variables and a weak relationship with other variables.

Valid measures of a construct should be strongly related to certain measures (convergent validity) and weakly related to others (discriminant validity). An explicit method for studying the patterns of high and low correlations among a set of measures is the analysis of the multitrait-multimethod (MTMM) matrix of correlations.
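A small numerical sketch may make the convergent/discriminant pattern concrete. Everything here is hypothetical: the three score lists are invented, and the scale names (a new anxiety scale, an established anxiety scale, and a vocabulary test) are only illustrative. It compares two correlations and is not a full MTMM analysis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same six people on three measures.
new_anxiety_scale = [10, 14, 8, 20, 16, 12]          # measure being validated
established_anxiety_scale = [11, 15, 9, 19, 15, 13]  # same construct
vocabulary_test = [30, 25, 28, 27, 31, 26]           # unrelated construct

convergent_r = pearson(new_anxiety_scale, established_anxiety_scale)
discriminant_r = pearson(new_anxiety_scale, vocabulary_test)

print("Convergent r  :", round(convergent_r, 2))
print("Discriminant r:", round(discriminant_r, 2))
```

In these made-up data the new scale correlates strongly with the measure of the same construct and only weakly with the measure of a different construct, which is the pattern that supports construct validity.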

  2. Show that certain groups obtain higher scores than other groups, with the high- and low-scoring groups being determined on logical grounds prior to the test administration. If a theory suggests that certain groups should possess an especially high or low level of a trait and, consequently, should score exceptionally high or low on a test measuring that trait, construct validity can be assessed based on predictions about group differences.
  3. Study the constructs that underlie performance (i.e., scores) on a test using factor analysis.

Factor analysis investigates construct validity by examining the internal structure of the construct. It investigates whether the items “hang together” to measure the construct. The two primary classes of factor analytic methods are exploratory factor analysis (EFA) and confirmatory factor analysis (CFA).

Exploratory factor analysis (EFA)

EFA explores factor structures without a consideration of the theoretical expectations of the researcher, even when such expectations are available.

  1. An exploratory tool to understand the underlying structure of a construct
  2. Explore the number of dimensions/factors underlying the performance (i.e., scores)
  3. Explore which set of items “hang together” to measure each dimension.

Confirmatory factor analysis (CFA)

CFA is used to validate a pre-specified structure and to quantify the fit of each model to the data. In EFA, a single model is tested, but CFA can readily be used to test several competing models and compare their fit. Researchers are strongly encouraged to test all plausible models using CFA and to report which model fits better than the others based on fit indices.

Useful resources:

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98-104.

Streiner, D. L. (2003). Being inconsistent about consistency: When coefficient alpha does and doesn’t matter. Journal of Personality Assessment, 80(3), 217-222.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Harcourt Brace.