## Chapter 5 Reliability

Too much consistency is as bad for the mind as it is for the body. Consistency is contrary to nature, contrary to life. The only completely consistent people are the dead.

— Aldous Huxley

Consistency is the hallmark of the unimaginative.

— Oscar Wilde

### Introduction

From the perspective of the creative thinker or innovator, consistency can be viewed as problematic. Consistent thinking leads to more of the same, as it limits diversity and change. On the other hand, inconsistent thinking or thinking outside the box produces new methods and ideas, inventions and breakthroughs, leading to innovation and growth.

Standardized tests are designed to be consistent, and, by their very nature, they are poor measures of creative thinking. In fact, the construct of creativity is one of the more elusive in educational and psychological testing. Although published tests of creative thinking and problem solving exist, administration procedures are complex and consistency in the resulting scores can be, unsurprisingly, very low. Creativity seems to involve an inconsistency in thinking and behavior that is challenging to measure reliably.

From the perspective of the test maker or test taker, which happens to be the perspective we’ll take in this course, consistency is critical to valid measurement. An inconsistent or unreliable test produces unreliable results that only inconsistently support the intended inferences of the test. Nearly a century of research provides us with a framework for examining and understanding the reliability of test scores, and, most importantly, how reliability can be estimated and improved.

This chapter introduces reliability within the framework of the classical test theory (CTT) model. In Chapter 8, we’ll learn about reliability within the item response theory (IRT) model. Both CTT and IRT involve measurement models, sometimes referred to as latent variable models, which are used to describe the construct or constructs assumed to underlie responses to test items.

This chapter starts with a general deﬁnition of reliability in terms of consistency of measurement. The CTT model and assumptions are then presented in connection with statistical inference and measurement models, as discussed in Chapter 0. Reliability and unreliability, that is, the standard error of measurement, are discussed as products of CTT. Finally, the four main study designs and corresponding methods for estimating reliability are reviewed.

### Consistency of Measurement

In educational and psychological testing, reliability refers to the precision of the measurement process, or the consistency of scores produced by a test. Reliability is a prerequisite for validity. That is, for scores to be valid indicators of the intended inferences or uses of a test, they must ﬁrst be reliable or precisely measured. However, precision or consistency in test scores does not necessarily indicate validity.

A simple analogy may help clarify the distinction between reliability and validity. If the testing process is represented by an archery contest, where the test taker, an archer, gets multiple attempts to hit the center of a target, each arrow could be considered a repeated measurement of the construct, archery ability. Imagine someone whose arrows all end up within a few millimeters of one another, tightly bunched together, but all stuck in the trunk of a tree standing behind the target itself. This represents consistent but inaccurate measurement. On the other hand, consider another archer whose arrows are scattered around the target, with one hitting close to the bullseye and the rest spread widely around it. This represents inconsistent and inaccurate measurement. Reliability and validity are both present only when the arrows are all close to the center of the target. In that case, we’re consistently measuring what we intend to measure.

A key assumption in this analogy is that our archers are actually skilled, and any errors in their shots are due to the measurement process itself. Instead, consistently hitting a nearby tree may be evidence of a reliable test given to someone who is simply missing the mark. In reality, if someone scores systematically oﬀ target or above or below their true underlying ability, we have a hard time attributing this to bias in the testing process versus a true diﬀerence in ability. A key point here is that evidence supporting the reliability of a test can be based on results from the test itself. However, evidence supporting the validity of the test must come, in part, from external sources. The only ways to determine that consistently hitting a tree represents low ability are to (a) conﬁrm that our test is unbiased and (b) conduct a separate test. These are validity issues, which will be covered in Chapter 9.

Consider the reliability of other familiar physical measurements. One common example is measuring weight. How can measurements from a basic ﬂoor scale be both unreliable and invalid? Think about the potential sources of unreliability and invalidity. For example, consider measuring the weight of a young child every day before school. What is the variable we’re actually measuring? How might this variable change from day to day? And how might these changes be reﬂected in our daily measurements? If our measurements change from day to day, how much of this change can be attributed to actual changes in weight versus extraneous factors such as the weather or glitches in the scale itself?

In both of these examples, there are multiple interrelated sources of potential inconsistency in the measurement process. These sources of score change can be grouped into three categories. First, the construct itself may actually change from one measurement to the next. For example, this occurs when practice eﬀects or growth lead to improvements in performance over time. Archers may shoot more accurately as they calibrate their bow or readjust for wind speed with each arrow. As we’ll discuss below, students may learn or at least be refreshed in the content of a test as they take it.

Second, the testing process itself could diﬀer across measurement occasions. Perhaps there is a strong cross-breeze one day but not the next. Or maybe the people oﬃciating the competition allow for diﬀerent amounts of time for warm-up. Or maybe the audience is rowdy or disagreeable at some points and supportive at others. These factors are tied to the testing process itself, and they may all lead to changes in scores.

Finally, our test may simply be limited in scope. Despite our best efforts, it may be that arrows differ from one another to some degree in balance or construction. Or it may be that archers’ fingers occasionally slip through no fault of their own. By using a limited number of shots, scores may change or differ from one another simply because of the limited nature of the test. Extending the analogy to other sports, football and basketball each involve many opportunities to score points that could be used to represent the ability of a player or team. On the other hand, because of the scarcity of goals in soccer, a single match may not accurately represent ability, especially when the referees have been bribed!

### Classical Test Theory

#### The model

Now that we have a deﬁnition of reliability, with examples of the inconsistencies that can impact test scores, we can establish a framework for estimating reliability based on the changes or variability that occur in a set of test scores. Recall from Chapter 0 that statistics allow us to make an inference from (a) the changes we observe in our test scores to (b) what we assume is the underlying cause of these changes, the construct. Reliability is based on and estimated within a simple measurement model that decomposes an observed test score ($X$) into two parts, truth ($T$) and error ($E$):

 $X=T+E$ (5.1)

Note that $X$ in Equation 5.1 is a composite consisting of two component scores. The true score $T$ is the construct we’re intending to measure. We assume that $T$ is impacting our observation in $X$. The error score $E$ is everything random that is unrelated to the construct we’re intending to measure. Error also directly impacts our observation.

To understand the CTT model in Equation 5.1, we have to understand the following question: how would an individual’s observed score $X$ vary across multiple repeated administrations of a test? If you took a test every day for the next 365 days, and somehow you forgot about all of the previous administrations of the test, would your observed score change from one testing to the next? And if so, why would it change?

CTT answers this question by making two key assumptions about the variability and covariability of $T$ and $E$. First, your true underlying ability $T$ is assumed to be constant. If your true ability can be expressed as 20 of 25 possible points, every time you take the test your true score will consistently be 20. It will not change. Second, any error that inﬂuences your observed score at a given administration is assumed to be completely random, and thus unrelated to your true score and to any other error score for another administration.

So, at one administration of the test, some form of error may cause your score to decrease by two points. Maybe you weren’t feeling well that day. In this case, knowing that $T=20$, what is $E$ in Equation 5.1, and what is $X$? At another administration, you might guess correctly on a few of the test questions, resulting in an increase of 3 based solely on error. What is $E$ now? And what is $X$?

Solving for $E$ in Equation 5.1 clarifies that random error is simply the difference between the observation and the true score, where a negative error always indicates that $X$ is too low and a positive error always indicates that $X$ is too high:

 $E=X-T$ (5.2)

So, having a cold produces $E=-2$ and $X=18$, compared to your true score of 20. And guessing correctly produces $E=3$ and $X=23$.

According to CTT, over inﬁnite administrations of a test without practice eﬀects, your true score will always be the same, and error scores will vary completely randomly, some being positive, others being negative, but being on average zero. Given these assumptions, what should your average observed score be across inﬁnite administrations of the test? And what should be the standard deviation of your observed score over these inﬁnite observed scores? In responding to these questions, you don’t have to identify speciﬁc values, but you should instead reference the means and standard deviations that you’d expect $T$ and $E$ to have. Remember that $X$ is expressed entirely as a function of $T$ and $E$, so we can derive properties of the composite from its components.

Here are the answers to these questions. Your average observed score would be your true score. Error, because it varies randomly, would cancel itself out in the long run, and your mean observed score $X$ would simply be $T$. The standard deviation of these infinite observed scores $X$ would then be entirely due to error. Since truth does not change, any change in observed scores must be error variability. This standard deviation is referred to as the standard error of measurement (SEM), discussed more below. Although it is impossible in practice to obtain the actual SEM, since you can never take a test an infinite number of times, we can estimate SEM using data from a sample of test takers. And, as we’ll see below, reliability will be estimated as the opposite of measurement error.
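The thought experiment above can be approximated with a short simulation. This is only a sketch: the function name, the true score of 20, and the error standard deviation of 2 are hypothetical, and the errors are drawn from a normal distribution purely for convenience (CTT requires only that errors be random with mean zero).

```python
import random
import statistics

def simulate_administrations(true_score, error_sd, n, seed=1):
    """Simulate n repeated administrations under X = T + E."""
    rng = random.Random(seed)
    # T stays constant; E is drawn independently each time with mean zero.
    return [true_score + rng.gauss(0, error_sd) for _ in range(n)]

# Hypothetical values: T = 20, SD of E = 2, many administrations.
scores = simulate_administrations(true_score=20, error_sd=2, n=100_000)

mean_x = statistics.fmean(scores)   # should land close to T
sd_x = statistics.pstdev(scores)    # should land close to the SD of E (the SEM)
print(round(mean_x, 2), round(sd_x, 2))
```

With a large number of simulated administrations, the mean of $X$ settles near $T$ and the standard deviation of $X$ settles near the standard deviation of $E$, which is what the SEM describes.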

Figure 5.1 presents $X$, $T$, and $E$ visually for a ﬁctitious sample of test takers. On the $x$-axis are the true scores and on the $y$-axis are observed scores. Make sure you understand where $X$, $T$, and $E$ are all located in this plot.

#### Applications of the model

Let’s think about some specific examples now of the classical test theory model. Go back to the construct of interest you identified in previous chapters. Consider how this construct is operationalized, and the kind of measurement scale that results from it. Consider the possible score range, and try to articulate $X$ and $T$ in your own example.

Next, let’s think about $E$. What might cause an individual’s observed score $X$ to differ from their true score $T$ in this situation? Think about the conditions in which the test would be administered. Think about the population of students, patients, or other individuals that you are working with. Would they tend to bring some form of error or unreliability into the measurement process?

Here’s a simple example involving preschoolers. As I mentioned in earlier chapters, some of my research involves measures of early literacy, for example, with the IGDIs. In this research, we test children’s phonological awareness by presenting them with a target image, for example, an image of a star, and asking them to identify the image among three response options that rhymes with the target image. So, we’d present the images and say, “Which one rhymes with star?” Then, for example, children might point to the image of a car.

Measurement error is problematic in a test like this for a number of reasons. First of all, preschoolers are easily distracted. Even with standardized one-on-one test administration apart from the rest of the class, children can be distracted by a variety of seemingly innocuous features of the administration or environment, from the chair they’re sitting in, to the zipper on their jacket. In the absence of things in their environment, they’ll tell you about things from home, what they had for breakfast, what they did over the weekend, or, as a last resort, things from their imagination. Second of all, because of their short attention span, the test itself has to be brief and simple to administer. Shorter tests, as mentioned above in terms of archery and other sports, are less reliable tests; having fewer items makes it more difficult to identify the reliable portion of the measurement process. In shorter tests, problems with individual items have a larger impact on the test as a whole.

Think about what would happen to $E$ and the standard deviation of $E$ if a test were very short, perhaps including only ﬁve test questions. What would happen to $E$ and its standard deviation if we increased the number of questions to 200? What might happen to $E$ and its standard deviation if we administered the test outside? These are the types of questions we will answer by considering the speciﬁc sources of measurement error and the impact we expect them to have, whether systematic or random, on our observed score.

#### Systematic and random error

Next, we need to think about whether the errors for our given testing scenarios are systematic or random. A systematic error is one that influences a person’s score in the same way at every repeated administration of a test. A random error is one that could be positive or negative for a person, one that changes randomly by administration. In the preschooler literacy example, as students focus less on the test itself and more on their surroundings, their scores might involve more guessing, which introduces random error if the guessing is truly random. Interestingly, we noticed in pilot studies of the IGDIs that students tended to choose the first option when they didn’t know the correct response. This resulted in a systematic change in their scores based on how often the correct response happened to be first.

Distinguishing between systematic and random error can be difficult. Some features of a test or test administration can produce both types of error. A popular example of systematic versus random error is demonstrated by a faulty floor scale. Revisiting the example from above, suppose I measure my oldest son’s weight every day for two weeks as soon as he gets home from school. Note that my oldest is nine years old. Suppose also that his average weight across the two weeks was 55 pounds, but that this varied with a standard deviation of 5 pounds. Think about some reasons for having such a large standard deviation. What could cause my son’s weight, according to a floor scale, to differ from his true weight at a given measurement? What about his clothing? Or how many toys are in his pockets? Or how much food he ate for lunch?

What type of error does the standard deviation not capture? You should know the answer to this question. Systematic error doesn’t vary from one measurement to the next. If the scale itself is not calibrated correctly, for example, it may overestimate or underestimate weight consistently from one measure to the next. The important point to remember here is that only one type of error is captured by $E$ in CTT: the random error. Any systematic error that occurs consistently across administrations will become part of $T$, and will not reduce our estimate of reliability.
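The floor scale example can be sketched in a few lines of code. The numbers here are assumptions made for illustration only: a true weight of 55 pounds, a hypothetical miscalibration that adds 3 pounds to every reading, and normally distributed random error with a standard deviation of 5 pounds.

```python
import random
import statistics

rng = random.Random(7)

true_weight = 55.0  # hypothetical true weight, in pounds
bias = 3.0          # hypothetical systematic error: scale always reads 3 lb heavy
random_sd = 5.0     # spread of the random, day-to-day error

# Each reading = true weight + constant bias + fresh random error.
readings = [true_weight + bias + rng.gauss(0, random_sd) for _ in range(50_000)]

mean_reading = statistics.fmean(readings)  # shifted by the bias
sd_reading = statistics.pstdev(readings)   # reflects only the random error
print(round(mean_reading, 1), round(sd_reading, 1))
```

The constant bias moves the average reading away from the true weight but leaves the standard deviation untouched, so the spread of the readings says nothing about miscalibration. In CTT terms, the bias is absorbed into $T$, and only the random component shows up as $E$.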

### Reliability and Unreliability

#### Reliability

Figure 5.2 contains a plot similar to the one in Figure 5.1 where we identified $X$, $T$, and $E$. This time, we have scores on two forms of $X$, labeled here ${X}_{1}$ and ${X}_{2}$, and we’re going to focus on the overall distances of the points from the line that goes diagonally across the plot. Once again, this line represents truth. A person with a true score of 10 on ${X}_{1}$ will get a 10 on ${X}_{2}$, based on the assumptions of the CTT model. The points themselves represent observed scores. So, the highest scores, in the upper right of the plot, appear to be about 16 on form 1 and 13 on form 2. Which of these is this person’s true score? We don’t actually know! Without administering a test an infinite number of times, and then taking the average, we can never know a person’s true score. So, in this case, we don’t know which of these, 16 or 13, is more or less erroneous. Maybe the true score is 14.5? But maybe it’s 10?

The assumptions of CTT make it possible for us to estimate the reliability of a test based on scores for a sample of individuals. This plot shows scores on two test forms, and the oval that is superimposed over the scores gives us an idea of the linear relationship between them. There appears to be a strong, positive, linear relationship. Thus, people tend to score similarly from one form to the next. The correlation coefficient for this data set, $\rho =0.80$, gives us an estimate of how similar scores are, on average, from form 1 to form 2. Because the correlation is positive and strong for this plot, we would expect a person’s score to be pretty similar from one testing to the next. Thus, for the person with scores of 16 and 13, we’d expect their true score to be close by.

Imagine if the scatter plot were instead nearly circular, with no clear linear trend from one test form to the next. The correlation in this case would be near zero. Would we expect someone to receive a similar score from one test to the next? On the other hand, imagine a scatter plot that falls perfectly on the line. If you score, e.g., 12 on one form, you also score 12 on the other. The correlation in this case would be 1. Would we expect scores to remain consistent from one test to the next?

We’re now ready for a statistical deﬁnition of reliability. In CTT, reliability is deﬁned as the proportion of variability in $X$ that is due to variability in true scores $T$:

 $r=\frac{{\sigma }_{T}^{2}}{{\sigma }_{X}^{2}}.$ (5.3)

Note that true scores are assumed to be constant in CTT for a given individual, but not across individuals. Thus, reliability is deﬁned in terms of variability in scores for a population of test takers. Why do some individuals get higher scores than others? In part because they actually have higher abilities or true scores than others, but also, in part, because of measurement error. The reliability coeﬃcient in Equation 5.3 tells us how much of our observed variability is due to true score diﬀerences.

#### Estimating reliability

Unfortunately, we can’t ever know the true scores for test takers. So we have to estimate reliability indirectly. One indirect estimate made possible by CTT is the correlation between scores on two forms of the same test, as shown in Figure 5.2. Thus, utilizing Equation A.5 from Appendix A:

 $r={\rho }_{{X}_{1}{X}_{2}}=\frac{{\sigma }_{{X}_{1}{X}_{2}}}{{\sigma }_{{X}_{1}}{\sigma }_{{X}_{2}}}.$ (5.4)

This correlation is estimated as the covariance, or the shared variance between the distributions on two forms, divided by a product of the standard deviations, or the total available variance within each distribution.
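To see why the correlation between two forms can stand in for the ratio in Equation 5.3, it may help to simulate a case where true scores are known. The sketch below is illustrative only: the sample size, the true score and error standard deviations (chosen so that the reliability by definition is 0.80), the normal distributions, and the helper `pearson` are all assumptions.

```python
import random
import statistics

def pearson(x, y):
    """Correlation as covariance divided by the product of the SDs (Equation 5.4)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

rng = random.Random(42)
n = 50_000
true_sd, error_sd = 4.0, 2.0  # var(T) / var(X) = 16 / (16 + 4) = 0.80

# One true score per person; two forms with independent random errors.
t = [rng.gauss(10, true_sd) for _ in range(n)]
x1 = [ti + rng.gauss(0, error_sd) for ti in t]
x2 = [ti + rng.gauss(0, error_sd) for ti in t]

reliability_by_definition = statistics.pvariance(t) / statistics.pvariance(x1)
reliability_by_correlation = pearson(x1, x2)
print(round(reliability_by_definition, 2), round(reliability_by_correlation, 2))
```

In real data only `x1` and `x2` are available, so only the second quantity can be computed. The simulation suggests that, when the CTT assumptions hold, it lands close to the first.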

There are other methods for estimating reliability from a single form of a test. The only ones presented here are split-half reliability and coeﬃcient alpha. Split-half is only presented because of its connection to what’s called the Spearman-Brown reliability formula. The split-half method predates coeﬃcient alpha, and is computationally simpler. It takes scores on a single test form, and separates them into scores on two halves of the test, which are treated as separate test forms. The correlation between these two halves then represents an indirect estimate of reliability, based on Equation 5.3.

The Spearman-Brown formula was originally used to correct for the reduction in reliability that occurred when correlating two test forms that were only half the length of the original test. In theory, reliability will increase as we add items to a test. Thus, Spearman-Brown is used to estimate, or predict, what the reliability would be if the half-length tests were made into full-length tests. The formula also has other practical uses. Today, it is most commonly used to predict how reliability would change if a test form were reduced or increased in length. For example, if you are developing a test and you gather pilot data on 20 test items with an estimated reliability of 0.60, Spearman-Brown can be used to predict how this reliability would go up if you increased the test length to 30 or 40 items.

The Spearman-Brown reliability, ${r}_{new}$, is estimated as a function of what’s labeled here as the old reliability ${r}_{old}$ and the factor by which the length of $X$ is predicted to change, $k$:

 ${r}_{new}=\frac{k{r}_{old}}{\left(k-1\right){r}_{old}+1}$ (5.5)

Again, $k$ is the factor by which the test length is increased or decreased. It is equal to the number of items in the new test divided by the number of items in the original test. Multiply $k$ by the old reliability, and then divide the result by $\left(k-1\right)$ times the old reliability, plus 1. For the example mentioned above, going from 20 to 30 items, we have $\left(30∕20\right)×0.60=0.90$ divided by $\left(30∕20-1\right)×0.60+1=1.30$, which gives $0.69$. Going to 40 items, we have a new reliability of 0.75.
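Equation 5.5 is simple enough to wrap in a small function. The following is a minimal sketch (the function name `spearman_brown` is just a label chosen here) that reproduces the 20-item example with an old reliability of 0.60.

```python
def spearman_brown(r_old, k):
    """Predicted reliability when test length changes by a factor of k (Equation 5.5)."""
    return (k * r_old) / ((k - 1) * r_old + 1)

# Lengthening a 20-item test with reliability 0.60.
r30 = spearman_brown(0.60, 30 / 20)
r40 = spearman_brown(0.60, 40 / 20)
print(round(r30, 2))  # 0.69
print(round(r40, 2))  # 0.75

# Shortening works the same way, with k below 1 (here, 20 items cut to 10).
r10 = spearman_brown(0.60, 10 / 20)
print(round(r10, 2))  # 0.43
```

The same function covers the original split-half use: passing the correlation between two half-tests with `k = 2` gives the predicted reliability of the full-length test.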

Alpha is arguably the most popular form of reliability. Many people refer to it as “Cronbach’s alpha,” but Cronbach himself never intended to claim authorship for it and in later years he regretted the fact that it was attributed to him (see Cronbach & Shavelson, 2004). The popularity of alpha is due to the fact that it can be calculated using scores from a single test form, rather than two separate administrations or split halves. Splitting a test equally can be difficult, because the split-half reliability will be impacted by how similar the two chosen half-tests are. Administering a test twice can also be challenging. Coefficient alpha avoids these problems by estimating reliability using the item responses themselves as many miniature versions of the total test. Alpha is defined as

 $r=\alpha =\left(\frac{J}{J-1}\right)\left(\frac{{\sigma }_{X}^{2}-\sum {\sigma }_{{X}_{j}}^{2}}{{\sigma }_{X}^{2}}\right),$ (5.6)

where $J$ is the number of items on the test, ${\sigma }_{X}^{2}$ is the variance of observed total scores on $X$, and $\sum {\sigma }_{{X}_{j}}^{2}$ is the sum of variances for each item $j$ on $X$. To see how it relates to the CTT definition of reliability in Equation 5.3, consider the top of the second fraction in Equation 5.6. The total test variance ${\sigma }_{X}^{2}$ captures all the variability available in the total scores for the test. We’re subtracting from it the variances that are unique to the individual items themselves. What’s left over? Only the shared variability among the items in the test. We then divide this shared variability by the total available variability. Within the formula for alpha you should see the general formula for reliability, true variance over observed.

Keep this in mind: alpha is an estimate of reliability, just like the correlation is. So, any equation requiring an estimate of reliability, like SEM below, can be computed using either a correlation coefficient or an alpha coefficient. Students often struggle with this point: correlation is one estimate of reliability, alpha is another. They’re both estimating the same thing, but in different ways based on different reliability study designs.
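As a concrete illustration of the alpha formula, here is a minimal sketch that computes alpha from a small matrix of item scores. The function name and the response data (5 test takers, 4 items scored 0 or 1) are made up for this example. Population variances are used throughout; using sample variances everywhere instead would give the same alpha, because the correction factor cancels in the ratio.

```python
import statistics

def coefficient_alpha(item_scores):
    """Coefficient alpha from rows of item scores (one row per test taker)."""
    n_items = len(item_scores[0])
    # Variance of total scores: all variability available in X.
    totals = [sum(row) for row in item_scores]
    total_var = statistics.pvariance(totals)
    # Variance of each item, computed column by column.
    item_vars = [
        statistics.pvariance([row[j] for row in item_scores])
        for j in range(n_items)
    ]
    shared = total_var - sum(item_vars)
    return (n_items / (n_items - 1)) * (shared / total_var)

# Hypothetical responses: 5 test takers by 4 dichotomous items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = coefficient_alpha(responses)
print(round(alpha, 2))  # 0.8
```

For these data the total score variance is 2.0 and the item variances sum to 0.8, so alpha is $(4∕3)×(1.2∕2.0)=0.80$.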

#### Unreliability

Now that we’ve deﬁned reliability in terms of the proportion of observed variance that is true, we can deﬁne unreliability as the portion of observed variance that is error. This is simply 1 minus the reliability:

 $1-r=\frac{{\sigma }_{E}^{2}}{{\sigma }_{X}^{2}}.$ (5.7)
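The step from reliability to unreliability can be written out explicitly. This is a sketch of the reasoning, filled in here from the CTT assumptions stated earlier rather than quoted from another source. Because $T$ and $E$ are assumed to be uncorrelated, the variance of the composite $X=T+E$ is the sum of the component variances:

 ${\sigma }_{X}^{2}={\sigma }_{T}^{2}+{\sigma }_{E}^{2}$

Substituting the definition of reliability, $r={\sigma }_{T}^{2}∕{\sigma }_{X}^{2}$, then gives

 $1-r=1-\frac{{\sigma }_{T}^{2}}{{\sigma }_{X}^{2}}=\frac{{\sigma }_{X}^{2}-{\sigma }_{T}^{2}}{{\sigma }_{X}^{2}}=\frac{{\sigma }_{E}^{2}}{{\sigma }_{X}^{2}}.$

Multiplying both sides by ${\sigma }_{X}^{2}$ and taking the square root shows that the standard deviation of the error scores is ${\sigma }_{E}={\sigma }_{X}\sqrt{1-r}$.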

Typically, we’re more interested in how the unreliability of a test can be expressed in terms of the available observed variability. Thus, we multiply the standard deviation of $X$ by the square root of the unreliable proportion of variance to obtain the SEM:

 $SEM={\sigma }_{X}\sqrt{1-r}$ (5.8)

The SEM is the average variability in observed scores attributable to error. Like any statistical standard error, it can be used to create a confidence interval around the statistic that it estimates, that is, $T$. Since we don’t have $T$, we instead create the confidence interval around $X$ to index how confident we are that $T$ falls within it for a given individual. For example, the verbal reasoning subtest of the GRE is reported to have a reliability of 0.93 and an SEM of 2.2, on a scale that ranges from 130 to 170. Thus, an observed verbal reasoning score of 155 has a 95% confidence interval of about $±4.3$ points, that is, $1.96×2.2$. At $X=155$, we are 95% confident that the true score falls somewhere between 150.7 and 159.3. Note that the reliability for verbal reasoning on the GRE is actually estimated using IRT, as discussed in Chapter 8.
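The SEM and its confidence interval are easy to compute directly. The sketch below uses hypothetical function names, assumes a normal-theory multiplier of 1.96 for a 95% interval, and applies it to the reported GRE SEM of 2.2 at an observed score of 155. The standard deviation of 10 and reliability of 0.84 in the first call are invented values used only to show Equation 5.8 at work.

```python
import math

def standard_error_of_measurement(sd_x, reliability):
    """SEM = SD of X times the square root of (1 - reliability), as in Equation 5.8."""
    return sd_x * math.sqrt(1 - reliability)

def confidence_interval(observed, sem, z=1.96):
    """Interval around an observed score; z = 1.96 assumes a 95% normal-theory interval."""
    half_width = z * sem
    return observed - half_width, observed + half_width

# Invented values: SD of 10 and reliability of 0.84 give an SEM of about 4.
example_sem = standard_error_of_measurement(10, 0.84)

# GRE verbal reasoning example: reported SEM of 2.2, observed score of 155.
low, high = confidence_interval(155, 2.2)
print(round(example_sem, 2), round(low, 1), round(high, 1))
```

Note that a higher reliability or a smaller standard deviation of $X$ both shrink the SEM, and with it the width of the interval.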

#### Interpreting reliability and unreliability

There are no agreed-upon standards for interpreting reliability coefficients. Note that reliability is bounded by 0 on the lower end and 1 at the upper end, because, by definition, the amount of true variability can never be less than zero or more than the total available variability in $X$. Higher reliability is clearly better, but cutoffs for acceptable levels of reliability vary for different fields, situations, and types of tests. The stakes of a test are an important consideration when interpreting reliability coefficients. The higher the stakes, the higher we expect reliability to be. Otherwise, cutoffs depend on the particular application.

For purposes of this course, which focuses on educational and psychological measurement, we’ll use the scale presented in Table 5.1. These guidelines apply to medium-stakes tests, where a reliability of 0.70 is sometimes considered minimally acceptable, 0.80 is decent, 0.90 is quite good, and anything above 0.90 is excellent. High stakes tests should have reliabilities at or above 0.90. Low stakes tests, which are often simpler and shorter than higher-stakes ones, often have reliabilities around 0.70.

A few additional considerations are necessary when interpreting coeﬃcient alpha. First, alpha assumes that all items measure the same single construct. Items are also assumed to be equally related to this construct, that is, they are assumed to be parallel measures of the construct. When the items are not parallel measures of the construct, alpha is considered a lower-bound estimate of reliability, that is, the true reliability for the test is expected to be higher than indicated by alpha. Finally, alpha is not a measure of dimensionality. It is frequently claimed that a strong coeﬃcient alpha supports the unidimensionality of a measure. However, alpha does not index dimensionality. It is impacted by the extent to which all of the test items measure a single construct, but it does not necessarily go up or down as a test becomes more or less unidimensional.

Table 5.1: General Guidelines for Interpreting Reliability Coeﬃcients

| Reliability ($r$) | Interpretation (High Stakes) | Interpretation (Low Stakes) |
| --- | --- | --- |
| $\ge 0.90$ | Excellent | Excellent |
| $0.80\le r<0.90$ | Good | Excellent |
| $0.70\le r<0.80$ | Acceptable | Good |
| $0.60\le r<0.70$ | Borderline | Acceptable |
| $0.50\le r<0.60$ | Low | Borderline |
| $0.20\le r<0.50$ | Unacceptable | Low |
| $0.00\le r<0.20$ | Unacceptable | Unacceptable |

### Reliability Study Designs

Now that we’ve established the major estimates of reliability and unreliability, we can ﬁnally discuss the four main study designs that allow us to collect data for our estimates. These designs are referred to as internal consistency, equivalence, stability, and equivalence/stability designs. Each design produces a corresponding type of reliability that is expected to be impacted by diﬀerent sources of measurement error.

The four standard study designs vary in the number of test forms and the number of testing occasions involved in the study. Until now, we’ve been talking about using two test forms on two separate administrations. This study design is found in the lower right corner of Table 5.2, and it provides us with an estimate of equivalence (for two different forms of a test) and stability (across two different administrations of the test). This study design has the potential to capture the most sources of measurement error, and it can thus produce the lowest estimate of reliability, because of the two factors involved. The more time that passes between administrations, and as two test forms differ more in their content and other features, the more error we would expect to be introduced. On the other hand, as our two test forms are administered closer in time, we move from the lower right corner to the upper right corner of Table 5.2, and our estimate of reliability captures less of the measurement error introduced by the passage of time. We’re left with an estimate of the equivalence between the two forms.

As our test forms become more and more equivalent, we eventually end up with the same test form, and we move to the ﬁrst column in Table 5.2, where one of two types of reliability is estimated. First, if we administer the same test twice with time passing between administrations, we have an estimate of the stability of our measurement over time. Given that the same test is given twice, any measurement error will be due to the passage of time, rather than diﬀerences between the test forms. Second, if we administer one test only once, we no longer have an estimate of stability, and we also no longer have an estimate of reliability that is based on correlation. Instead, we have an estimate of what is referred to as the internal consistency of the measurement. This is based on the relationships among the test items themselves, which we treat as miniature alternate forms of the test. The resulting reliability estimate is impacted by error that comes from the items themselves being unstable estimates of the construct of interest.

Table 5.2: Four Main Reliability Study Designs

| | 1 Form | 2 Forms |
| --- | --- | --- |
| 1 Occasion | Internal Consistency | Equivalence |
| 2 Occasions | Stability | Equivalence and Stability |

Note that internal consistency reliability is estimated using either coeﬃcient alpha or split-half reliability. All the remaining cells in Table 5.2 involve estimates of reliability that are based on correlation coeﬃcients.

### Summary and Homework

This chapter provided an overview of reliability within the framework of CTT. After a general definition of reliability in terms of consistency, the CTT model and assumptions were presented in connection with statistical inference and measurement models, along with examples of random and systematic error as defined by CTT. Reliability and unreliability were discussed as products of CTT. Lastly, the four main study designs were reviewed. In Chapter 6, we’ll cover a specific implementation of reliability for situations where scores come not from two test forms but from two judges or raters providing scores to a single group of test takers.

#### Learning objectives

1. Define reliability, including potential sources of reliability and unreliability in measurement, using examples.
2. Describe the simplifying assumptions of the classical test theory (CTT) model and how they are used to obtain true scores and reliability.
3. Identify the components of the CTT model ($X$, $T$, and $E$) and describe how they relate to one another, using examples.
4. Describe the difference between systematic and random error, including examples of each.
5. Explain the relationship between the reliability coefficient and standard error of measurement, and identify how the two are distinguished in practice.
6. Calculate the standard error of measurement and describe it conceptually.
7. Compare and contrast the three main ways of assessing reliability, test-retest, parallel-forms, and internal consistency, using examples, and identify appropriate applications of each.
8. Compare and contrast the four reliability study designs, based on 1 to 2 test forms and 1 to 2 testing occasions, in terms of the sources of error that each design accounts for, and identify appropriate applications of each.
9. Use the Spearman-Brown formula to predict change in reliability.
10. Describe the formula for coefficient alpha, the assumptions it is based on, and what factors impact it as an estimate of reliability.
11. Estimate different forms of reliability using statistical software, and interpret the results.
12. Describe factors related to the test, the test administration, and the examinees, that affect reliability.

#### Discussion Questions

1. Explain the CTT model and its assumptions using the archery example presented at the beginning of the chapter.
2. Suppose you want to reduce the SEM for a final exam in a course you are teaching. Identify three sources of measurement error that could contribute to the SEM, and three that could not. Then, consider strategies for reducing error from these sources.
3. Why do we expect reliability to increase as a test gets longer, as in the Spearman-Brown formula? Use examples to explain the rationale that this prediction is based on.
4. Dr. Phil is developing a measure of relationship quality to be used in counseling settings with couples. He intends to administer the measure to couples multiple times over a series of counseling sessions. Describe an appropriate study design for examining the reliability of this measure.
5. More and more TV shows lately seem to involve people performing some talent on stage and then being critiqued by a panel of judges, one of whom is British. Describe the “true score” for a performer in this scenario, and identify sources of measurement error that could result from the judging process, including both systematic and random sources of error.