Chapter 8
Item Response Theory

One could make a case that item response theory is the most important statistical method about which most of us know little or nothing.


— David Kenny

Introduction

Item response theory (IRT) is arguably one of the most influential developments in the field of educational and psychological measurement. IRT provides a foundation for statistical methods that are utilized in contexts such as test development, item analysis, equating, item banking, and computerized adaptive testing. Its applications also extend to the measurement of a variety of latent constructs in a variety of disciplines.

The chapter begins with a comparison of IRT with classical test theory (CTT), including a discussion of their strengths and weaknesses and some common uses of each. Next, the traditional dichotomous IRT models are introduced with definitions of key terms and a comparison based on assumptions, benefits, limitations, and common uses. Finally, details are provided on some common applications of IRT in item analysis, test development, item banking, and computer adaptive testing.

IRT versus CTT

Reviewing CTT

Since its development in the 1950s and 1960s (Lord, 1952; Rasch, 1960), IRT has become the preferred statistical methodology for item analysis and test development. The success of IRT over its predecessor CTT comes primarily from the focus of IRT on the individual components that make up a test, that is, the test items themselves. By modeling outcomes at the item level, rather than at the test level as in CTT, IRT is more complex but also more comprehensive in terms of the information it provides about test performance.

As reviewed in Chapter 5, CTT gives us a model for the observed total score X. Recall from Equation 5.1 that the CTT model decomposes X into two parts, truth (T) and error (E):

X = T + E (8.1)

The true score T is the construct we’re intending to measure, and we assume it plays some systematic role in causing people to obtain observed scores on X. The error E is everything randomly unrelated to the construct we’re intending to measure. It also has a direct impact on X. The two item statistics that come from CTT are the mean performance on a given item, referred to as a p-value for dichotomous items, and the (corrected) item-total correlation for an item. At this point, you should be familiar with these two statistics, item difficulty and item discrimination, how they are related, and what they tell us about the items in a test.
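To make these two item statistics concrete, here is a minimal sketch in Python using made-up 0/1 response data; the variable names are arbitrary and the data are simulated purely to show the computations.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.binomial(1, 0.6, size=(100, 10))   # simulated 0/1 item scores: rows = examinees

# Item difficulty: the p-value, or mean score, for each item
p_values = X.mean(axis=0)

# Corrected item-total correlation: correlate each item with the total
# score computed from the remaining items
total = X.sum(axis=1)
citc = [np.corrcoef(X[:, j], total - X[:, j])[0, 1] for j in range(X.shape[1])]

print(np.round(p_values, 2))
print(np.round(citc, 2))
```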

It should be apparent that CTT is a simple model of test performance. The observed score X is assumed to consist of two unrelated parts: truth, or that which is consistent across infinite hypothetical testing administrations, and error, or that which varies randomly across infinite hypothetical testing administrations. Anything consistent from one test administration to the next is automatically captured by truth. If an examinee consistently cheats, her or his ability estimate would be inflated (too high), but, according to this model, we would have no way of knowing it.

The simplicity of the CTT model brings up its main limitation: the score scale is dependent on the items in the test and the people taking the test. CTT is referred to as sample and item/test dependent because 1) any X, T, or E that you obtain for a particular test taker only has meaning within the test she or he took, and 2) any item difficulty or discrimination statistics you estimate only have meaning within a particular sample of test takers. So, person (ability) parameters (i.e., T) are dependent on the test, and item parameters (difficulty and discrimination) are dependent on the test takers.

For example, imagine you score a 6 out of 10 on a quiz in this class. What do we really know about your intro measurement mastery and ability based on this score? We only know that you obtained 6 out of 10 possible points. Maybe your friend who took the class last semester got a 9 out of 10 on a different version of the same quiz (I update quiz content each semester). Can we say with confidence that you are less able? Not really. All we know is that, relative to the items you took, you scored lower than your friend, relative to the items she or he took. Using the SEM, we could calculate a confidence interval around your score of 6 (give this a try with a standard deviation of 2 and a reliability of .84), but your true ability estimate can still only be interpreted within the context of the test you took.
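For reference, here is that confidence interval exercise worked out in a short Python snippet, assuming the values suggested above (a score of 6, standard deviation 2, reliability .84) and a 95% interval:

```python
import math

sd, reliability, score = 2.0, 0.84, 6
sem = sd * math.sqrt(1 - reliability)                   # 2 * sqrt(.16) = 0.80
lower, upper = score - 1.96 * sem, score + 1.96 * sem
print(round(sem, 2), round(lower, 2), round(upper, 2))  # 0.8 4.43 7.57
```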

Furthermore, imagine that next semester I give this same quiz to a new group of students and the discrimination (ITC) for one item is .12. That’s low, and it suggests that there’s a problem with the item. However, maybe the same item has an ITC of .52 in the midterm this semester. Which ITC is correct? According to CTT, they’re each correct, for the sample with which they are calculated! In CTT there is technically no absolute item difficulty or discrimination that generalizes across samples or populations of examinees. Likewise, there is no absolute ability estimate that generalizes across samples of items. This is the main limitation of CTT: it is sample and item/test dependent.

A second major limitation of CTT results from the fact that the model is specified using total scores. Because we rely on total scores in CTT, a given test only produces one estimate of reliability and one estimate of SEM, and these are assumed to be the same for all people taking the test. The consistency we expect to see in scores, and, conversely, the error we expect to see in scores, is the same regardless of ability level. This limitation is especially problematic when a set of items does not match the ability level of a group of people. For example, consider a difficult high-stakes test with a mean of 50 out of 100, standard deviation 15, and internal consistency reliability of .85. Overall, the test seems reliable. However, given the difficulty of the test, should we expect the reliability and SEM to be the same for a subset of low ability examinees with mean total score 10 and much more variable scores, e.g., with standard deviation 20? What about for a subset of higher ability examinees with a mean total score of 60 and standard deviation of 10? You should calculate and compare the SEM for these two examples.
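If you want to check your work, here is that comparison sketched in Python, using the CTT formula SEM = SD√(1 − reliability) and the single reliability of .85 that CTT assumes applies to everyone:

```python
import math

reliability = 0.85  # CTT assumes this one value holds for all examinees
for label, sd in [("low-ability subset", 20), ("higher-ability subset", 10)]:
    print(label, round(sd * math.sqrt(1 - reliability), 2))
# low-ability subset 7.75
# higher-ability subset 3.87
```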

When people tend to get all the items in a test correct or incorrect, their scores provide us with limited information about their ability. In an extreme case, if someone gets most of the items incorrect, we know that their ability is low, but aren’t as certain how low it really is, and on repeated testings we’d expect their score to fluctuate widely depending on the difficulty of the items they respond to. In situations like this, we expect reliability to be much lower than for examinees who are responding to items that match their ability level. However, CTT provides us with a single reliability, and SEM, that is applied to everyone, regardless of ability. Thus, the second major limitation of CTT is that SEM is constant and does not depend on ability.

Comparing with IRT

IRT was developed to address the main limitations of CTT, those of sample and item/test dependence and a single SEM. It too provides a model of test performance; however, the model is defined at the item level, meaning there is (essentially) a separate model equation for each item in the test. So, IRT involves item score models, as opposed to a single total score model. When the assumptions of the model are met, IRT parameters are, in theory, sample and item independent. This means that a person should have the same ability estimate no matter which set of items she or he takes, and a given item should have the same difficulty and discrimination no matter who is taking the test.

IRT also takes into account the difficulty of the items that a person responds to when estimating the person’s ability level. Although the ability estimate itself, in theory, does not depend on the items, as described above, the precision with which we estimate ability does depend on the items taken. Estimates of ability are more precise when they’re based on items that are close to a person’s ability level, whereas precision goes down when there are mismatches between ability and item difficulty. Thus, SEM in IRT depends on the ability of the person and the difficulty of the items given.

The main limitation of IRT is that it is a complex model requiring much larger samples of people and items than would be needed to utilize CTT. Whereas in CTT a recommended minimum of 100 examinees is often sufficient to conduct an item analysis (see Chapter 7), in IRT as many as 500 or 1,000 examinees may be needed to obtain stable results, depending on the complexity of the chosen model (there are many different IRT models).

Another key difference between IRT and CTT has to do with the shape of the relationship that we estimate between item score and construct score. IRT models a curvilinear relationship between the two, whereas CTT models a simple linear relationship. Recall from Chapter 7 that the discrimination for an item can be represented by a line in a scatterplot, with item performance on the y-axis and the construct on the x-axis; a strong positive item discrimination is evident in a line with a positive slope where observations are grouped tightly around the line. Because it is based on a correlation, the ITC is always represented by a straight line. In IRT, we have something very similar, but the line follows what’s called a logistic function.

Figure 8.1 provides an example of CTT item discrimination. The plot includes item scores for one item (y-axis) plotted against total scores calculated across seven items (x-axis) taken by 100 people. The points are shifted left and right a bit on the x-axis to clarify how many people are at each total score. The straight line represents the discrimination, where ITC = .77. Although the discrimination is high, what do you notice about “high ability” individuals, e.g., those with total scores at or above 4 on the x-axis? How do they perform on the item?




Figure 8.1: Example of item discrimination from CTT.

The scatterplot in Figure 8.1 shows how a straight line is not the best way to represent the relationship between item score and total score. If the line represented our prediction for how well a person would do on the item, given their total score, we’re predicting outside the range of possible scores (0/1) for people below a total score of 2. If we go from the item to the total score, what does an item score of 0 tell us about how well someone will do overall? It should be clear from Figure 8.1 that a curvy line might do better at capturing the relationship between item scores and total scores.

IRT takes the same data plotted in Figure 8.1 and does two things with it. First, it estimates an ability parameter for each person. This is based on all the items in the test and it replaces the total score on the x-axis. The ability scale is arbitrary in IRT; it is usually centered at 0 and given a standard deviation of 1, like a z-score for a normal distribution. Second, in IRT we actually predict how likely a person is to get an item correct. In CTT, this prediction doesn’t actually happen; the ITC can be represented by a line through the scatterplot, but this line isn’t optimized to represent the observed data. The curvy discrimination line in IRT is intended to represent the observed data. It shows the predicted probability of getting an item correct given a person’s ability level.

Figure 8.2 shows what happens when we use the data from Figure 8.1 in an IRT model. Remember, the line represents the prediction based on the item discrimination, and the dots in this plot represent observed responses. Notice that the same general relationship is displayed in Figure 8.2 as in Figure 8.1: as you become more able (moving right on the x-axis) you are predicted to do better on the item (the y-axis). However, in Figure 8.2 our prediction is curvilinear. It never drops below a certain value on the left (called the lower asymptote or guessing parameter, c), it never rises above 1 on the right, and it has a specific slope (called the discrimination parameter, a) at its center (which is located at the difficulty parameter, b).

Although we still have to define the model and how it works, the IRT representation of the relationship between item response and ability should already make more sense than the CTT one.




Figure 8.2: Example of item discrimination in the form of an item response function from IRT.

Traditional IRT Models

Terminology

Understanding IRT requires that we nail down some new terms. The underlying trait in IRT is referred to as theta (𝜃). This is the same construct we discussed in terms of CTT, reliability, and our earlier measurement models. In IRT we simply label it as theta. The theta scale is another name for the ability scale. Recall that the left side of the CTT model was the total score X. The IRT model has an item score on the left, and what is modeled is the probability of a correct response for a given item. Recall also that in CTT, we focus only on person ability in the model itself. In IRT, we include person ability and item parameters (difficulty, discrimination, and lower-asymptote). The item parameters are used to define the item response function (IRF; also known as the item characteristic curve), or the relationship between ability and the predicted performance on the item, as shown in Figure 8.2. Finally, this IRF gets its curviness from the equation that is used to produce it, which is defined by a logistic curve.

We can combine IRFs across multiple items to get what’s called a test response function (TRF) or test characteristic curve. In essence, this is simply the IRFs all added together, and it gives us the predicted, or expected, number of items a person will get correct on the test. In other words, it tells us what proportion of the test we’d predict you to get correct, based on ability. Finally, in other readings on IRT, you may encounter what’s called the item information function and test information function. These simply summarize where on the ability scale our item and test provide the most discrimination power, i.e., information. The inverse of the test information function is the test error function, which gives a different SEM for each ability level. Note that SEM is lowest where the majority of item difficulties are found, and error tends to increase the further you get from the distribution of items, that is, in the tails of the ability scale.
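As a rough illustration of these ideas, the sketch below assumes a two-parameter logistic model (defined formally in the next section) with made-up item parameters; it adds IRFs to form a TRF, sums item information into test information, and converts information into a conditional SEM. The function name irf_2pl is just a placeholder.

```python
import numpy as np

def irf_2pl(theta, a, b):
    # Two-parameter logistic IRF: probability of a correct response
    return 1 / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 81)
a = np.array([1.2, 0.8, 1.5, 1.0])   # made-up discrimination parameters
b = np.array([-1.0, 0.0, 0.5, 1.5])  # made-up difficulty parameters

p = np.array([irf_2pl(theta, ai, bi) for ai, bi in zip(a, b)])  # one IRF per item
trf = p.sum(axis=0)                                  # TRF: expected number correct
info = (a[:, None] ** 2 * p * (1 - p)).sum(axis=0)   # test information (2PL form)
sem = 1 / np.sqrt(info)                              # conditional SEM
print(theta[np.argmin(sem)])                         # ability where the test is most precise
```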

To recap, here are the key terms you need to know, with abbreviations as defined above and in the next section:

- theta (𝜃), the latent trait or ability, and the theta (ability) scale
- the probability of correct response, Pr(X = 1), for a given item
- the logistic curve, the functional form that gives the IRF its shape
- the item response function (IRF), also known as the item characteristic curve (ICC)
- the item parameters: difficulty (b), discrimination (a), and the lower asymptote (c)
- the test response function (TRF), or test characteristic curve
- the item and test information functions, and the corresponding SEM at each ability level

The IRT models

Using this terminology, we can now examine the traditional IRT models. Equation 8.2 contains what is referred to as the three-parameter IRT model, because it includes all three available item parameters. You don’t need to memorize this equation. I’m including it here simply for comparison purposes, and so you know what it looks like. As noted above, in IRT we’re modeling the probability of correct response on a given item (Pr(X = 1)) as a function of person ability (𝜃) and certain properties of the item itself, namely: a, how well the item discriminates between low and high ability examinees; b, how difficult the item is, or the ability level at which we’d expect people to have a Pr = .50 of getting the item right; and c, the lowest Pr that we’d expect to see on the item by chance alone:

Pr(X = 1) = c + (1 − c) e^{a(𝜃 − b)} / (1 + e^{a(𝜃 − b)}) (8.2)

The a and b parameters should make sense. They are IRT versions of what we’ve already discussed for CTT and item analysis: a corresponds to the ITC, where a larger a indicates better discrimination; b corresponds to the inverse of the p-value, where a low b indicates an easy item and a high b indicates a difficult item. The c parameter should be pretty intuitive if you think of its application to multiple-choice questions. No matter how low a person’s ability, do they ever have a probability of zero of getting a multiple-choice item correct? What, more realistically, is the lowest chance of getting a multiple-choice item correct? Think about 100 test questions, each with 4 response options. How many of these should a low-ability monkey get right by chance alone? Then, what is the probability of the monkey getting any given item correct? In IRT, we acknowledge with the c parameter that the probability of correct response may not be zero. So, the lowest we can go with Equation 8.2 is c (notice it’s the first value in the equation). Then, the rest of the equation squeezes the logistic curve, e^{a(𝜃 − b)} / (1 + e^{a(𝜃 − b)}), into the remaining space, (1 − c), between c and 1.

Again, don’t be alarmed if Equation 8.2 gives you a slight headache. You don’t have to memorize it. You just need to know what it does, what goes into the model, and then what comes out. Focus first on the difference we take between ability and item difficulty, in (𝜃 − b). If someone is high ability and taking a very easy item (low b), we’re going to get a large difference between the two. This large difference filters through the rest of the equation to give us a higher prediction of how well the person will do on the item. The difference is multiplied by the discrimination parameter, so that, if the item is highly discriminating, the difference between ability and difficulty is magnified. If the discrimination is low, e.g., .5, the difference between ability and difficulty is cut in half before we use it to determine the probability of correct response. The fractional part and the “exponential” term represented by e are there to bend the straight line of the ITC into a smooth curve with lower and upper asymptotes at c and 1. Then, everything on the right of the equal sign in Equation 8.2 is used to estimate the left side, that is, how well a person with a given ability would do on a given item.
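If it helps to see Equation 8.2 as code, here is a minimal sketch of the three-parameter IRF as a Python function; the name irf_3pl and the example parameter values are made up for illustration.

```python
import numpy as np

def irf_3pl(theta, a=1.0, b=0.0, c=0.0):
    # Equation 8.2: probability of a correct response under the 3PL model
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# A fairly difficult, highly discriminating multiple-choice item:
# probabilities rise from near c toward 1 as ability increases
print(np.round(irf_3pl(np.array([-2.0, 0.0, 1.0, 2.0]), a=2.0, b=1.0, c=0.25), 2))
```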

Recall that, for a single item in CTT, we have a p-value to represent performance on the item. This p-value tells us the mean performance for all people on the item. In IRT, we do something different. We find the mean performance by ability level. If the p-value for an entire sample of people is .5, what would you expect the p-value for just the high ability examinees to be? And what would you expect the p-value to be for just the low ability examinees? It turns out that if you gather lots of data and plot these ability-specific p-values for a single item, you end up with a curve like the one in Figure 8.2. The IRT model in Equation 8.2 is the equation that produces that curve.
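Here is a rough sketch of that idea with simulated data: examinees are grouped by total score (standing in for ability), and the p-value for a single item is computed within each group. The conditional p-values increase with total score and trace out a curve like the one in Figure 8.2. All values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
theta = rng.normal(size=1000)                          # simulated abilities
b = np.array([-1.5, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])    # made-up item difficulties
prob = 1 / (1 + np.exp(-(theta[:, None] - b)))         # Rasch-type probabilities
X = (rng.random((1000, 7)) < prob).astype(int)         # simulated 0/1 responses

total = X.sum(axis=1)
item = 3  # look at one item
for t in range(8):
    group = total == t
    if group.any():
        # p-value for this item among examinees with total score t
        print(t, round(X[group, item].mean(), 2))
```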

Figure 8.3 contains item response functions, that is, predicted probability of correct response, for seven different items having different discrimination, difficulty, and lower-asymptote parameters. The pink item would be considered the most difficult, as we only begin to predict that a person will get it correct once we move past an ability of 1. This item also has the highest discrimination. It is most useful for distinguishing among examinees with ability levels between about 0 and 2.5; below and above these values, the item provides little discrimination. Finally, this item appears to have a lower-asymptote of 0, suggesting it is probably not based on a multiple-choice question where guessing can impact scores.

Make sure you can identify the three item parameters for each of the curves plotted in Figure 8.3. You should be able to compare the items in terms of easiness/difficulty, low and high discrimination, and low and high predicted probability of guessing correctly. Here are some questions and answers for comparing the items in Figure 8.3:




Figure 8.3: Item response functions for items with different discrimination, difficulty, and lower-asymptote parameters.

There are two other traditional, dichotomous IRT models that are simplified versions of the three-parameter model in Equation 8.2. In the two-parameter IRT model, we remove the c parameter and ignore the fact that guessing may impact our predictions regarding how well a person will do on an item:

Pr(X = 1) = e^{a(𝜃 − b)} / (1 + e^{a(𝜃 − b)}) (8.3)

This may seem like an unreasonable simplification, especially for multiple-choice items, but the two-parameter model is commonly used. We simply assume that the impact of guessing is negligible. Applying the two-parameter model in Equation 8.3 to the items in Figure 8.3, we would see all the lower asymptotes pulled down to zero; the left tails of all the IRF curves would then approach a probability of 0. In the one-parameter model (called the Rasch model, after Georg Rasch, who first introduced it), we remove the c and a parameters and ignore the impact of guessing and of items having differing discriminations. We assume that guessing is again negligible, and that discrimination is the same, a = 1, for all items:

Pr(X = 1) = e^{(𝜃 − b)} / (1 + e^{(𝜃 − b)}) (8.4)

This may again seem like an unreasonable simplification, but the Rasch model is actually very popular, perhaps even more popular than the two-parameter or three-parameter models. Its popularity is due to its simplicity; because we only need to estimate person ability and item difficulty, we can utilize the model with much smaller samples (e.g., 100 to 200 people per item, in some cases) than with the more complex models (which often require 500 to 1000 people).

Assumptions

The three traditional IRT models discussed above all involve two main assumptions, both having to do with the overall requirement that the model we choose is “correct” for a given situation. This correctness is defined based on 1) the dimensionality of the construct, that is, how many constructs are causing people to respond in a certain way to the items, and 2) the shape of the IRF, that is, which of the three item parameters are necessary for modeling item performance.

In Equations 8.2, 8.3, and 8.4 we have a single 𝜃 parameter. Thus, in these IRT models we assume that a single person attribute or ability underlies the item responses. This ability parameter is similar to the true score parameter in CTT. As mentioned above, the scale for ability is arbitrary, so a z-score metric is typically used. The first IRT assumption then is that a single attribute underlies the item response process. The result is called a unidimensional IRT model.

The second assumption in IRT is that we’ve chosen the correct shape for our item characteristic curve. This implies that we have a choice regarding which item parameters to include, whether only b in the simplest model, b and a in the next model, or b, a, and c in the most complex model. So, in terms of shape, we assume that there is a nonlinear relationship between ability and probability of correct response, and this nonlinear relationship is captured completely by up to three item parameters.

Note that anytime we assume a given item parameter, e.g., the c parameter, is unnecessary in a model, it is fixed to a certain value for all items. For example, in the Rasch and two-parameter IRT models, the c parameter is typically fixed to 0, which means we are assuming that guessing is not an issue. In the Rasch model we also assume that all items discriminate in the same way, and a is typically fixed to 1; then, the only item parameter we estimate is item difficulty.

Common uses

Because of its simplicity and lower sample size requirements, the Rasch model is most commonly used in small-scale achievement and aptitude testing, for example, with assessments developed and used at the district level, or instruments designed for use in research or lower-stakes decision making. The myIGDI measures discussed in Chapter 1 are developed using the Rasch model. The popular MAP tests, published by Northwest Evaluation Association, are also based on a Rasch model. Some consider the Rasch model most appropriate for theoretical reasons. In this case, it is argued that we should seek to develop tests with items that discriminate equally well; items that differ in discrimination should be revised or replaced with ones that discriminate comparably. Others utilize the Rasch model simply as a simplified IRT model, for situations where we can’t obtain the sample sizes needed to accurately estimate differing item discriminations and lower asymptotes. Either way, when using the Rasch model, we should be confident in our assumption that differences between items in discrimination and lower asymptote are negligible.

The two-parameter and three-parameter models are often used in larger-scale testing situations, for example, on high-stakes tests such as the GRE and ACT. The large samples available with these tests support the additional estimation required by these models. And proponents of the two-parameter and three-parameter models often argue that it is unreasonable to assume zero lower asymptote, or equal discriminations across items.

Applications

In terms of the properties of the model itself, as mentioned above, IRT overcomes the CTT limitation of sample and item dependence. As a result, ability estimates from an IRT model should not depend on the sample of items used to estimate ability, and item parameter estimates should not depend on the sample of people used to estimate them. An explanation of how this is possible is beyond the scope of this class. You just need to know that, in theory, when IRT is correctly applied, the resulting parameters are sample and item independent. As a result, they can be generalized across samples for a given population of people and test items.

IRT is useful first in item analysis, where we pilot test a set of items and then examine item difficulty and discrimination, as discussed in Chapter 7. The benefit of IRT over CTT is that we can accumulate difficulty and discrimination statistics for items over multiple samples of people, and they are, in theory, always expressed on the same scale. So, our item analysis results are sample independent. This is especially useful for tests that are maintained across more than one administration. Many admissions tests, for example, have been in use for decades. State tests, as another example, must also maintain comparable item statistics from year to year, since new groups of students take the tests each year.

Item banking refers to the process of storing items for use in future, yet-to-be-developed forms of a test. Because IRT allows us to estimate sample independent item parameters, we can estimate parameters for certain items using pilot data, i.e., before the items are used operationally. This is what happens in a computer adaptive test. For example, the difficulty of a bank of items is known, e.g., from pilot administrations. When you sit down to take the test, an item of known average difficulty can be administered first. If you get the item correct, you are given a more difficult item. The process continues, with the difficulty of the items adapting based on your performance, until the computer is confident it has identified your ability level. In this way, computer adaptive testing relies heavily on IRT.
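To make the adaptive logic concrete, here is a toy sketch of that selection-and-update cycle under a Rasch-type model. Operational CATs estimate ability with more sophisticated methods (e.g., maximum likelihood), so treat this only as an illustration; the item bank and all values are made up.

```python
import numpy as np

rng = np.random.default_rng(8)
bank = rng.uniform(-2.5, 2.5, size=30)  # previously calibrated Rasch difficulties
used = set()
true_theta = 1.0    # the examinee's actual ability, unknown to the algorithm
theta_hat = 0.0     # start near average difficulty
step = 1.0

for _ in range(15):
    # Select the unused item whose difficulty is closest to the current estimate
    remaining = [j for j in range(len(bank)) if j not in used]
    item = min(remaining, key=lambda j: abs(bank[j] - theta_hat))
    used.add(item)

    # Simulate the examinee's response under the Rasch model
    p_correct = 1 / (1 + np.exp(-(true_theta - bank[item])))
    correct = rng.random() < p_correct

    # Crude update: move toward harder items after a correct response and
    # easier items after an incorrect one, with a shrinking step size
    theta_hat += step if correct else -step
    step *= 0.8

print(round(theta_hat, 2))  # should land in the neighborhood of true_theta
```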

Summary and Homework

This chapter provides a brief introduction to IRT, with a comparison to CTT, and details regarding the three traditional, dichotomous, unidimensional IRT models. For additional details on IRT, in comparison to CTT, see Hambleton and Jones (1993). See Harvey and Hammer (1999) for details on IRT in the context of psychological testing.

Learning objectives

1.
Compare and contrast IRT and CTT in terms of their strengths and weaknesses.
2.
Identify the two main assumptions that are made when using a traditional IRT model, regarding dimensionality and functional form or the number of model parameters.
3.
Identify key terms in IRT, including probability of correct response, logistic curve, theta, the IRF (aka ICC), TRF, SEM, and information.
4.
Define the three item parameters and one ability parameter in the traditional IRT models, and describe the role of each in modeling performance.
5.
Distinguish between the 1PL, 2PL, and 3PL IRT models in terms of assumptions made, benefits and limitations, and applications of each.
6.
Evaluate the appropriateness of the 1PL, 2PL, and 3PL in specific applications.
7.
Describe how IRT is utilized in item analysis, test development, item banking, and computer adaptive testing.