Chapter 7
Item Analysis

Introduction

Chapters 5 and 6 covered two topics that rely heavily on statistical analyses of data from educational and psychological measurements. These analyses are used to examine relationships among scores on two or more test forms, in reliability, and among ratings from two or more judges, in interrater reliability. Aside from coefficient alpha, all of the statistical analyses introduced so far focus on composite scores. Item analysis focuses instead on statistical analysis of the individual items that make up these composites.

As discussed in Chapters 3 and 4, test items make up the most basic building blocks of an assessment instrument. Item analysis lets us investigate the quality of these individual building blocks, including in terms of how well they contribute to the whole and improve the validity of our measurement. This chapter extends concepts from Chapters 1, 5, and 6 to analysis of item performance within a CTT framework.

The chapter begins with an overview of item analysis, including some general guidelines for preparing for an item analysis and what we can expect to obtain by analyzing item performance. Next, the chapter covers some issues that arise when assigning score values to individual items. Finally, specific statistical indices will be introduced and their applications will be discussed. These indices include item difficulty or mean performance, item discrimination, internal consistency, differential item functioning, and distractor analysis.

Analyzing Items

Overview

As noted above, item analysis lets us examine the quality of individual test items. Information about individual item quality can help us determine whether or not an item is measuring the content and construct that it was written to measure, and whether or not it is doing so at the appropriate ability level. Because we are discussing item analysis here in the context of CTT, we’ll assume that there is a single construct of interest, perhaps being assessed across multiple related content areas, and that individual items can contribute to or detract from our measurement of that construct by limiting or introducing construct-irrelevant variance in the form of bias and random measurement error.

Bias represents a systematic error with an influence on item performance that can be attributed to an interaction between examinees and some feature of the test. Bias in a test item leads examinees with some known background characteristic, aside from their ability, to perform better or worse on an item simply because of this background characteristic. A common example is the use of scenarios or examples in an item that are more familiar to certain gender or ethnic groups. Differential familiarity with item content can make an item more relevant, engaging, and more easily understood, and can then lead to differential performance, even for examinees of the same ability level. We identify such item bias primarily by using measures of item difficulty and differential item functioning (DIF), discussed below and in Chapter 8.

Bias in a test item indicates that the item is measuring some other construct besides the construct of interest, where systematic differences on the other construct are interpreted as meaningful differences on the construct of interest. The result is a negative impact on the validity of test scores and the corresponding inferences and interpretations. Random measurement error, on the other hand, is not attributed to a specific identifiable source, such as a second construct. Instead, measurement error is inconsistency of measurement at the item level. An item that introduces measurement error detracts from the overall internal consistency of the measure, and this is detected using item discrimination indices and a statistic called alpha-if-item-deleted (AID).

The goal in developing an instrument or scale is to identify bias and inconsistent measurement at the item level prior to administering a final version of our instrument. As we talk about item analysis, remember that the analysis itself is typically carried out in practice using pilot data. Pilot data are gathered prior to or while developing an instrument or scale, and they require at least a preliminary version of the educational or psychological measure. So, we’ve written some items for our measure, and we want to see how well they work.

Ferketich (1991) and others note that the initial “pool” of candidate test items should be at least twice as large as the final number of items needed. So, if you’re dreaming up a test with 100 items on it, you should pilot at least 200 items. That may not be feasible, but it is a best-case scenario, and should at least be followed in large-scale testing. By collecting data on twice as many items as we intend to actually use, we’re acknowledging that, despite our best efforts, many of our preliminary test items may be of low quality, e.g., biased or internally inconsistent, or may address different ability levels or content than intended.

Ferketich (1991) also notes that data should be collected on at least 100 individuals from the population of interest. This too may not be feasible; however, it is essential if we hope to obtain results that will generalize to other samples of individuals. When our sample is not representative, e.g., when it is a convenience sample or when it contains fewer than 100 people, our item analysis results must be interpreted with caution. This goes back to inferences made based on any type of statistic: small samples can lead to erroneous results. Keep in mind that every statistic discussed here has a standard error and confidence interval associated with it, whether it is directly examined or not. Note also that bias and measurement error arise in addition to this standard error or sampling error. Furthermore, we cannot identify bias in our test questions without representative data from our intended population. Thus, adequate sampling in the pilot study phase is critical.

One more thing to note as we go over the basics of item analysis: the statistics we’ll discuss here are based on the CTT model of test performance. In Chapter 8 we’ll discuss the more complex item response theory (IRT) and its applications in item analysis.

Scoring

In Chapter 1, which covered measurement, scales, and scoring, we briefly discussed the difference between dichotomous and polytomous scoring. In each case, we must assign value to each possible observed response to an item. This value is taken to indicate a difference in the construct underlying our measure. For dichotomous items, we usually assign a score of 1 to a correct response, and a zero otherwise. Polytomous items involve responses that are correct to differing degrees, e.g., incorrect (0), somewhat correct (1), and completely correct (2).
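To make this concrete, here is a minimal Python sketch of dichotomous scoring against an answer key; the key, responses, and rubric labels are hypothetical and used only for illustration.

```python
# Minimal sketch of dichotomous scoring: hypothetical key and responses.
key = ["B", "D", "A", "C"]          # correct options for four selected-response items
responses = ["B", "D", "C", "C"]    # one examinee's chosen options

# Score 1 for a match with the key, 0 otherwise.
scores = [1 if resp == correct else 0 for resp, correct in zip(responses, key)]
print(scores)       # [1, 1, 0, 1]
print(sum(scores))  # total score of 3

# A polytomous item simply allows more score points, e.g., a rubric of 0, 1, or 2.
rubric = {"incorrect": 0, "somewhat correct": 1, "completely correct": 2}
print(rubric["somewhat correct"])  # 1
```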

In psychological testing, we replace “correctness” from the educational context with “amount” of the trait or attribute of interest. So, a dichotomous item might involve a yes/no response, where “yes” is taken to mean the construct is present in the individual, and it is given a score of 1, whereas “no” is taken to mean the construct is not present, and it is given a score of 0. Polytomous items then allow for different amounts of the construct to be present.

Think back to the construct you identified in earlier examples, and consider what type of items you would be using (e.g., selected-response, constructed-response, performance assessment, rating scale) and what type of scoring would be used at the item level.

Although it seems standard to use dichotomous scoring of 0/1, and polytomous scoring of 0, 1, 2, etc., these values should not be taken for granted. The score assigned to a particular response determines how much a given item will contribute to any total score that is later calculated across items. In educational testing, the typical scoring schemes are popular because they are simple. Other scoring schemes could also be used to give certain items more or less weight when calculating the total.

For example, a polytomous item could be scored using partial credit, where incorrect is scored as 0, completely correct is given 1, and levels of correctness are assigned decimal values, e.g., .5. In psychological testing, the center of the rating scale could be given a score of 0, and the tails could decrease and increase from there. For example, if a rating scale is used to measure levels of agreement, 0 could be assigned to a “neutral” rating, and -2 and -1 might correspond to “strongly disagree” and “disagree,” whereas 1 and 2 correspond to “agree” and “strongly agree.” Changing the values assigned to item responses in this way can help improve the interpretation of summary results.

Scoring item responses also requires that some direction be given to the correctness or amount of trait involved. Thus, at the item level, we are at least using an ordinal scale. In educational testing, the direction is simple: increases in correctness correspond to increases in points. In psychological testing, reverse scoring may be necessary.

Consider the two examples presented below. These might come up on an end-of-semester evaluation for this course. Think about what construct is being measured here, and what type of scoring scheme you would use for each item.

Rate the degree to which you agree/disagree with the following statements:

If these items, and potentially others, are used to calculate an overall total score to represent satisfaction with this course (that’s one way of stating the construct), we need to decide (a) where we want average satisfaction to fall on our scale and (b) what direction we want to associate with high and low satisfaction. In terms of (a), the center is actually arbitrary, but something between 0 and 5 seems to make the most sense. Then, in terms of (b) it makes sense to have higher satisfaction correspond to higher scores. Given these choices, what numerical values would you assign to SD, D, A, and SA? And which item from this example would then require reverse scoring?
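One way to implement these decisions in code is sketched below, assuming we assign SD = -2 through SA = 2 (one reasonable choice, not the only one) and flip the sign for reverse-scored items; the response vectors are hypothetical.

```python
# One possible value assignment for the agreement scale, centered near 0.
forward = {"SD": -2, "D": -1, "A": 1, "SA": 2}

# Reverse scoring flips the sign, so that disagreement with a negatively
# worded item counts toward higher satisfaction.
reverse = {label: -value for label, value in forward.items()}

# Hypothetical responses to a positively worded and a negatively worded item.
item_positive = ["SA", "A", "D", "SA"]
item_negative = ["SD", "D", "A", "SD"]

scored_positive = [forward[r] for r in item_positive]
scored_negative = [reverse[r] for r in item_negative]
print(scored_positive)  # [2, 1, -1, 2]
print(scored_negative)  # [2, 1, -1, 2]
```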

Item Difficulty

Once we have established a scoring scheme for each item in our test, and we have collected response data from a sample of individuals, we can start talking about the first statistic typically obtained in an item analysis: item difficulty, i.e., how easy or difficult each item is for our sample. In CTT, the item difficulty is simply the mean score for an item. For dichotomous (1/0) items, this mean is referred to as a p-value, since it represents the proportion of examinees getting the item correct. With polytomous items, the mean is simply the average score.

In IRT, which we’ll cover in Chapter 8, item difficulty is instead estimated as the ability level required to have a 50% chance of getting the item correct.

Note that item difficulty is most relevant to educational tests. In psychological testing, we refer to means on items simply as means, and we also often look at frequencies for each response choice, especially if using a rating scale. In this way, we consider the distribution of responses at the item level rather than just the central tendency.

Table 7.1 contains a simple example showing scores on two dichotomous items and one polytomous item. You may recognize the examinees in this data set as characters from Harry Potter. The scores were assigned, after much debate, in a previous section of the course. We decided that the scores represent some general measure of “magicalness,” where Dobby and Ginny both perform well, and Malfoy is clearly the least magical of the group.



Table 7.1: Hypothetical Example of Scoring and Item Difficulty for a Magicalness Scale
Person/Creature Item 1 Item 2 Item 3 Total
Dobby 1 1 3
Ginny 1 1 4
Harry 0 1 3
Snape 0 0 2
Ron 0 1 2
Malfoy 0 0 0
Total
Mean

Take a minute to calculate the total and then the average for each item. What do the averages tell you about each item? Which item is easiest and which is hardest? You should also calculate the total score for each individual. What do the total scores tell you about each individual?
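If you want to check your work, here is a minimal Python sketch that computes item totals, item means (the CTT difficulty values), and person totals from the Table 7.1 scores.

```python
# Scores from Table 7.1: each row is one person/creature, columns are items 1-3.
scores = {
    "Dobby":  [1, 1, 3],
    "Ginny":  [1, 1, 4],
    "Harry":  [0, 1, 3],
    "Snape":  [0, 0, 2],
    "Ron":    [0, 1, 2],
    "Malfoy": [0, 0, 0],
}

n_people = len(scores)

# Item difficulty in CTT is the item mean; for dichotomous items this is the p-value.
for item in range(3):
    item_scores = [row[item] for row in scores.values()]
    print(f"Item {item + 1}: total = {sum(item_scores)}, mean = {sum(item_scores) / n_people:.2f}")

# Total score for each person/creature.
for person, row in scores.items():
    print(f"{person}: total = {sum(row)}")
```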

There’s really not much else to say about scoring and item difficulty except this: when developing items we often want to make sure we target a specific ability or trait level. Note that the mean on item 1 is half the mean for item 2. If our test is intended to assess wizards of lower magical ability, it should be clear that item 2 is preferable to item 1; if we had to choose between them, we’d choose item 2. We also need to consider the overall distribution of item difficulties for our final test. In educational testing, we typically need to represent the entire ability scale, so that we can measure students of low and high ability. As a result, we need to make sure our item analysis produces items of both high and low difficulty.

To lead us into the next item analysis consideration, discrimination, suppose a fourth item were added to our “magicalness” test, and Ron and Malfoy were the only ones to get the item correct. Why is this result unexpected? And what might this unexpected result indicate about the item?

Item Discrimination

To answer the question above, if we’re developing a test of magical ability, we want each of the items on our test to assess magical ability. When low ability individuals (i.e., Ron and Malfoy) score high on a given item, and high ability individuals do not, something is clearly wrong with the item. Most likely, it has been scored incorrectly, perhaps even miskeyed, that is, scored with the wrong answer keyed as correct. If the item is correctly keyed and scored, we would expect there to be a positive relationship between performance on the item and level on the construct. Being high on the construct should correspond to high item performance, and being low on the construct should correspond to low item performance.

This relationship between item response and level on the construct is referred to as item discrimination. Good items should discriminate between individuals of high and low ability, so that knowing an individual’s item score tells us something about where they are on the construct. Consider the fourth item discussed (but not shown) above. Does Dobby’s score of 0 on item 4 provide a good indication of his magical ability? Not really. Does his score on item 1? Yes. What we need is some index of this discrimination power that takes into account everyone’s scores on the item and on the construct when considering the relationship between the two.

Items with high discrimination are strongly, positively related to the construct of interest, whereas items with low discrimination are not, and items with negative discrimination, as in item 4 from the magicalness example, may relate strongly but negatively with the construct.

Item discrimination is measured by comparing performance on an item for different groups of people, where groups are defined based on some measure of the construct. In the olden days, groups were simply defined as “high” and “low” using a cutoff on the construct to distinguish the two. Using the example above, if we somehow knew the true abilities of our group of wizards, and we split them into two ability groups, we could calculate and compare p-values for a given item for each group. If an item were highly discriminating, how would we expect these p-values to look? And if an item were not highly discriminating, how would we expect these p-values to look?

Although calculating p-values for different groups of individuals is still a useful approach to examining item discrimination, another statistic is typically used: the correlation between item responses and construct scores. The problem is, scores on the construct are hard to come by, which is why we developed our measure using CTT in the first place. In the absence of construct scores, total scores are typically used as a proxy. The resulting correlation is referred to as an item-total correlation (ITC). When responses on the item are dichotomously scored, it is also sometimes called a point-biserial correlation.

Note that when you correlate something with itself, the result is a correlation of 1, and when you correlate a component score with a composite that includes that component, the correlation will be inflated simply because the component appears on both sides of the relationship. Correlations between item responses and total scores can be “corrected” for this inflation simply by excluding the given item when calculating the total. So, to estimate item discrimination on a three-item test, we would calculate three total scores, each one excluding one of the three items. Then, we’d calculate the correlation between a given item response and the total score obtained by excluding that item. The result is referred to as a corrected item-total correlation (CITC).
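Here is a minimal sketch of how the ITC and CITC could be computed with numpy, using the Table 7.1 scores with a hypothetical item 4 appended (only Ron and Malfoy answer it correctly, as described above). This is a sketch of the computation itself, not an exact reproduction of Table 7.2.

```python
import numpy as np

# Hypothetical magicalness scores: rows are examinees, columns are items 1-4.
# Item 4 is the miskeyed item that only Ron and Malfoy answered "correctly."
X = np.array([
    [1, 1, 3, 0],  # Dobby
    [1, 1, 4, 0],  # Ginny
    [0, 1, 3, 0],  # Harry
    [0, 0, 2, 0],  # Snape
    [0, 1, 2, 1],  # Ron
    [0, 0, 0, 1],  # Malfoy
])

total = X.sum(axis=1)

for j in range(X.shape[1]):
    # Item-total correlation: correlate item scores with the total score.
    itc = np.corrcoef(X[:, j], total)[0, 1]
    # Corrected item-total correlation: exclude the item from its own total.
    citc = np.corrcoef(X[:, j], total - X[:, j])[0, 1]
    print(f"Item {j + 1}: ITC = {itc:.2f}, CITC = {citc:.2f}")
```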



Table 7.2: Means and Item Discriminations for the Hypothetical Magicalness Scale
Item Mean ITC CITC AID
1 0.33 0.75 0.60 0.13
2 0.67 0.93 0.89 -0.08
3 2.17 0.95 0.61 -0.30
4 0.33 -0.37 -0.56 0.74

Table 7.2 contains the results of a simple item analysis of our “magicalness” measure. Each row contains the mean, ITC, CITC, and alpha-if-item-deleted for an item. The means should correspond to what you calculated on your own. Item 4 is the item we invented to have poor discrimination.

Note that the ITCs are positive and strong for the first three items. This tells us that scores on each item are positively related to our proxy measure of the construct, the total score. The polytomous item, item 3, discriminates the best. Note also that the CITCs are lower than the uncorrected ITCs, especially for items 1 and 3. Why is this the case? After making the correction, item 2 now seems to discriminate best.

Although there is not a clear guideline on acceptable or ideal levels of discrimination, the literature sometimes identifies 0.30 as a minimum. Otherwise, the higher the discrimination, the better. Note that, in practice, items with discriminations below 0.30 may still contribute to the scale, and cutoffs below 0.30 are sometimes used when accepting items for operational use.

When conducting an item analysis, we also need to consider the internal consistency reliability of our test. Coefficient alpha, from Chapter 5, is a measure of internal consistency. It tells us how well the items “hang together” as a set. I’m not sure who invented that term, but it seems to have stuck in the literature. “Hang together” refers to how consistently the item responses change in similar ways. A high coefficient alpha tells us that people tend to respond in similar ways from one item to the next. If coefficient alpha were exactly 1, we would know that each person responded in the same (rank ordered) way across all items.

Recall from Chapter 5 that coefficient alpha treats each item within a test as a miniature measure of the same construct. So, we use alpha in an item analysis to identify items that contribute to the internal consistency of the item set. Items that detract from the internal consistency should be removed.

The last column in Table 7.2 includes the AID for each item in the scale. The AID tells us how coefficient alpha would change if a given item were removed from the calculation of internal consistency. Note that this is not the same as discrimination. Instead, the AID estimates how much an item directly contributes to the internal consistency reliability of the set of items.

The AID for each item is interpreted in comparison to the alpha for the full set of items. In this example, coefficient alpha is 0.41. If we remove any of the first three items, the AID tells us that alpha would decrease substantially, even become negative. However, if we remove the fourth item, the AID tells us that alpha would increase to 0.74! If our goal is to build an internally consistent test of magical ability, we need to get rid of item 4.
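Here is a minimal sketch of how coefficient alpha and the AID could be computed, using the standard alpha formula and the same hypothetical four-item score matrix as in the discrimination sketch above; the exact values reported in Table 7.2 may differ slightly.

```python
import numpy as np

def coefficient_alpha(scores):
    """Alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical score matrix, as in the ITC/CITC sketch above.
X = np.array([
    [1, 1, 3, 0],
    [1, 1, 4, 0],
    [0, 1, 3, 0],
    [0, 0, 2, 0],
    [0, 1, 2, 1],
    [0, 0, 0, 1],
])

print(f"Alpha, all items: {coefficient_alpha(X):.2f}")

# Alpha-if-item-deleted: recompute alpha with one item removed at a time.
for j in range(X.shape[1]):
    reduced = np.delete(X, j, axis=1)
    print(f"Alpha if item {j + 1} deleted: {coefficient_alpha(reduced):.2f}")
```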

Before moving on, make sure you can articulate what the ITC, CITC, and AID tell us about an item. Although they tend to be strongly inversely related (items with low or negative ITC/CITC tend to have a higher AID), they are not the same thing. For example, what does the ITC of -0.37 tell you about responses on item 4? In your response to this question, you should reference the magical abilities of the examinees.



Table 7.3: Item Analysis Results for a BFI Agreeableness Scale
Item Mean SD ITC CITC AID
1 4.59 1.40 0.58 0.31 0.72
2 4.80 1.18 0.73 0.56 0.62
3 4.60 1.30 0.76 0.59 0.60
4 4.68 1.49 0.65 0.39 0.69
5 4.55 1.26 0.69 0.49 0.64

Let’s look quickly at results for an affective measure. Table 7.3 contains results from an item analysis for the agreeableness items on the BFI referenced in Chapter 4. The data come from a test administration to about 2700 people. Reminder: each item in the agreeableness scale is a statement, and people respond by indicating on a six-point rating scale how accurate the statement is as a description of them. The scale anchors are 1 = Very Inaccurate, 2 = Moderately Inaccurate, 3 = Slightly Inaccurate, 4 = Slightly Accurate, 5 = Moderately Accurate, 6 = Very Accurate. Here’s the text for the five items:

Am indifferent to the feelings of others
Inquire about others’ well-being
Know how to comfort others
Love children
Make people feel at ease

Note that these are not pilot data. These items were administered as part of an operational form of the instrument. However, we can still use the item analysis results to examine the quality of each item and the internal consistency of the scale. Alpha for all five items is 0.70. Make sure you are comfortable interpreting these results. For example, which item could be removed to improve the quality of the scale? What information is your choice of this item based on? Do discrimination and AID agree, or produce conflicting results?

Distractor Analysis

Distractor analysis involves the examination of item responses by ability group for each option in a selected-response item. In item analysis and scoring, we typically focus on correct responses, which, in educational testing, provide information about what an individual knows or can do. In distractor analysis, we also look at the incorrect responses.

Distractor analysis involves the calculation of bivariate frequency distributions, treating unscored item responses as categorical variables. Frequencies for each response option should follow predictable trends across ability groups. Consider the types of information that an incorrect response provides, for example, about an individual and about an item. Here are two related questions that we seek to address with a distractor analysis: who do we expect will get an item incorrect, and what does it mean if high ability people choose an incorrect option?



Table 7.4: Distractor Analysis for a Selected-Response Math Question, by Ability Group
Response Low Med High Total
A 36 21 8 65
B 16 16 2 34
C 10 13 24 47
D 2 1 0 3
Total 64 51 34 149

Table 7.4 contains distractor analysis results for a selected-response test item. The data come from a test of math ability given to preservice math teachers. Each row corresponds to an option for the item. So, there are four options, A through D. The columns show how many examinees in each of three ability groups chose each response option. For example, 36 low ability examinees chose option A. No high ability examinees chose option D. Based on these results, what would you expect the correct response to be?

Our main goal in distractor analysis is to identify dysfunctional and/or useless distractors, ones which do not provide us with any useful information about examinees. Do you see any options that don’t function well? The correct response for this item is C, and this option appears to be functioning well; high ability examinees chose it more often than medium or low ability examinees. If low ability examinees chose the correct option more often than higher ability examinees, there would be a problem with the item.

Next, look at options A and B. Both are incorrect, and low ability examinees choose them most often, followed by medium ability, and then high. Few high ability examinees choose A and B. This is the trend that we expect to see with functional distractors. Finally, look at option D. Only 2 low ability examinees and 1 medium ability examinee chose it. Is this distractor fulfilling its purpose? Not really, because it isn’t distracting anyone, especially the low ability examinees whom it is designed to distract. Option D isn’t really telling us any new or useful information. We could easily remove the option, saving examinees a few seconds of mental processing, and the test shouldn’t be negatively impacted.
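As a sketch of the computation behind a table like Table 7.4, the following shows how a bivariate frequency table of response option by ability group could be built with pandas; the raw response and ability vectors are hypothetical.

```python
import pandas as pd

# Hypothetical raw data: each examinee's chosen option and ability group.
df = pd.DataFrame({
    "response": ["A", "C", "B", "C", "A", "D", "C", "B"],
    "ability":  ["Low", "High", "Low", "High", "Med", "Low", "Med", "Med"],
})

# Bivariate frequency distribution of unscored responses by ability group,
# as in Table 7.4; margins=True adds the row and column totals.
table = pd.crosstab(df["response"], df["ability"], margins=True)
print(table)
```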

Differential Item Functioning

Consider a test question where students of the same ability level respond differently based on certain demographic or background features pertaining to the examinee but not relating directly to ability. In distractor analysis, we examine categorical frequency distributions for each response option by ability groups. In DIF, we examine these same categorical frequency distributions but by these different demographic groups, where all examinees in the analysis have the same ability. DIF in a test item is evidence of potential bias in the item, as, after controlling for ability, demographic variables should not produce significant differences in examinee responses.

A variety of statistics are used to estimate DIF in educational and psychological testing. We will not review them all here. Instead, we will focus on the concept of DIF and how it is displayed in item level performance in general. We’ll return to this discussion in Chapter 8, within IRT.

DIF is most often based on a statistical model that compares item difficulty, or the mean performance on an item, for two groups of examinees, after controlling for their ability levels. Here, controlling for refers to either statistical or experimental control. The point is that we in some way remove the effects of ability on mean performance per group, so that we can then examine any leftover performance differences. Testing for the significance of DIF can be done, e.g., using IRT, logistic regression, or chi-square statistics. Once DIF is identified for an item, the item itself is examined for potential sources of differential performance by subgroup. The item is either rewritten or deleted.
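As one example, a logistic regression DIF check predicts item performance from total score (the ability control) and group membership; a significant group effect after controlling for total score suggests uniform DIF. The sketch below assumes statsmodels is available and uses simulated data; the column names item, total, and group are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: dichotomous item score, total test score, and a 0/1 group indicator.
rng = np.random.default_rng(1)
n = 500
total = rng.integers(0, 41, size=n)   # total scores on a 40-item test
group = rng.integers(0, 2, size=n)    # e.g., 0 = reference group, 1 = focal group

# Simulate an item whose performance depends on total score and, to a lesser degree,
# on group membership (i.e., an item exhibiting uniform DIF).
logit = -4 + 0.2 * total - 0.5 * group
p = 1 / (1 + np.exp(-logit))
item = rng.binomial(1, p)

df = pd.DataFrame({"item": item, "total": total, "group": group})

# Uniform DIF check: does group predict item performance after controlling for total score?
model = smf.logit("item ~ total + group", data=df).fit(disp=False)
print(model.summary())  # a significant 'group' coefficient flags potential uniform DIF
```

Adding a total score by group interaction term to the model would extend this sketch to a check for non-uniform DIF.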

Summary and Homework

This chapter provided a brief introduction to item analysis, including some guidelines for collecting pilot data, and five types of statistics used to examine item level performance, including item difficulty, item discrimination, internal consistency, distractor analysis, and differential item functioning. These statistics are used in combination to identify items that contribute or detract from the quality of a measure.

Item analysis, as described in this chapter, is based on a CTT model of test performance. We have assumed that a single construct is being measured, and that item analysis results are based on a representative sample from our population of test takers. Chapter 8 builds on the concepts introduced here by extending them to the more complex but also more popular IRT model of test performance.

Learning objectives

1.
Explain how item bias and measurement error negatively impact the quality of an item, and how item analysis, in general, can be used to address these issues.
2.
Describe general guidelines for collecting pilot data for item analysis, including how following these guidelines can improve item analysis results.
3.
Identify items that may have been keyed or scored incorrectly.
4.
Recode variables to reverse their scoring or keyed direction.
5.
Calculate and interpret item difficulties and compare items in terms of difficulty.
6.
Calculate and interpret item discrimination indices, and describe what they represent and how they are used in item analysis.
7.
Describe the relationship between item difficulty and item discrimination and identify the practical implications of this relationship.
8.
Calculate and interpret alpha-if-item-deleted.
9.
Utilize item analysis to distinguish between items that function well in a set and items that do not.
10.
Remove items from an item set to achieve a target level of reliability.
11.
Evaluate selected-response options using distractor analysis.

Discussion Questions

1.
Why must we be cautious about interpreting item analysis results based on pilot data?
2.
For an item with high discrimination, how should p-values on the item compare for two groups known to differ in their true mean abilities?
3.
Why does discrimination usually decrease for CITC as compared with ITC?
4.
What features of certain response options, in terms of the item content itself, would make them stand out as problematic within a distractor analysis?