# Chapter 2 Measurement, Scales, and Scoring

Measurement is never better than the empirical operations by which it is carried out, and operations range from bad to good.
— Stanley Stevens, On the Theory of Scales of Measurement

The Preface to this book introduced a few perspectives on testing, with an emphasis on validity as a measure of the effectiveness of test scores. Validity is an overarching issue that encompasses all stages in the test development and administration processes, from blueprint to bubble sheet, including the stage wherein we choose the empirical operations that will assign numbers or labels to test takers based on their performance or responses.

In this chapter, we examine the measurement process at its most fundamental level. We’ll define three requirements for measurement, and consider the simplicity of physical measurement in comparison to the complexities of educational and psychological measurement, where the thing we measure is often intangible and best represented using item sets and composite scores. Along the way, we’ll describe the four types of measurement scales, with examples from the PISA data set, and we’ll look into why Stevens (1946) concluded that not all scales are created equal. Last are scoring and score referencing, including examples of norm and criterion referencing.

Learning objectives

1. Define the process of measurement.
2. Define the term construct and describe how constructs are used in measurement, with examples.
3. Compare and contrast measurement scales, including nominal, ordinal, interval, and ratio, with examples, and identify their use in context.
4. Compare and contrast dichotomous and polytomous scoring.
5. Describe how rating scales are used to create composite scores.
6. Explain the benefits of composites over component scores.
7. Create a generic measurement model and define its components.
8. Define norm referencing and identify contexts in which it is appropriate.
9. Compare three examples of norm referencing: grade, age, and percentile norms.
10. Define criterion referencing and identify contexts in which it is appropriate.
11. Describe how standards and performance levels are used in criterion referencing with standardized state tests.
12. Compare and contrast norm and criterion score referencing, and identify their uses in context.

In this chapter, we’ll analyze and create plots with PISA09 data using the epmr and ggplot2 packages. We’ll also analyze some data on US states, contained within the datasets package automatically included with R.

# R setup for this chapter
# Required packages are assumed to be installed - see chapter 1
library("epmr")
library("ggplot2")
# Functions we'll use in this chapter
# data(), class(), factor(), c(), from chapter 1
# head() to print the first six rows or values in an object
# paste0() for pasting together text and using it to index a data set
# apply() for applying a function over rows or columns of a data set
# tapply() for applying a function over groups
# dstudy() from the epmr package for getting descriptives
# ggplot(), aes(), and geom_boxplot() for plotting
# round(), nrow(), and with() for examining data
# We'll use a data set included in the base R packages called state.x77

## 2.1 What is measurement?

### 2.1.1 How do we define it?

We usually define the term measurement as the assignment of values to objects according to some system of rules. This definition originates with Stevens (1946), who presented what have become the four traditional scales or types of measurement. We’ll talk about these shortly. For now, let’s focus on the general measurement process, which involves giving an object of measurement, the person or thing being measured, a value that represents something about it.

Measurement is happening all the time, all around us. Daily, we measure what we eat, where we go, and what we do. For example, drink sizes are measured using categories like tall, grande, and venti. A jog or a commute is measured in miles or kilometers. We measure the temperature of our homes, the air pressure in our tires, and the carbon dioxide in our atmosphere. The wearable technology you might have strapped to your wrist could be monitoring your lack of movement and decreasing heart rate as you doze off reading this sentence. After you wake up, you might check your watch and measure the length of your nap in minutes or hours.

These are all examples of physical measurement. In each example, you should be able to identify 1) the object of measurement, 2) the property or quality that’s being measured for it, and 3) the kinds of values used to represent amounts of this property or quality. The property or quality that’s being measured for an object is called the variable. The kinds of values we assign to an object, for example, grams or degrees Celsius or beats per minute, are referred to as the units of measurement that are captured within that variable.

Let’s look at some examples of measurement from the state data sets in R. The object state.x77 contains data on eight variables, with the fifty US states as the objects of measurement. For details on where the data come from, see the help file ?state.

# Load state data
data(state)
# Print first 6 rows, all columns
head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766

Take a minute to consider what the variables in state.x77 are measuring, and what the units of measurement are for these variables. For example, state.x77[, "Population"] contains population estimates from 1975 for each state, expressed in thousands. So, state.x77["Wisconsin", "Population"] gives us 4589, or a population of 4589 thousand people. What other variables from state.x77 are measured as simple counts? To practice what you learned in Chapter 1, try to convert the illiteracy rates in state.x77[, "Illiteracy"] from percentages to counts for each state. One approach is sketched below.
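Here is one minimal sketch of that conversion. It assumes, per the ?state help file, that Illiteracy is a percentage of the total population, which is itself reported in thousands:

# Illiteracy is a percentage and Population is in thousands of people
# Approximate number of illiterate residents per state, in thousands
illiterate <- state.x77[, "Population"] * state.x77[, "Illiteracy"] / 100
# Print the first six values, rounded
head(round(illiterate, 1))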

dstudy(PISA09$atotal)
##
## Descriptive Study
##
##   mean median   sd  skew kurt min max     n na
## x 12.3     13 2.71 -1.66 8.12   0  16 44878  0

In Chapter 4 we will address rating scales in more detail. We’ll cover issues in the construction and administration of rating categories. Here, we are more concerned with the benefits of using composite scale scores.

### 2.3.4 Composites versus components

A composite score is simply the result of some combination of separate subscores, referred to as components. Thus far, we have created total scores in R. Most often, we will deal with either total scores or factor scores on a test, where individual items make up the components. Factor scores refer to scores obtained from some measurement model, such as a classical test theory model, discussed in Chapter 5, or an item response theory model, discussed in Chapter 7. Factor analysis is covered in Chapter 8. We will also encounter composite scores based on totals and means from rating scale items. In each case, the composite is going to be preferable to any individual component for the following reasons.

Composite scores are preferable from a statistical standpoint because they tend to provide a more reliable and valid measure of our construct. Composites are more reliable and valid because they combine information from multiple smaller, repeated measures of the construct. These smaller components may each be limited in certain ways, or may only present a small piece of the big picture; when combined, the resulting score is more comprehensive and more easily reproduced in subsequent measurements. In Chapter 5, we’ll learn more about why reliability is expected to increase, in theory, as we increase the number of items in our composite.

For example, when measuring a construct such as attitude toward animal rights, a single item would only provide information about a specific instance of the issue. Consider the example survey items presented by Mathews and Herzog (1997):

The Animal Attitude Scale (AAS) assesses individual differences in attitudes toward the treatment of animals… It is composed of 29 items which subjects rate on a five-point Likert scale (strongly agree to strongly disagree). Sample items include, “I do not think that there is anything wrong with using animals in medical research,” “It is morally wrong to hunt wild animals just for sport,” and “I would probably continue to use a product that I liked even though I know that its development caused pain to laboratory animals.” [p. 171]

By themselves, any one of these items may not reflect the full construct that we are trying to measure. A person may strongly support animal rights, except in the case of medical research. Or a person may define the phrase “that I liked,” from the third example question, in different ways, so that this individual question would produce different results for people who might actually be similar in their regard for animals. A composite score helps to average out the idiosyncrasies of individual items. (Side note from this study: a regression model showed that 25% of the variance in attitude toward animals was accounted for by gender and a personality measure of sensitivity.)

Learning check: Suppose a composite score is calculated based on total scores across the 29 AAS items. What type of scoring would need to first take place with the items themselves? What would be the range for total scores?
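To make the totaling concrete, here is a minimal sketch of composite scoring over rating scale responses. The data frame and item names below are hypothetical stand-ins, not the actual AAS data:

# Hypothetical responses from three people to three 5-point rating items
ratings <- data.frame(item1 = c(5, 2, 4),
  item2 = c(4, 1, 5),
  item3 = c(5, 3, 4))
# Total composites, calculated over rows with apply()
apply(ratings, 1, sum)
# Mean composites over the same rows
apply(ratings, 1, mean)

When there are no missing responses, totals and means rank people identically; choosing between them is mainly a question of the reporting scale.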
The simpler methods for creating composites, by averaging and totaling across items, are used with smaller-scale instruments to facilitate scoring and score reporting. A major drawback to simple averages and totals is the lack of interval properties in the resulting scales. Another drawback is the presence of measurement error, or random noise, within average or total scores. To avoid these and other drawbacks, many instruments, including large-scale educational tests and psychological measures, are scaled using measurement models.

## 2.4 Measurement models

Whereas a simple sum or average over a set of items lets each item contribute the same amount to the overall score, more complex measurement models can be used to estimate the different contributions of individual items to the underlying construct. These contributions can be examined in a variety of ways, as discussed in Chapters 5, 7, and 8. Together, they can provide useful information about the quality of a measure, as they help us understand the relationship between our operationalization of the construct, in terms of individual items, and the construct itself.

Measurement models represent an unobservable construct by formally incorporating a measurement theory into the measurement process. We will review two formal measurement theories in this book: classical test theory, presented in Chapter 5, and item response theory, presented in Chapter 7 (for a comparison, see Hambleton and Jones 1993). For now, we’ll just look at the basics of what a measurement model does.

Figure 2.3 contains a visual representation of a simple measurement model where the underlying construct of sociability, shown in an oval, causes, in part, the observed responses to a set of three questions, shown in rectangles as Item 1, Item 2, and Item 3. Unobservable quantities in a measurement model are typically represented by ovals, and observable quantities by rectangles. Causation is then represented by the arrows which point from the construct to the item responses. The numbers over each arrow from the construct are the scaled factor loadings reported in Nelson et al. (2010), which represent the strength of the relationship between the items and the construct which they together define. As with a correlation coefficient, the larger the factor loading, the stronger the relationship. Thus, item 1 has the strongest relationship with the sociability factor, and item 3 has the weakest.

The other unobserved quantities in Figure 2.3 are the error terms, in the circles, which also impact responses on the three items. Without arrows linking the error terms from one to another, the model assumes that errors are independent and unrelated across items. In this case, any influence on a response that does not come from the common factor of sociability is attributed to measurement error. Note that the model in Figure 2.3 breaks down our observed variables, the scored item responses, into two unrelated sources of variability: one based on the commonly measured construct, and the other based on individual item errors. In theory, these errors are still present in our total scores, but they are extracted from the construct scores produced by a measurement model. Thus, measurement models provide a more reliable measure of the construct.

Learning check: Describe the different parts of the sociability measurement model. What do the ovals, rectangles, and arrows represent?
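In equation form, the model in Figure 2.3 can be sketched generically as follows. This notation is common factor-model shorthand, not something specific to Nelson et al. (2010):

$$X_j = \lambda_j F + e_j$$

Here, $X_j$ is the observed response to item $j$, $F$ is the unobserved sociability construct, $\lambda_j$ is the factor loading for item $j$, and $e_j$ is the item-specific error term. The assumption that the $e_j$ are unrelated to each other and to $F$ is what allows the model to split observed score variability into construct and error components.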
Models such as the one in Figure 2.3 are referred to as confirmatory factor analysis models, because we propose a given structure for the relationships between constructs, error, and observations, and seek to confirm it by placing certain constraints on the relationships we estimate. In Chapter 8, we’ll discuss these along with exploratory models, where we aren’t certain how many underlying constructs are causing responses.

## 2.5 Score referencing

Now that we’ve discussed the measurement process, we can go over some common methods for giving meaning to the scores that our measures produce. These methods are referred to as norm and criterion score referencing. Each is discussed briefly below, with examples.

### 2.5.1 Norm referencing

Norm referencing gives meaning to scores by comparing them to values for a specific norm group. For example, when my kids bring home their standardized test results from school, their scores in each subject area, math and reading, are given meaning by comparing them to the distribution of scores for students across the state. A score of 22 means very little to a parent who does not have access to the test itself. However, a percentile score of 90 indicates that a student scored at or above 90% of the students in the norming group, regardless of what percentage of the test questions they answered correctly.

Norms are also frequently encountered in admissions testing. If you took something like the ACT or SAT, college admissions exams used in the US, or the GRE, an admissions test for graduate school, you’re probably familiar with the ambiguous score scales these exams use in reporting. Each scale is based on a conversion of your actual test scores to a scale that is intentionally difficult or impossible to interpret on its own. In a way, the objective in this rescaling of scores is to force you to rely on the norm referencing provided in your score report. The ACT scales range from 1 to 36, but a score of 20 on the math section doesn’t tell you a lot about how much math you know or can do. Instead, by referencing the published norms, a score of 20 tells you that you scored around the 50th percentile for all test takers.

The two examples above involve simple percentile norms, where scores are compared to the full score distribution for a given norm group. Two other common types of norm referencing are grade and age norms, which are obtained by estimating the typical or average performance on a test by grade level or age. For example, we can give meaning to PISA09 reading scores by comparing them to the medians by grade.

# tapply() is like apply() but instead of specifying rows or columns of
# a matrix, we provide an index variable. Here, median() will be applied
# over subsets of rtotal by grade. with() is used to subset only German
# students.
with(PISA09[PISA09$cnt == "DEU", ],
tapply(rtotal, grade, median, na.rm = TRUE))
##    7    8    9   10   11   12
##  1.5  3.0  5.0  7.0 10.0  9.0
# Most German students are in 9th grade. Medians aren't as useful for
# grades 7, 11, and 12.
table(PISA09$grade[PISA09$cnt == "DEU"])
##
##   7   8   9  10  11  12
##  12 155 771 476   5   1

A reading score of 6, for example, would be given a grade norm of 9, as it exceeds the median score for 9th graders but not 10th graders. In practice, grade norms are reported using decimals that capture the month within the school year as well. For example, a 9.6 would indicate that a student’s reading score is at the median performance of students in their sixth month of 9th grade. These normative scores by grade are referred to as grade equivalents. Age norms and age equivalents are calculated in the same way, but using age as the indexing variable.

Again, norms give meaning to a score by comparing it to the score distribution for a particular norming group. Box plots can be used to visualize a score distribution based on the 25th, 50th, and 75th percentiles, along with any outliers.

ggplot(PISA09[PISA09$cnt == "DEU" & PISA09$grade %in% 8:10, ],
       aes(x = factor(grade), y = rtotal)) +
  geom_boxplot()
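Percentile norms themselves are simple to compute. Here is a minimal sketch that finds the percentile rank of a reading total score of 6 among German students, that is, the percentage of students scoring at or below 6:

# Reading total scores for German students
rtot_deu <- PISA09$rtotal[PISA09$cnt == "DEU"]
# Percentage of students scoring at or below 6
round(mean(rtot_deu <= 6, na.rm = TRUE) * 100, 1)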

### 2.5.2 Criterion referencing

The main limitation of norm referencing is that it only describes performance relative to other test takers. Criterion referencing takes the opposite approach: it gives meaning to scores by comparing them to values directly linked to the test content itself, regardless of how others perform on that content (Popham and Husek 1969).

Educational tests supporting instructional decision making are often criterion referenced. For example, classroom assessments are used to identify course content that a student has and has not mastered, so that deficiencies can be addressed before moving forward. The vocabulary test mentioned above is one example. Others include tests used in student placement and exit testing.

Standardized state test results, which were presented above as an example of norm referencing, are also given meaning using some form of criterion referencing. The criteria in state tests are established, in part, by a panel of teachers and administrators who participate in what is referred to as a standard setting. State test standards are chosen to reflect different levels of mastery of the test content. In Nebraska, for example, two cut-off scores are chosen per test to place students in one of three categories: below the standards, meets the standards, or exceeds the standards. These categories are referred to as performance levels. Student performance can then be evaluated based on the description of typical performance for their level. Here is the performance level description for grade 5 science, meets the standards, as of 2014:

Overall student performance in science reflects satisfactory performance on the standards and sufficient understanding of the content at fifth grade. A student scoring at the Meets the Standards level generally draws on a broad range of scientific knowledge and skills in the areas of inquiry, physical, life, and Earth/space sciences.

The Nebraska performance categories and descriptions are available online at www.education.ne.gov/assessment. Performance level descriptions are accompanied by additional details about expected performance for students in this group on specific science concepts. Again for grade 5 science, meets the standards:

A student at this level generally:
1. Identifies testable questions,
2. Identifies factors that may impact an investigation,
3. Identifies appropriate selection and use of scientific equipment,
4. Develops a reasonable explanation based on collected data,
5. Describes the physical properties of matter and its changes.

The performance levels and descriptors used in standardized state tests provide general information about how a test score relates to the content that the test is designed to measure. Given their generality, these results are of limited value to teachers and parents. Instead, performance level descriptors are used for accountability purposes, for example, to assess performance at the school, district, and even the state levels in terms of the numbers of students meeting expectations.

The Beck Depression Inventory (BDI; Beck et al. 1961) is an example of criterion referencing in psychological testing. The BDI includes 21 items representing a range of depressive symptoms. Each item is scored polytomously from 0 to 3, and a total score is calculated across all of the items. Cutoff scores are then provided to identify individuals with minimal, mild, moderate, and severe depression, where lower scores indicate fewer depressive symptoms and higher scores indicate more severe depressive symptoms.
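Cutoff scores like these are easy to apply in R with cut(). The cutoffs below are hypothetical values chosen for illustration, not the published BDI thresholds:

# Hypothetical total scores for four respondents
bdi_totals <- c(2, 12, 21, 35)
# Hypothetical cutoffs defining four severity categories
# right = FALSE makes each interval include its lower bound
cut(bdi_totals, breaks = c(0, 10, 19, 30, 64), right = FALSE,
  labels = c("minimal", "mild", "moderate", "severe"))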

### 2.5.3 Comparing referencing methods

Although norm and criterion referencing are presented here as two distinct methods of giving meaning to test scores, they can sometimes be interrelated and thus difficult to distinguish from one another. The myIGDI testing program described above is one example of score referencing that combines both norms and criteria. These assessments were developed for measuring growth in early literacy skills in preschool and kindergarten classrooms. Students with scores falling below a cut-off value are identified as potentially being at risk for future developmental delays in reading. The cut-off score is determined in part based on a certain percentage of the test content (criterion information) and in part using mean performance of students evaluated by their teachers as being at-risk (normative information).

Norm and criterion referencing serve different purposes. Norm referencing is typically associated with tests designed to rank order test takers and make decisions involving comparisons among individuals, whereas criterion referencing is associated with tests designed to measure learning or mastery and make decisions about individuals and programs (e.g., Bond 1996; Popham and Husek 1969). These different emphases are relevant to the purpose of the test itself, and should be considered in the initial stages of test development, as discussed in Chapters 3 and 4.

## 2.6 Summary

This chapter provides an overview of what measurement is, how measurement is carried out in terms of scaling and scoring, and how measurement is given additional meaning through the use of score referencing and scale transformation. Before moving on to the next chapter, make sure you can respond to the learning objectives for this chapter, and complete the exercises below.

### 2.6.1 Exercises

1. Teachers often use brief measures of oral reading fluency to see how many words students can read correctly from a passage of text in one minute. Describe how this variable could be modified to fit the four different scales of measurement.
2. Examine frequency distributions for each attitude toward school item, as was done with the reading items. Try converting counts to percentages.
3. Plot a histogram and describe the shape of the distribution of attitude toward school scores.
4. What country has the most positive attitude toward school?
5. Describe how both norm and criterion referencing could be helpful in an exam used to screen applicants for a job.
6. Describe how norm and criterion referencing could be used in evaluating variables outside the social sciences, for example, with the physical measurement applications presented at the beginning of the chapter.
7. Provide details about a measurement application that interests you.
1. How would you label your construct? What terms can be used to define it?
2. With whom would you measure this construct? Who is your object of measurement?
3. What are the units of measurement? What values are used when assigning scores to people? What type of measurement scale will these values produce?
4. What is the purpose in measuring your construct? How will scores be used?
5. How is your construct commonly measured? Are there existing measures that would suit your needs?

If you’re struggling to find a measurement application that interests you, you can start with the construct addressed in this book. As a measurement student, you possess an underlying construct that will hopefully increase as you read, study, practice, and contribute to discussions and assignments. This construct could be labeled assessment literacy (Stiggins 1991).

### References

Stevens, S. S. 1946. “On the Theory of Scales of Measurement.” Science 103: 677–80.

Nelson, D. A., C. C. Robinson, C. H. Hart, A. D. Albano, and S. J. Marshall. 2010. “Italian Preschoolers’ Peer-Status Linkages with Sociability and Subtypes of Aggression and Victimization.” Social Development 19: 698–720.

Bradfield, T. A., A. C. Besner, A. K. Wackerle-Hollman, A. D. Albano, M. C. Rodriguez, and S. R. McConnell. 2014. “Redefining Individual Growth and Development Indicators: Oral Language.” Assessment for Effective Intervention 39: 233–44.

Myers, I. B., M. H. McCaulley, N. L. Quenk, and A. L. Hammer. 1998. “Manual: A Guide to the Development and Use of the Myers-Briggs Type Indicator.” Palo Alto, CA: Consulting Psychologist Press.

Likert, R. 1932. “A Technique for the Measurement of Attitudes.” Archives of Psychology 22: 5–55.

Mathews, S., and H. A. Herzog. 1997. “Personality and Attitudes Toward the Treatment of Animals.” Society and Animals 5: 169–75.

Hambleton, R. K., and R. W. Jones. 1993. “Comparison of Classical Test Theory and Item Response Theory and Their Applications to Test Development.” Educational Measurement: Issues and Practice 12: 38–47.

Popham, W. J., and T. R. Husek. 1969. “Implications of Criterion-Referenced Measurement.” Journal of Educational Measurement 6: 1–9.

Beck, A. T., C. H. Ward, M. Mendelson, J. Mock, and J. Erbaugh. 1961. “An Inventory for Measuring Depression.” Archives of General Psychiatry 4: 53–63.

Bond, L. A. 1996. “Norm- and Criterion-Referenced Testing.” Practical Assessment Research & Evaluation 5 (2).

Stiggins, R. 1991. “Assessment Literacy.” Phi Delta Kappan 72: 534–39.