Chapter 1
Measurement, Scales, and Scoring

Measurement is never better than the empirical operations by which it is carried out, and operations range from bad to good.


— Stanley Stevens, On the Theory
of Scales of Measurement

Introduction

The previous chapter briefly introduced a few perspectives on testing, with an emphasis on validity as a measure of the effectiveness of test scores. Validity is an overarching issue that encompasses all stages in the test development and administration processes, from blueprint to bubble sheet, including the stage wherein we choose the empirical operations that will assign numbers or labels to test takers based on their performance or responses.

In this chapter, we’ll examine the measurement process at its most fundamental level, the level of measurement. We’ll define the three requirements for measurement, and consider the simplicity of physical measurement in comparison to the complexities of educational and psychological measurement, where the thing we measure is often intangible and best represented using item sets and composite scores. Along the way, we’ll describe the four types of measurement scales that are available, and we’ll look into why Stevens (1946) concluded that not all scales are created equal. Last are scoring and score referencing, including examples of norm and criterion referencing.

What is Measurement?

How do we define it?

We usually define the term measurement as the assignment of values to objects according to some system of rules. This definition originates with Stevens (1946), who presented what have become the four traditional scales or types of measurement. We’ll talk about these shortly. For now, let’s focus on the general measurement process, which involves giving an object, the person or thing for whom we’re measuring, a value that represents something about it.

Measurement is happening all the time, all around us. Daily, we measure what we eat, where we go, and what we do. For example, drink sizes are measured using categories like tall, grande, and venti. A jog or a commute is measured in miles or kilometers. We measure the temperature of our homes, the air pressure in our tires, and the carbon dioxide in our atmosphere. The wearable technology you might have strapped to your wrist could be monitoring your lack of movement and decreasing heart rate as you doze off reading this sentence. After you wake up, you might check your watch and measure the length of your nap in minutes or hours.

These are all examples of physical measurement. In each example, you should be able to identify 1) the object of measurement, 2) the property or quality that’s being measured for it, and 3) the kinds of values that could be used to represent amounts of this quality or property. The property or quality that’s being measured for an object is called the variable. The kinds of values we assign to an object, for example, grams or degrees Celsius or beats per minute, are referred to as the units of measurement that are captured within that variable.

So, three things are required for measurement to happen: an object, a variable, and values or units. Again, the variable is the quality or property we measure, the object is for whom we measure it, and the values are the numbers or labels we assign. Once you can identify these three components for each physical measurement example above, make sure you can come up with your own examples that contain all three parts.

From Physical to Intangible

With most physical measurements, the property that we’re trying to represent or capture with our values can be clearly defined and consistently measured. For example, amounts of food are commonly measured in grams. A cup of cola has about 44 grams of sugar in it. When you see that number printed on your can of soda pop or fizzy water, the meaning is pretty clear, and there’s really no need to question whether it’s accurate. Cola has a lot of sugar in it.

But, just as often, we take a number like the amount of sugar in our food and use it to represent something abstract or intangible like how healthy or nutritious the food is. A food’s healthiness isn’t as easy to define as its mass or volume. A measurement of healthiness or nutritional value might account for the other ingredients in the food and how many calories they boil down to. Furthermore, different foods can be more or less nutritional for different people, depending on a variety of factors. Healthiness, unlike physical properties, is intangible and difficult to measure.

The social sciences of education and psychology typically focus on the measurement of constructs, intangible and unobservable qualities, attributes, or traits that we assume are causing certain observable behavior or responses. In this course, our objects of measurement are typically people, and our goal is to give these people numbers or labels that tell us something meaningful about qualities such as their intelligence, their math ability, or their social anxiety. Constructs such as these are difficult to measure. That’s why we need an entire course to discuss how to best measure them.

A good question to ask at this point is, how can we measure and provide values for something that’s unobservable? How do we score a person’s math ability if we can’t observe it directly? What we need is an operationalization of our construct, an observable behavior or response that increases or decreases as a person moves up or down on the construct. With math ability, that operationalization might be the number of math questions a person answers correctly out of 20. With social anxiety, it might be the frequency of feeling anxious over a given period of time. When using a proxy for our construct, we have to assume or infer that the operationalization we’re actually observing and measuring accurately represents the underlying quality or property that we’re interested in. This brings us to the overarching question for this course.

What makes measurement good?

In the last year of my undergraduate degree in psychology, I conducted a research study on the constructs of aggression, sociability, and victimization with Italian preschoolers (D. A. Nelson, Robinson, Hart, Albano, & Marshall, 2010). I spent about four weeks collecting data in preschools. Data collection involved covering a large piece of cardboard with pictures of all the children in a classroom, and then asking each child, individually, questions about their peers.

To measure sociability, we asked three simple questions: “who is fun to talk to?” “who is fun to do pretend things with?” and “who has many friends?” Kids with lots of peer nominations on these questions received a higher score, indicating that they were more sociable. After asking these and other questions to about 300 preschoolers, and then tallying up the scores, I wondered how well we were actually measuring the constructs we were targeting. Were these scores any good? Were three to five questions enough? Were we missing something important? Did some of these questions, which had to be translated from English into Italian, mean different things on the coast of the Mediterranean than they did in the Midwest US?

This project was my first experience on the measuring side of measurement, and it fascinated me. The questions that I asked then are the same questions that we’ll ask and answer in this course. How consistently and accurately are we measuring what we intend to measure? What can we do to improve our measurement? And how can we identify instruments that are better or worse than others? These questions all have to do with what makes measurement good.

Many different things make measurement good, from writing high-quality questions and items to adhering to established test development guidelines. For the most part, the resulting scores are considered good, or effective, when they consistently and accurately describe a target construct. Consistency and accuracy refer to the reliability and validity of test scores, that is, the extent to which the same scores would be obtained across repeated administrations of a test, and the extent to which scores fully represent the construct they are intended to measure.

These two terms, reliability and validity, will come up many times throughout the course. The second one, validity, will help us clarify our definition of measurement in terms of its purpose. Of all the considerations that make for effective measurement, the first to address is purpose.

What is the purpose?

Measurement is useless unless it is based on a clearly articulated purpose. This purpose describes the goals of administering a test or survey, including what will be measured, for whom, and why. We’ve already established the “what?” as the variable or construct, the property, quality, attribute, or trait that our numbers or values represent. We’ve also established the “for whom?” as the object, in our case, people, but more specifically perhaps students, patients, or employees. Now we need to establish the “why?”

The purpose of a test specifies its intended application and use. It addresses how scores from the test are designed to be interpreted. A test without a clear purpose can’t be effective.

Suppose someone asks you to create a measure of students’ financial savvy, that is, their understanding of money and how it is used in finance. You’ve got here a simple construct, understanding of finance, and the object of measurement, students. But before you can develop this test you’d need to know how it is going to be used. Its purpose will determine key features like what specific content the test contains, the level of difficulty of the questions, the types of questions used, and how it’s administered. If the test is used as a final exam in a finance course, it should capture the content of that course, and it might be pretty rigorous. On the other hand, if it’s used with the general student body to see what students know about balancing budgets and managing student loans, the content and difficulty might change. Clearly, you can’t develop a test without knowing its purpose. Furthermore, a test designed for one purpose may not function well for another.

Take a minute to think about some of the tests you’ve used or taken in the past. How would you express the purposes of these tests? When answering this question, be careful to avoid simply saying that the purpose of the test is to measure something. A statement of test purpose should clarify what can be done with the resulting scores. For example, scores from placement tests are used to determine what courses a student should take or identify students in need of certain instructional resources. Scores on admissions tests inform the selection of applicants for entrance to a college or university. Scores on certification and licensure exams are used to verify that examinees have the knowledge, skills, and abilities required for practice in a given profession. Table 1.1 includes these and a few more examples. In each case, scores are intended to be used in a specific way.



Table 1.1: Intended Uses for Some Common Types of Standardized Tests
Test Type        Intended Use
Accountability   Hold various people responsible for student learning
Admissions       Select applicants for entrance to an educational institution
Employment       Inform the hiring and promotion of employees
Exit Testing     Check for mastery of content required for graduation
Licensing        Verify that candidates are fit for practice
Placement        Select coursework or identify instructional needs

Here’s another example that I’ll use throughout this course. Some of my work and research is based on a type of standardized testing that is used to measure student growth over a short period of time. In addition to measuring growth, scores are also used to evaluate the effectiveness of intervention programs, where effective interventions lead to positive results for students. My latest project involved measures of early literacy called myIGDIs (Bradfield et al., 2014). A brochure for the measures from www.myigdis.com states,

myIGDIs are a comprehensive set of assessments for monitoring the growth and development of young children. myIGDIs are easy to collect, sensitive to small changes in children’s achievement, and mark progress toward a long-term desired outcome. For these reasons, myIGDIs are an excellent choice for monitoring English Language Learners and making more informed Special Education evaluations.

Note that these are some specific and ambitious claims. Validity evidence is needed to demonstrate that scores can effectively be used in this way.

The point of these examples is simply to clarify what goes into a statement of purpose, and why a well-articulated purpose is an essential first step to measurement. We’ll come back to validation of test purpose in Chapters 2 and 9. For now, you just need to be familiar with how a test purpose is phrased and why it’s important.

Summary

To summarize this section, the measurement process allows us to capture information about individuals that can be used to describe their standing on a variety of constructs, from educational ones, like math ability and vocabulary knowledge, to psychological ones, like sociability and aggression. We measure these properties by operationalizing our construct, for example, in terms of the number of items answered correctly or the number of times individuals exhibit a certain behavior. These operational variables are then assumed to represent our construct of interest. Finally, our measures of these constructs can then be used for specific purposes, such as to inform research questions about the relationship between sociability and aggression, or to measure growth in early literacy.

So, measurement involves a construct that we don’t directly observe and an operationalization of it that we do observe. Our measurement is said to be effective when there is a strong connection between the two, which is best obtained when our measurement has a clear purpose. In the next two sections, on measurement scales and scoring, we’ll focus on how to handle the operational side of measurement. Then, with measurement models, we’ll consider the construct side. Finally, in the section on score referencing, we’ll talk about additional labels that we use to give meaning to our scores.

Measurement Scales

Now that we’ve established what measurement is, and some key features that make the measurement process good, we can get into the details of how measurement is carried out. As defined by Stevens (1946), measurement involves the assignment of values to objects according to certain rules. The rules that guide the measurement process determine the type of measurement scale that is produced and the statistics that can be used with that scale.

Four types of scales

Measurement scales are grouped into four different types. These differ in the meaning that is given to the values that are assigned, and the relationship between these values for a given variable.

Nominal

The most basic measurement scale is really the absence of a scale, because the values used are simple categories or names, rather than quantities of a variable. For this reason it is referred to as a nominal scale, where people are grouped qualitatively, for example by gender or political party. The nominal scale can also represent variables such as zip code or eye color, where multiple categories are present. So, identifying variables such as student last name or school ID are also considered nominal.

Only frequencies, proportions, and percentages (and related nonparametric statistics) are permitted with nominal variables. Means and standard deviations (and related parametric statistics) do not work. It would be meaningless to calculate something like an average gender or eye color, because nominal variables lack any inherent ordering or quantity in their values.

Ordinal

The dominant feature of the ordinal scale is order, where values do have an inherent ordering that cannot be removed without losing meaning. Common examples of ordinal scales include ranks (e.g., first, second, third, etc.), the multi-point rating scales seen in surveys (e.g., strongly disagree, disagree, etc.), and level of educational attainment.

The distance between the ordered categories in ordinal scale variables (i.e., the interval) is never established. So, the difference between first and second place does not necessarily mean the same thing as the difference between second and third. In a swimming race, first and second might differ by a matter of milliseconds, whereas second and third differ by minutes. We know that first is faster than second, and second is faster than third, but we don’t know how much faster. Note that the construct we’re measuring here is probably swimming ability, which is actually operationalized on a ratio scale, in terms of speed, but it is simplified to an ordinal scale when giving out awards.

Statistics which rely on interval level information, such as the mean, standard deviation, and all mean-based statistical tests, are still not allowed with an ordinal scale. Statistics permitted with ordinal variables include the median and any statistics based on percentiles.

Interval

Interval scales include ordered values where the distances, or intervals, between them are meaningful. Whereas an ordinal scale describes one category only as greater than, less than, or equal to another, with an interval scale the difference between categories is quantified in scale points that have a consistent meaning across the scale. With interval scales we can finally use means, standard deviations, and related parametric statistical tests.

One common example of an interval scale is test score based on number correct, where each item in a test is worth the same amount when calculating the total. When treating test scores as interval variables, we make the assumption that a difference in score points reflects a consistent difference in the construct no matter where we are on the scale. This can sometimes be problematic. A test of vocabulary could be measured on an interval scale, where each correctly defined word contributes the same amount to the total score. However, in this case we assume that each correct definition is based on the same amount of construct, vocabulary knowledge. That is, the vocabulary words need to be similar in difficulty for the students we’re testing. Otherwise, scale intervals will not have a consistent meaning. Instead, an increase in number correct will depend on the word that is answered correctly.

Another common example of an interval scale is temperature as measured in degrees centigrade or Fahrenheit. These temperature scales both have meaningful intervals, where a given increase in heat, for example, produces the same increase in degrees no matter where you are on the scale. However, a zero on the Fahrenheit or centigrade scales does not indicate an absence of the variable we are measuring, temperature. This is the key distinction between an interval and a ratio scale.

Ratio

The ratio scale is an interval scale with a meaningful absolute zero, or a point at which there is an absence of the variable measured. Whereas an interval scale describes differences between scale values in scale points, a ratio scale can compare values by ratios. A simple example is time, where 40 minutes is twice as long as 20 minutes, and zero minutes indicates no time at all. Other examples include counts of observations or occurrences, such as the number of aggressive or prosocial behaviors per hour, or the frequency of drug use in the past month.

Note that we often reference ratio scales when operationalizing constructs, in which case we may lose our meaningful zero point. For example, zero prosocial behaviors does in fact indicate that nothing noticeably prosocial occurred for a student over a certain period of time. However, this may not mean that a student is completely void of prosociability. In the same way, zero aggressive behaviors does not necessarily indicate an absence of aggression. Thus, when a ratio variable is used to operationalize a construct, it may lose its ratio properties.

All statistics are permitted with ratio scales, though the only ones we talk about, in addition to those available with interval scales, are statistics that let you make comparisons in scores using ratios. For example, a two hour test is twice as long as a one hour test, and five aggressive episodes is half as many as ten. However, as before, if our scale is assumed to reference some underlying construct, five aggressive episodes may not indicate twice as much aggression as ten.

Comparing scales

Progressing from nominal to ratio, the measurement scales become more descriptive of the variable they represent, and more statistical options become available. In general, the further from a nominal scale the better, as once the scale is designated it cannot be upgraded, only downgraded. For example, the variable age could be represented in the following four ways:

1. number of days spent living, from 0 to infinity;
2. day born within a given year, from 1 to 365;
3. degree of youngness, including toddler, adolescent, adult, etc.; or
4. type of youngness, such as the same as Mike, or the same as Ike.

The first of these four, a ratio scale, is the most versatile and can be converted into any of the scales below it. However, once age is defined based on a classification, such as “same as Mike,” no improvement can be made. For this reason a variable’s measurement scale should be considered in the planning stages of test design, ideally when we identify the purpose of our test.
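To make the downgrade-only direction concrete, here is a minimal Python sketch, with hypothetical ages and category labels, that starts from age on a ratio scale (days lived) and derives ordinal and nominal representations from it. Going the other way, from labels back to days, is not possible.

```python
# A minimal sketch (hypothetical values and labels) showing that a ratio-scale
# variable can be downgraded to ordinal or nominal, but not upgraded.

ages_in_days = [650, 4800, 9500, 15000]  # ratio scale: meaningful zero, equal intervals

def youngness_category(days):
    """Ordinal scale: ordered categories with unequal, unknown intervals."""
    years = days / 365
    if years < 3:
        return "toddler"
    elif years < 18:
        return "adolescent"
    else:
        return "adult"

ordinal_ages = [youngness_category(d) for d in ages_in_days]

# Nominal scale: unordered labels, e.g., the same age as Mike or not
mikes_age_in_days = 4800
nominal_ages = ["same as Mike" if d == mikes_age_in_days else "not same as Mike"
                for d in ages_in_days]

print(ordinal_ages)  # ['toddler', 'adolescent', 'adult', 'adult']
print(nominal_ages)  # ['not same as Mike', 'same as Mike', 'not same as Mike', 'not same as Mike']
```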

In the social sciences, measurement with the ratio scale is difficult to achieve because our operationalizations of constructs typically don’t have meaningful zeros. So, interval scales are considered optimal, though they too are not easily obtained. Consider the sociability measure described above. What type of scale is captured by this measure? Does a zero score indicate a total absence of sociability? This is required for ratio. Does an incremental increase at one end of the scale mean the same thing as an incremental increase at the other end of the scale? This is required for interval.

Upon close examination, it is difficult to measure sociability, and most other constructs in the social sciences, with anything more than an ordinal scale. Unfortunately, an interval or ratio scale is required for the majority of statistics that we’d like to use. Along these lines, Stevens (1946, p. 679) concluded:

Most of the scales used widely and effectively by psychologists are ordinal scales. In the strictest propriety the ordinary statistics involving means and standard deviations ought not to be used with these scales, for these statistics imply a knowledge of something more than the relative rank-order of data. On the other hand, for this ‘illegal’ statisticizing there can be invoked a kind of pragmatic sanction: In numerous instances it leads to fruitful results. While the outlawing of this procedure would probably serve no good purpose, it is proper to point out that means and standard deviations computed on an ordinal scale are in error to the extent that the successive intervals on the scale are unequal in size. When only the rank-order of data is known, we should proceed cautiously with our statistics, and especially with the conclusions we draw from them.

Based on this argument, a mean sociability score is only as useful as the scale itself is interval. The less meaningful the intervals between sociability scores, the less meaningful our mean estimate will be. Thus, when designing an instrument, we need to be aware of this limitation, and do our best to improve the intervalness of our scales. When stating the purpose of a test, we need to be aware of how our construct and operationalization of it will impact our resulting scale. Finally, we need to acknowledge the limitations of our scales, especially when utilizing potentially incorrect statistics.

Scoring

This course focuses on cognitive and affective test scores as operationalizations of constructs in education and psychology. As noted above, these test scores often produce ordinal scales with some amount of meaning in their intervals. The particular rules for assigning values within these scales depend on the type of scoring mechanisms used. Here, we’ll cover the two most common scoring mechanisms, dichotomous and polytomous, and we’ll discuss how these are used to create rating scales and composite scores.

Dichotomous scoring

Dichotomous scoring refers to the assignment of one of two possible values based on a person’s performance or response to a test question. A simple example is the use of correct and incorrect to score a cognitive item response. These values are mutually exclusive, and describe the correctness of a response in the simplest terms possible, as completely incorrect or completely correct. Most cognitive tests involve at least some dichotomously scored items. Multiple-choice questions, which will be discussed further in Chapter 3, are usually scored dichotomously.

Dichotomous scoring could involve different score values, besides correct and incorrect. The most common example is scoring that represents a response of either yes or no. Affective measures, such as attitude surveys and behavior checklists, often use this type of dichotomous scoring. Depression inventories, for example, may present individuals with lists of statements that people with depression typically identify strongly with. Individuals then respond to each statement by indicating whether or not the statements are characteristic of them.

Other dichotomous scores that do not indicate the presence or absence of a construct are sometimes used, but these are not discussed here.
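As a concrete illustration of dichotomous scoring, the short sketch below (the answer key and responses are hypothetical) assigns 1 to a correct multiple-choice response and 0 to anything else, and then totals the item scores.

```python
# A minimal sketch of dichotomous scoring for multiple-choice items.
# The answer key and responses are hypothetical.

answer_key = ["B", "D", "A", "C", "B"]
responses = ["B", "D", "C", "C", "A"]

# Each item receives one of two possible values: 1 (correct) or 0 (incorrect)
item_scores = [1 if response == key else 0
               for response, key in zip(responses, answer_key)]

print(item_scores)       # [1, 1, 0, 1, 0]
print(sum(item_scores))  # 3, the number-correct total across the five items
```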

Polytomous scoring

Polytomous scoring simply refers to the assignment of three or more possible values for a given test question or item. In cognitive testing, a simple example is the use of rating scales to score written responses such as essays. In this case, score values may still describe the correctness of a response, but with differing levels of correctness, for example, incorrect, partially correct, and fully correct.

Polytomous scoring with cognitive tests can be less straightforward and less objective than dichotomous scoring, primarily because it usually requires the use of human raters with whom it is difficult to maintain consistent meaning of assigned categories such as partially correct. The issue of interrater reliability will be discussed in Chapter 6.

Polytomous scoring with affective or non-cognitive measures most often occurs with the use of rating scales. For example, individuals may use a rating scale to describe how much they identify with a statement, or how well a statement represents them, rather than simply saying yes or no. Such rating scales measure multiple levels of agreement (e.g., from disagree to agree) or preference (e.g., from dislike to like). In this case, because individuals provide their own responses, subjectivity in scoring is not an issue as it is with polytomous scoring in cognitive tests. Instead, the challenge with rating scales becomes ensuring that individuals interpret the rating categories in the same way. For example, strongly disagree could mean different things to different people, which will impact how the resulting scores can be compared across individuals.

Except in essay scoring and with some affective measures, individual questions, whether dichotomously or polytomously scored, are rarely used by themselves to measure a construct. Instead, scores from multiple items are combined to create composite scores or rating scale scores.

Rating scales

When I was in graduate school, the professor for my introductory measurement class would chastise students when they referred to multipoint rating scales as “Likert scales.” Likert (1932) did not invent the rating scale. Instead, he detailed two methods for combining scores across multiple rating scale items to create a composite score that would be, in theory, a stronger measure of the construct than any individual item. One of these methods, which has become a standard technique in affective measurement, is to assign ordinal numerical values to each rating scale category, and then calculate a sum or average across a set of these rating scale items.

The scaling technique demonstrated by Likert (1932) involves, first, the scoring of individual rating scale items using polytomous scales. For example, response options for one set of survey questions in Likert (1932) included five categories, ranging from strongly disapprove to undecided to strongly approve. These were assigned score values of 1 through 5. Then, a total score was obtained across all items in the set, and low scores were interpreted as indicating strong disapproval and high scores were interpreted as indicating strong approval. This process could be referred to as Likert scaling. But in this course we’ll simply refer to it as composite scaling, composite scoring, or simply creating a total or average score across multiple items.
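Here is a minimal sketch of this kind of composite scoring, assuming hypothetical responses from one person to five rating scale items, each scored 1 (strongly disapprove) through 5 (strongly approve). Any reverse-worded items would need to be recoded before totaling.

```python
# A minimal sketch of composite scoring across rating scale items.
# Responses are hypothetical; each item is scored 1 (strongly disapprove)
# through 5 (strongly approve).

item_scores = [4, 5, 3, 4, 2]  # one person's responses to five items

total_score = sum(item_scores)               # composite as a sum, possible range 5 to 25
mean_score = total_score / len(item_scores)  # composite as an average, on the 1 to 5 metric

print(total_score)  # 18
print(mean_score)   # 3.6
```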

In Chapter 4 we will address rating scales in more detail. We’ll cover issues in the construction and administration of rating categories. Here, we are more concerned with the benefits of using composite scores.

Composites versus components

A composite score is simply the result of some combination of separate subscores, referred to as components. Most often, we will deal with total scores or factor scores on a test, where individual items make up the components. Factor scores refer to scores obtained from some measurement model, such as a classical test theory model, discussed in Chapter 5, or an item response theory model, discussed in Chapter 8. We will also encounter composite scores based on totals and means from rating scale items. In each case, the composite is going to be preferable to any individual component for a number of reasons.

Composite scores are preferable from a statistical standpoint because they tend to provide a more reliable and valid measure of our construct. Composites are more reliable and valid because they combine information from multiple smaller, repeated measures of the construct. These smaller components may each be limited in certain ways, or may only present a small piece of the big picture, and when combined the resulting score is more comprehensive and more easily reproduced in subsequent measurements. In Chapter 5, we’ll learn more about why reliability is expected to increase, in theory, as we increase the number of items in our composite.

For example, when measuring a construct such as attitude toward animal rights, a single item would only provide information about a specific instance of the issue. Consider the example survey items presented by Mathews and Herzog (1997, p. 171):

The Animal Attitude Scale (AAS) assesses individual differences in attitudes toward the treatment of animals... It is composed of 29 items which subjects rate on a five-point Likert scale (strongly agree to strongly disagree). Sample items include, “I do not think that there is anything wrong with using animals in medical research,” “It is morally wrong to hunt wild animals just for sport,” and “I would probably continue to use a product that I liked even though I know that its development caused pain to laboratory animals.”

By themselves, any one of these items may not reflect the full construct that we are trying to measure. A person may strongly support animal rights, except in the case of medical research. Or a person may define the phrase “that I liked,” from the third example question, in different ways so that this individual question would produce different results for people who might actually be similar in their regard for animals. A composite score will tend to wash out the limitations of individual items. (Side note from this study: a regression model showed that 25% of the variance in attitude toward animals was accounted for by gender and a personality measure of sensitivity.)

The simpler methods for creating composites, by averaging and totaling across items, are used with smaller-scale instruments to facilitate scoring and score reporting. However, the scaling of many instruments, including large-scale educational tests and psychological measures, often involves the use of measurement models.

Measurement models

Whereas a simple sum or average over a set of items lets each item contribute the same amount to the overall score, more complex measurement models can be used to estimate the different contributions of individual items to the underlying construct. These contributions can be examined in a variety of ways, as discussed in Chapters 5, 7 and 8. Together, they can provide useful information about the quality of a measure, as they help us understand the relationship between our operationalization of the construct, in terms of individual items, and the construct itself.

Measurement models represent an unobservable construct by formally incorporating a measurement theory into the measurement process. We will review two theories in this class. The first, presented in Chapter 5, is called classical test theory, and the second, presented in Chapter 8, is called item response theory (see Hambleton & Jones, 1993, who compare the two). For now, we’ll just look at the basics of what a measurement model does.

Figure 1.1 contains a visual representation of a simple measurement model where the underlying construct of sociability, shown in an oval, causes, in part, the observed responses in a set of three questions, shown in rectangles as Item 1, Item 2, and Item 3. Unobservable quantities in a measurement model are typically represented by ovals, and observable quantities by rectangles. Causation is then represented by the arrows which point from the construct to the item responses. The numbers over each arrow from the construct are the scaled factor loadings reported in D. A. Nelson et al. (2010), which represent the strength of the relationship between the items and the construct which they together define. As with a correlation coefficient, the larger the factor loading, the stronger the relationship. Thus, item 1 has the strongest relationship with the sociability factor, and item 3 has the weakest.

The other unobserved quantities in Figure 1.1 are the error terms, in the circles, which also impact responses on the three items. Without arrows linking the error terms from one to another, the model assumes that errors are independent and unrelated across items. In this case, any influence on a response that does not come from the common factor of sociability is attributed to measurement error.




Figure 1.1: A simple measurement model for sociability with three items, based on results from D. A. Nelson et al. (2010). Numbers are factor loadings and E represents unique item error.

Models such as the one in Figure 1.1 are referred to as confirmatory factor analysis models, because we propose a given structure for the relationships between constructs, error, and observations, and seek to confirm it by placing certain constraints on the relationships we estimate.
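To make the structure in Figure 1.1 more concrete, here is a small simulation sketch of a one-factor measurement model. The loading values are hypothetical placeholders, not the estimates reported by D. A. Nelson et al. (2010); the point is only that an unobserved construct plus independent item error can generate the observed item scores.

```python
# A minimal simulation sketch of a one-factor measurement model like Figure 1.1.
# Loadings are hypothetical placeholders; the construct and errors are unobserved.
import numpy as np

rng = np.random.default_rng(1)
n_people = 1000

sociability = rng.normal(0, 1, n_people)  # the unobserved construct (the oval)
loadings = [0.8, 0.7, 0.5]                # hypothetical factor loadings

items = np.zeros((n_people, 3))
for j, loading in enumerate(loadings):
    # Independent error for each item (the circles), uncorrelated across items
    error = rng.normal(0, np.sqrt(1 - loading ** 2), n_people)
    items[:, j] = loading * sociability + error  # observed item scores (the rectangles)

# Each item's correlation with the construct should roughly recover its loading
print([round(float(np.corrcoef(items[:, j], sociability)[0, 1]), 2) for j in range(3)])
```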

Score Scaling and Referencing

Now that we’ve discussed the measurement process we can go over some common methods for giving meaning to the scores that our measures produce. These methods are referred to as score scaling and norm and criterion score referencing. Each is discussed briefly below, with examples.

Score Scaling

Score scales are often modified to have certain properties, including smaller or larger score intervals, different midpoints, and different variability. A common example is the z-score scale, which is defined to have a mean of 0 and standard deviation (SD) of 1. Any variable having a mean and SD can be converted to z-scores, which express each score in terms of distances from the mean in SD units. Once a scale has been converted to the z-score metric, it can then be transformed to have any midpoint, via the mean, and any scaling factor, via the standard deviation. Equations for these transformations are shown below. Methods for carrying out these transformations are discussed again in Chapter 5.

To convert a variable $Y$ from its original score scale to the z-score scale, we subtract $\mu_Y$, the mean of $Y$, from each score, and then divide by $\sigma_Y$, the SD of $Y$. The resulting z transformation of $Y$, labeled $Y_z$, is:

$$Y_z = \frac{Y - \mu_Y}{\sigma_Y}. \qquad (1.1)$$

Having subtracted the mean from each score, the mean of our new variable $Y_z$ is 0, and having divided each score by the SD, the SD of our new variable is 1. We can now multiply $Y_z$ by any constant $s$, and then add or subtract another constant value $m$ to obtain a linearly transformed variable with mean $m$ and SD equal to $s$. The new rescaled variable is labeled $Y_r$:

$$Y_r = Y_z s + m. \qquad (1.2)$$

The linear transformation of any variable $Y$ from its original metric, with mean and SD of $\mu_Y$ and $\sigma_Y$, to a scale defined by a new mean and standard deviation, is obtained via the combination of these equations, as:

$$Y_r = \frac{(Y - \mu_Y)s}{\sigma_Y} + m. \qquad (1.3)$$

Scale transformations are often employed in testing for one of two reasons. First, transformations can be used to express a variable in terms of a familiar mean and SD. For example, IQ scores are traditionally expressed on a scale with mean of 100 and SD of 15. In this case, Equation 1.3 is used with m = 100 and s = 15. Another popular score scale is referred to as the t-scale, with m = 50 and s = 10. Second, transformations can be used to express a variable in terms of a new and unique metric. When the GRE was revised in 2011, a new score scale was created, in part to discourage direct comparisons with the previous version of the exam. The former quantitative and verbal reasoning GRE scales ranged from 200 to 800, and the revised versions range from 130 to 170.
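The following is a minimal sketch of Equation 1.3 in Python, applied to a handful of hypothetical raw scores and rescaled to the IQ metric (m = 100, s = 15) and the t-scale (m = 50, s = 10).

```python
# A minimal sketch of the linear transformation in Equation 1.3,
# applied to hypothetical raw scores.
import statistics

def rescale(scores, m, s):
    """Convert scores to z-scores, then to a scale with mean m and SD s."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)  # population SD, as in Equation 1.1
    return [(y - mu) * s / sigma + m for y in scores]

raw = [12, 15, 18, 20, 25]
iq_metric = rescale(raw, m=100, s=15)  # rescaled to mean 100, SD 15
t_metric = rescale(raw, m=50, s=10)    # t-scale: mean 50, SD 10

print([round(y, 1) for y in iq_metric])
print([round(y, 1) for y in t_metric])
```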

Norm referencing

Norm referencing gives meaning to scores by comparing them to values for a specific norm group. For example, when my kids bring home their standardized test results from school, their scores in each subject area, math and reading, are given meaning by comparing them to the distribution of scores for students across the state. A score of 22 means very little to a parent who does not have access to the test itself. However, a percentile score of 90 indicates that a student scored at or above 90% of the students in the norming group, regardless of what percentage of the test questions they answered correctly.

Norms are also frequently encountered in admissions testing. If you took something like the ACT or SAT, college admissions exams used in the US, or the GRE, the admissions test for graduate school, you’re probably familiar with the ambiguous score scales these exams use in reporting. Each scale is based on a conversion of your actual test scores to a scale that is intentionally difficult or impossible to understand. In a way, the objective in this rescaling of scores is to force you to rely on the norm referencing provided in your score report. The ACT scales range from 1 to 36, but a score of 20 on the math section doesn’t tell you a lot about how much math you know or can do. Instead, by referencing the published norms, a score of 20 tells you that you scored around the 50th percentile for all test takers, which isn’t great if you’re hoping to get into a good college.

These two examples involve simple percentile norms, where scores are compared to the full score distribution for a given norm group. Two other common types of norm referencing are grade and age norms, which are obtained by estimating the typical or average performance on a test by grade level or age.
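Here is a minimal sketch of simple percentile norm referencing, assuming a small hypothetical norm group. A raw score is given meaning as the percentage of the norm group scoring at or below it.

```python
# A minimal sketch of percentile norm referencing with a hypothetical norm group.

norm_group = [10, 12, 14, 15, 16, 17, 18, 19, 20, 22, 23, 25, 27, 28, 30]

def percentile_rank(score, norms):
    """Percentage of the norm group scoring at or below the given score."""
    at_or_below = sum(1 for x in norms if x <= score)
    return 100 * at_or_below / len(norms)

# A raw score of 22 means little on its own; its percentile rank describes
# standing relative to the norm group
print(round(percentile_rank(22, norm_group), 1))
```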

Criterion referencing

The main limitation of norm referencing is that it only helps describe performance relative to other test takers. Criterion score referencing does the opposite. Criterion referencing gives meaning to scores by comparing them to values directly linked to the test content itself, regardless of how others perform on the content (Popham & Husek, 1969).

Educational tests supporting instructional decision making are often criterion referenced. For example, classroom assessments are used to identify course content that a student has and has not mastered, so that deficiencies can be addressed before moving forward. The vocabulary test mentioned above is one example. Others include tests used in student placement and exit testing.

Standardized state test results, which were presented above as an example of norm referencing, are also given meaning using some form of criterion referencing. The criteria in state tests are established, in part, by a panel of teachers and administrators who participate in what is referred to as a standard setting. State test standards are chosen to reflect different levels of mastery of the test content. In Nebraska, for example, two cut-off scores are chosen per test to categorize students as below the standards, meets the standards, and exceeds the standards. These categories are referred to as performance levels. Student performance can then be evaluated based on the description of typical performance for their level. Here is the performance level description for grade 5 science, meets the standard:

Overall student performance in science reflects satisfactory performance on the standards and sufficient understanding of the content at fifth grade. A student scoring at the Meets the Standards level generally draws on a broad range of scientific knowledge and skills in the areas of inquiry, physical, life, and Earth/space sciences.

The Nebraska performance categories and descriptions are available online at www.education.ne.gov/assessment. Performance level descriptions are accompanied by additional details about expected performance for students in this group on specific science concepts. For example, again for grade 5 science, meets the standard:

A student at this level generally:

The performance levels and descriptors used in standardized state tests provide general information about how a test score relates to the content that the test is designed to measure. Given their generality, these results are of limited value to teachers and parents. Instead, performance level descriptors are used for accountability purposes, for example, to assess performance at the school, district, and even the state levels in terms of the numbers of students meeting expectations.

The Beck Depression Inventory (BDI; Beck, Ward, Mendelson, Mock, & Erbaugh, 1961) is an example of criterion referencing in psychological testing. The BDI includes 21 items representing a range of depressive symptoms. Each item is scored polytomously from 0 to 3, and a total score is calculated across all of the items. Cutoff scores are then provided to identify individuals with minimal, mild, moderate, and severe depression, where lower scores indicate fewer depressive symptoms and higher scores indicate more severe depressive symptoms.
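A minimal sketch of criterion referencing via cut scores follows. The cutoff values below are purely illustrative, not the published BDI cutoffs; the point is that each total score maps to a category defined in advance, regardless of how others score.

```python
# A minimal sketch of criterion referencing with cutoff scores.
# The cutoffs are illustrative only, not the published BDI values.

def severity_category(total_score):
    """Map a total score to a severity level via predetermined cut scores."""
    if total_score < 10:
        return "minimal"
    elif total_score < 19:
        return "mild"
    elif total_score < 30:
        return "moderate"
    else:
        return "severe"

print(severity_category(24))  # "moderate", based on the content-referenced cutoffs
```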

Comparing referencing methods

Although norm and criterion referencing are presented here as two distinct methods of giving meaning to test scores, they can sometimes be interrelated and thus difficult to distinguish from one another. The myIGDI testing program described above is one example of score referencing that combines both norms and criteria. These assessments were developed for measuring growth in early literacy skills in preschool and kindergarten classrooms. Students with scores falling below a cut-off value are identified as potentially being at risk for future developmental delays in reading. The cut-off score is determined in part based on a certain percentage of the test content (criterion information) and in part using mean performance of students evaluated by their teachers as being at-risk (normative information).

Norm and criterion referencing serve different purposes. Most comparisons of the two note that norm referencing is typically associated with tests designed to rank order test takers and make decisions involving comparisons among individuals, whereas criterion referencing is associated with tests designed to measure learning or mastery and make decisions about individuals and programs (e.g., Bond, 1996; Popham & Husek, 1969). These different emphases are relevant to the purpose of the test itself, and should be considered in the initial stages of test development, as discussed in Chapters 2, 3, and 4.

Summary and Homework

This chapter provides an overview of what measurement is, how measurement is carried out in terms of scaling and scoring, and how measurement is given additional meaning through the use of score referencing and scale transformation. Before moving on to Chapter 2, make sure you can respond to the learning objectives for this chapter, and the discussion questions below.

Learning objectives

1. Define the process of measurement.
2. Define the term construct and describe how constructs are used in measurement, with examples.
3. Compare and contrast measurement scales, including nominal, ordinal, interval, and ratio, with examples, and identify their use in context.
4. Compare and contrast dichotomous and polytomous scoring.
5. Describe how rating scales are used to create composite scores.
6. Compare and contrast composite and component scores.
7. Create a generic measurement model and define its components.
8. Define norm referencing and identify contexts in which it is appropriate.
9. Compare three examples of norm referencing: grade, age, and percentile norms.
10. Define criterion referencing and identify contexts in which it is appropriate.
11. Describe how standards and performance levels are used in criterion referencing with standardized state tests.
12. Compare and contrast norm and criterion score referencing, and identify their uses in context.
13. Explain how and why linear scale transformations are used to modify scales.

Discussion questions

Having finished this chapter, you should be able to provide details about a measurement application that interests you. You’ll refer to this application in your assignments and class discussions as we move forward. Here are some questions you need to be able to answer:

1. How would you label your construct? What terms can be used to define it?
2. With whom would you measure this construct? Who is your object of measurement?
3. What are the units of measurement? What values are used when assigning scores to people? What type of measurement scale will these values produce?
4. What is the purpose in measuring your construct? How will scores be used?
5. How is your construct commonly measured? Are there existing measures that would suit your needs?

If you’re struggling to find a measurement application that interests you, you can start with the construct that I’ll be measuring in you throughout this course. As a student, you possess an underlying construct that will hopefully increase as you read, study, practice, and contribute to group work and class discussions. This construct could be labeled assessment literacy (Stiggins, 1991). You’ll receive different scores, based on quizzes and assignments, that are intended to help you and me gauge where you are on the assessment literacy scale. Then, at the end, you’ll receive a percentage score representing how much you’ve mastered. Throughout this course, we’ll use the actual measurement that happens within it as a context for learning.

A few more discussion questions to consider:

1. Teachers often use brief measures of oral reading fluency to see how many words students can read correctly from a passage of text in one minute. Describe how this variable could be modified to fit the four different scales of measurement.
2. How could both norm and criterion referencing be helpful in an exam used to screen applicants for a job?
3. How are norm and criterion referencing used in evaluating variables outside the social sciences, for example, with the measurement applications presented at the beginning of the chapter?