This book provides an introduction to the theory and application of measurement in education and psychology. Topics include test development, item writing, item analysis, reliability, dimensionality, and item response theory. These topics come together in overviews of validity and, finally, test evaluation.

Validity and test evaluation are based on both qualitative and quantitative analysis of the properties of a measure. This book addresses the qualitative side using a simple argument-based approach. The quantitative side is addressed using descriptive and inferential statistical analyses, all of which are presented and visualized within the statistical environment R (R Core Team 2017).

The intended audience for this book includes advanced undergraduate and graduate students, practitioners, researchers, and educators. Knowledge of R is not a prerequisite to using this book. However, familiarity with data analysis and introductory statistics concepts, especially ones used in the social sciences, is recommended.

Perspectives on testing

Testing has become a controversial topic in the context of education. Consider this summary by Nelson (2013) from a study on the costs of educational testing in school districts in the US:

Testing has spiraled out of control, and the related costs are unacceptably high and are taking their educational toll on students, teachers, principals and schools.

The conclusions of this study reflect a sentiment that is shared by many in the educational community, that we often rely too heavily on testing in schools, and that we do so to the detriment of students.

Those critical of educational testing in the US highlight two main problems with assessment mandated from the top down. The first is an overreliance on tests in decision making at all levels, including decisions that impact students, teachers, administrators, and other stakeholders. There are too many tests, given too often, with too high a cost, financially and in terms of lost instructional time. Nelson (2013, 3) notes that “if testing were abandoned altogether, one school district in this study could add from 20 to 40 minutes of instruction to each school day for most grades.”

The second problem is a reliance on tests which are not relevant enough to what is being taught and learned in the classroom. There is a lack of cohesion, a mismatch in content, where teachers, students, and tests are not on the same page. As a result, the tests become frustrating, unpleasant, and less meaningful to the people who matter most, those who take them and those who administer or oversee them at the classroom level.

Both of these problems identified in our testing agenda have to do with what is typically regarded as the pinnacle of quality testing, the all-encompassing, all-powerful validity. Commenting over 50 years ago on its status for the test developer, Ebel (1961) concluded with some religious imagery that “validity has long been one of the major deities in the pantheon of the psychometrician. It is universally praised, but the good works done in its name are remarkably few” [p. 640]. Arguably, nothing is more important in the testing process than validity, the extent to which test scores accurately represent what they’re intended to measure. However, even today, it may be that “relatively few testing programs give validation the attention it deserves” (Brennan 2013, 81).

Establishing validity evidence for a test has traditionally been the responsibility of the test developer, who may only have limited interaction with test users and secondhand familiarity with the issues they face. Yet, Kane (2013) notes that from early on in the testing movement in the US, the appropriateness of testing applications at the student level was a driving force in bringing validity to prominence. He cites Kelley (1927), who observed:

The question of validity would not be raised so long as one man uses a test or examination of his own devising for his private purposes, but the purposes for which schoolmasters have used tests have been too intimately connected with the weal of their pupils to permit the validity of a test to go unchallenged. The pupil… is the dynamic force behind the validity movement. … Further, now that the same tests are used in widely scattered places and that many very different tests all going by the same name are gently recommended by their respective authors, even the most complacent schoolmen, the most autocratic, and the least in touch with pupils, are beginning to question the real fitness of a test. [pp. 13-14]

Motivation for this book

It appears that Kelley (1927) recognized in the early 1900s the same issue that we’re dealing with today. Tests are recommended or required for a variety of applications, and we (even the most complacent and autocratic) often can only wonder about their fitness. Consumers and other stakeholders in testing need access to information and tools, provided by a test author, that allow them to understand and evaluate the quality of a test. Consumers need to be informed. In a way, Kelley (1927) was promoting accessible training in educational and psychological measurement.

As a researcher and psychometrician, someone who studies methods for building and using measurement tools, I admire a good test, much like a computer programmer admires a seamless data access layer or an engineer admires a sturdy truss system. Testing can provide critical information to the systems it supports. It is essential to measuring key outcomes in both education and psychology, and can be used to enhance learning and encourage growth (e.g., Black and Wiliam 1998; Meyer and Logan 2013; Roediger et al. 2011). However, test content and methods can also become outdated and out of touch, and, as a result, they can waste time and even produce erroneous, biased, and damaging information (Santelices and Wilson 2010).

Rather than do away with testing, we need to refine and improve it. We need to clarify its function and ensure that it is fulfilling its purpose. Clearly, the people at each end of the testing process, those creating the tests and those taking them or witnessing their results firsthand, disagree about the validity and effectiveness of the endeavor. This disconnect may never go away completely, especially since an education system has many roles and objectives, inputs and outputs, not all of which are unanimously endorsed. However, there is definitely room for change and growth.

This book will prepare you to contribute to the discussion by giving you a broad overview of the test development and evaluation processes. Given this scope, some deeper topics, like dimensionality and item response theory, are covered only superficially. The purpose of this book is not to make you an expert in every topic, but to help you:

  1. recognize what makes a test useful and understand the importance of valid testing; and
  2. gain some of the basic skills needed to build and evaluate effective tests.

This book addresses affective and non-cognitive testing as well as cognitive, educational testing. Affective testing includes, for example, testing used to measure psychological attributes, perceptions, and behaviors, ones that are typically the target of clinical support rather than ones that are the target of instruction. This form of testing is just as relevant, but is also somewhat less controversial than educational testing, so it didn’t make for as strong an opening to the book.

A handful of other books provide thorough introductions to psychometric theory and applications of measurement. This book differs from the rest in its exclusive focus on introductory topics for readers with a limited background in mathematics or statistics. Experience with introductory statistical concepts is helpful but not essential. This book also provides extensive reproducible examples in R with a single, dedicated R package encapsulating most of the needed functionality.

Structure of this book

The book is divided into ten chapters. Most chapters involve analysis of data in R and item writing activities to be completed with the Proola web app. An introduction to R and Proola is given in Chapter 1, and examples are provided throughout the book.

  1. Introduction - An intro to the resources we’ll be using in this book, with an overview in R of some introductory statistical topics.
  2. Measurement, Scales, Scoring - See what measurement really means, and look at the different procedures used to carry it out.
  3. Testing Applications - Review features of the testing process and terms used to describe and categorize tests, with examples of popular measures.
  4. Test Development - Background on cognitive and noncognitive test development, with an overview of item writing guidelines.
  5. Reliability - A key topic in measurement, introduced via classical test theory.
  6. Item Analysis - Classical descriptive analysis of item-level data, with an emphasis on difficulty, discrimination, and contribution to internal consistency.
  7. Item Response Theory - Also known as modern measurement theory, a latent variable modeling approach to item analysis and test construction.
  8. Factor Analysis - A brief overview of exploratory and confirmatory factor analysis, with applications.
  9. Validity - The pinnacle of the test development process, helping us evaluate the extent to which a test measures what it is intended to measure.
  10. Test Evaluation - A summary of Chapters 1 through 9 applied to the evaluation and comparison of educational and psychological tests.

Learning objectives

Each chapter in this book is structured around a set of learning objectives that appear at the beginning of the chapter. These learning objectives capture, at a summary level, all the material you’d be expected to master in a course accompanying this book. Any assignments, quizzes, or class discussions should be built around these learning objectives, or a subset of them. The full list of objectives is in Appendix A.


Each chapter ends with self-assessments and other activities for stretching and building understanding. These exercises include discussion questions and analyses in R. As this is a book on assessment, you’ll also write your own questions, with the opportunity to submit them to an online learning community for feedback.


This book is published under a Creative Commons Attribution 4.0 International License, which allows for adaptation and redistribution as long as appropriate credit is given. The source code for the book is available online. Comments and contributions are welcome.


R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Nelson, H. 2013. “Testing More, Teaching Less: What America’s Obsession with Student Testing Costs in Money and Lost Instructional Time.” American Federation of Teachers.

Ebel, R. 1961. “Must All Tests Be Valid?” American Psychologist 16: 640–47.

Brennan, R. 2013. “Commentary on ‘Validating the Interpretations and Uses of Test Scores’.” Journal of Educational Measurement 50: 74–83.

Kane, M. T. 2013. “Validating the Interpretations and Uses of Test Scores.” Journal of Educational Measurement 50: 1–73.

Kelley, T. L. 1927. Interpretation of Educational Measurements. Yonkers, NY: World Book Co.

Black, P., and D. Wiliam. 1998. “Inside the Black Box: Raising Standards Through Classroom Assessment.” Phi Delta Kappan 80: 139–48.

Meyer, A. N. D., and J. M. Logan. 2013. “Taking the Testing Effect Beyond the College Freshman: Benefits for Lifelong Learning.” Psychology and Aging 28: 142–47.

Roediger, H. L., P. K. Agarwal, M. A. McDaniel, and K. B. McDermott. 2011. “Test-Enhanced Learning in the Classroom: Long-Term Improvements from Quizzing.” Journal of Experimental Psychology: Applied 17: 382–95.

Santelices, M. V., and M. Wilson. 2010. “Unfair Treatment? The Case of Freedle, the SAT, and the Standardization Approach to Differential Item Functioning.” Harvard Educational Review 80: 106–34.