Abedi, Jamal. 2004. “The No Child Left Behind Act and English Language Learners: Assessment and Accountability Issues.” Educational Researcher 33: 4–14.

AERA, APA, and NCME. 1999. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. doi:10.18637/jss.v067.i01.

Beck, A T, C H Ward, M Mendelson, J Mock, and J Erbaugh. 1961. “An Inventory for Measuring Depression.” Archives of General Psychiatry 4: 53–63.

Beck, AT, RA Steer, and GK Brown. 1996. “Manual for the BDI-II.” San Antonio, TX: Psychological Corporation.

Bennett, Randy Elliot. 2011. “Formative Assessment: A Critical Review.” Assessment in Education: Principles, Policy & Practice 18 (1). Taylor & Francis: 5–25.

Black, P., and D. Wiliam. 1998. “Inside the Black Box: Raising Standards Through Classroom Assessment.” Phi Delta Kappan 80: 139–48.

Bloom, Benjamin Samuel, and David R Krathwohl. 1956. “Taxonomy of Educational Objectives: The Classification of Educational Goals. Handbook I: Cognitive Domain.” Longmans.

Bond, L. A. 1996. “Norm- and Criterion-Referenced Testing.” Practical Assessment Research & Evaluation 5 (2).

Bradfield, T. A., A. C. Besner, A. K. Wackerle-Hollman, A. D. Albano, M. C. Rodriguez, and S. R. McConnell. 2014. “Redefining Individual Growth and Development Indicators: Oral Language.” Assessment for Effective Intervention 39: 233–44.

Brennan, R. L. 1992. “Generalizability Theory.” Educational Measurement: Issues and Practice 11: 27–34.

———. 2001. Generalizability Theory. New York, NY: Springer.

———. 2013. “Commentary on ‘Validating the Interpretations and Uses of Test Scores’.” Journal of Educational Measurement 50: 74–83.

Briggs, D. C. 2009. “Preparation for College Admission Exams.” Arlington, VA: National Association for College Admission Counseling.

Carter, S D. 2002. “Matching Training Methods and Factors of Cognitive Ability: A Means to Improve Training Outcomes.” Human Resource Development Quarterly 13: 71–88.

Cizek, G. J. 2010. “An Introduction to Formative Assessment.” In Handbook of Formative Assessment, edited by H. L. Andrade and G. J. Cizek, 3–17. New York, NY: Routledge.

College Board. 2012. “The SAT Report on College and Career Readiness: 2012.” New York, NY: College Board.

Cronbach, L J, and R. J. Shavelson. 2004. “My Current Thoughts on Coefficient Alpha and Successor Procedures.” Educational and Psychological Measurement 64: 391–418.

de Ayala, R. J. 2009. The Theory and Practice of Item Response Theory. New York, NY: The Guilford Press.

De Boeck, Paul, Marjan Bakker, Robert Zwitser, Michel Nivard, Abe Hofman, Francis Tuerlinckx, and Ivailo Partchev. 2011. “The Estimation of Item Response Models with the lmer Function from the lme4 Package in R.” Journal of Statistical Software 39 (12). American Statistical Association: 1–28.

Deno, S. L. 1985. “Curriculum-Based Measurement: The Emerging Alternative.” Exceptional Children 52: 219–32.

Deno, S. L., L. S. Fuchs, D. Marston, and J. Shin. 2001. “Using Curriculum-Based Measurement to Establish Growth Standards for Students with Learning Disabilities.” School Psychology Review 30: 507–24.

Doran, H., D. Bates, P. Bliese, and M. Dowling. 2007. “Estimating the Multilevel Rasch Model: With the lme4 Package.” Journal of Statistical Software 20 (2): 1–18.

Ebel, R. 1961. “Must All Tests Be Valid?” American Psychologist 16: 640–47.

Embretson, S. E., and S. P. Reise. 2000. Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.

Ferketich, S. 1991. “Focus on Psychometrics: Aspects of Item Analysis.” Research in Nursing & Health 14: 165–68.

Fuchs, L. S., and D. Fuchs. 1999. “Monitoring Student Progress Toward the Development of Reading Competence: A Review of Three Forms of Classroom-Based Assessment.” School Psychology Review 28: 659–71.

Gardner, William L, and Mark J Martinko. 1996. “Using the Myers-Briggs Type Indicator to Study Managers: A Literature Review and Research Agenda.” Journal of Management 22: 45–83.

Goodwin, Laura D. 2001. “Interrater Agreement and Reliability.” Measurement in Physical Education and Exercise Science 5: 13–34.

Haladyna, T. M., and M. C. Rodriguez. 2013. Developing and Validating Test Items. New York, NY: Routledge.

Haladyna, T. M., S. M. Downing, and M. C. Rodriguez. 2002. “A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment.” Applied Measurement in Education 15: 309–34.

Hambleton, R. K., and R. W. Jones. 1993. “Comparison of Classical Test Theory and Item Response Theory and Their Applications to Test Development.” Educational Measurement: Issues and Practice, 38–47.

Harvey, R. J., and A. L. Hammer. 1999. “Item Response Theory.” The Counseling Psychologist 27: 353–83.

Haynes, S. N., D. C. S. Richard, and E. S. Kubany. 1995. “Content Validity in Psychological Assessment: A Functional Approach to Concepts and Methods.” Psychological Assessment 7: 238–47.

Hockenberry, Marilyn J, and D Wilson. 2012. Wong’s Essentials of Pediatric Nursing. St Louis, MO: Mosby.

Hursh, D. 2005. “The Growth of High-Stakes Testing in the USA: Accountability, Markets, and the Decline in Educational Equality.” British Educational Research Journal 31: 605–22.

Kane, M. T. 2013. “Validating the Interpretations and Uses of Test Scores.” Journal of Educational Measurement 50: 1–73.

Kelley, T. L. 1927. Interpretation of Educational Measurements. Yonkers, NY: World Book Co.

Kline, Paul. 1986. A Handbook of Test Construction: Introduction to Psychometric Design. New York, NY: Methuen.

Kuncel, N. R., S. A. Hezlett, and D. S. Ones. 2001. “A Comprehensive Meta-Analysis of the Predictive Validity of the Graduate Record Examinations: Implications for Graduate Student Selection and Performance.” Psychological Bulletin 127: 162–81.

Likert, R. 1932. “A Technique for the Measurement of Attitudes.” Archives of Psychology 22: 5–55.

Linn, R. L., E. L. Baker, and D. W. Betebenner. 2002. “Accountability Systems: Implications of Requirements of the No Child Left Behind Act of 2001.” Educational Researcher 31: 3–16.

Lord, F. M. 1952. “A Theory of Test Scores.” Psychometric Monographs. No. 7.

Mathews, S., and H. A. Herzog. 1997. “Personality and Attitudes Toward the Treatment of Animals.” Society and Animals 5: 169–75.

Mehrens, W. A. 1992. “Using Performance Assessment for Accountability Purposes.” Educational Measurement: Issues and Practice 11: 3–9.

Messick, S. 1980. “Test Validity and the Ethics of Assessment.” American Psychologist 35: 1012–27.

Meyer, A. N. D., and J. M. Logan. 2013. “Taking the Testing Effect Beyond the College Freshman: Benefits for Lifelong Learning.” Psychology and Aging 28: 142–47.

Militello, Matthew, Jason Schweid, and Stephen G Sireci. 2010. “Formative Assessment Systems: Evaluating the Fit Between School Districts’ Needs and Assessment Systems’ Characteristics.” Educational Assessment, Evaluation and Accountability 22 (1). Springer: 29–52.

Miller, C., and K. Stassun. 2014. “A Test That Fails.” Nature 510: 303–4.

Myers, I. B., M. H. McCaulley, N. L. Quenk, and A. L. Hammer. 1998. “Manual: A Guide to the Development and Use of the Myers-Briggs Type Indicator.” Palo Alto, CA: Consulting Psychologists Press.

Nelson, D. A., C. C. Robinson, C. H. Hart, A. D. Albano, and S. J. Marshall. 2010. “Italian Preschoolers’ Peer-Status Linkages with Sociability and Subtypes of Aggression and Victimization.” Social Development 19: 698–720.

Nelson, H. 2013. “Testing More, Teaching Less: What America’s Obsession with Student Testing Costs in Money and Lost Instructional Time.” American Federation of Teachers.

Nunnally, J. C., and I. H. Bernstein. 1994. Psychometric Theory. New York, NY: McGraw-Hill.

Organization for Economic Cooperation and Development. 2009. “PISA 2009 Reading Literacy Items and Scoring Guides.” Retrieved on April 20, 2016 from

Pittenger, D. J. 2005. “Cautionary Comments Regarding the Myers-Briggs Type Indicator.” Consulting Psychology Journal: Practice and Research 57: 210–21.

Pope, K S, J N Butcher, and J Seelen. 2006. The MMPI, MMPI-2, & MMPI-A in Court: A Practical Guide for Expert Witnesses and Attorneys. 3rd ed. Washington, DC: American Psychological Association.

Popham, W. J., and T. R. Husek. 1969. “Implications of Criterion-Referenced Measurement.” Journal of Educational Measurement 6: 1–9.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Rasch, G. 1960. Probabilistic Models for Some Intelligence and Attainment Tests. Chicago, IL: University of Chicago Press.

Raymond, M. 2001. “Job Analysis and the Specification of Content for Licensure and Certification Examinations.” Applied Measurement in Education 14: 369–415.

Rizopoulos, Dimitris. 2006. “ltm: An R Package for Latent Variable Modelling and Item Response Theory Analyses.” Journal of Statistical Software 17 (5): 1–25.

Robinson, Ken. 1999. “All Our Futures: Creativity, Culture and Education.” London: Department for Education and Employment.

Rodriguez, M. C. 2005. “Three Options Are Optimal for Multiple-Choice Items: A Meta-Analysis of 80 Years of Research.” Educational Measurement: Issues and Practice 24: 3–13.

Roediger, H. L., P. K. Agarwal, M. A. McDaniel, and K. B. McDermott. 2011. “Test-Enhanced Learning in the Classroom: Long-Term Improvements from Quizzing.” Journal of Experimental Psychology: Applied 17: 382–95.

Santelices, M. V., and M. Wilson. 2010. “Unfair Treatment? The Case of Freedle, the SAT, and the Standardization Approach to Differential Item Functioning.” Harvard Educational Review 80: 106–34.

Shavelson, R. J., and Noreen M. Webb. 1991. Generalizability Theory: A Primer. Thousand Oaks, CA: SAGE Publications.

Spearman, Charles. 1904. “General Intelligence, Objectively Determined and Measured.” The American Journal of Psychology 15 (2): 201–92.

Spector, Paul E. 1992. Summated Rating Scale Construction: An Introduction. SAGE Publications.

Stevens, S. S. 1946. “On the Theory of Scales of Measurement.” Science 103: 677–80.

Stiggins, R. J. 1987. “The Design and Development of Performance Assessments.” Educational Measurement: Issues and Practice 6: 33–42.

———. 1991. “Assessment Literacy.” Phi Delta Kappan 72: 534–39.

Torrance, E. P. 1981a. “Empirical Validation of Criterion-Referenced Indicators of Creative Ability Through a Longitudinal Study.” Creative Child and Adult Quarterly 6: 136–40.

———. 1981b. “Predicting the Creativity of Elementary School Children.” Gifted Child Quarterly 25: 55–62.

US Department of Education. 2002. “A New Era: Revitalizing Special Education for Children and Their Families.” Washington, DC: US Department of Education.

Webb, Norman L. 2002. “Depth-of-Knowledge Levels for Four Content Areas.” Language Arts.

Whisman, Mark A, John E Perez, and Wiveka Ramel. 2000. “Factor Structure of the Beck Depression Inventory - Second Edition (BDI-II) in a Student Sample.” Journal of Clinical Psychology 56 (4): 545–51.

Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer.

Wickham, Hadley, and Winston Chang. 2016. devtools: Tools to Make Developing R Packages Easier.

Wiliam, D., and P. Black. 1996. “Meanings and Consequences: A Basis for Distinguishing Formative and Summative Functions of Assessment?” British Educational Research Journal 22: 537–48.

Xie, Y. 2016. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.12.3.

Xie, Yihui. 2015. bookdown: Authoring Books with R Markdown.