1 The Carnegie Unit
If you walk into any high school, the first thing you will notice is the way that the working day is organised. The overwhelming majority of schools share a near-identical structure driven by a timetable that divides learning into periods. These periods usually last 40 to 60 minutes each; many are grouped as “doubles” of one-and-a-half or two hours. Often a bell will ring, and students will pack their bags and shuffle to the next class, and then the next, until the day is over. The next day will closely resemble the first, and so on. This system often propagates itself from high school right down to primary school. It is only among much younger learners that the day is structured in a more fluid, transdisciplinary, and flexible manner, allowing for projects, extended learning, even naps and outdoor learning.
Where does this ritual come from? Why is it that this model predominates?
Throughout the Middle Ages and the Enlightenment, courses of study were not standardised and, therefore, varied greatly in length and assessment method. The length of a course and the way it was assessed were decided by its teacher. In the nineteenth century, as universities started to develop across the world, particularly in the United Kingdom and in the United States, admissions teams expressed frustration at the disparity in contact time that different students had experienced in their schooling. Some might have spent over 200 hours learning a subject; others, under 100. How would admissions officers compare such situations and vouch that the student had had the right amount of learning to be eligible for entry into the university? From an assessment point of view, we can see this as a problem of reliability: inconsistent testing methods make testing unfair for students.
Harvard president Charles William Eliot responded by proposing units considered necessary for the correct amount of study to have taken place (for more on this, see Silva, 2015). For Eliot, students would have to study a subject for 120 hours to gain credit. This meant that throughout a school year of 40 weeks — i.e., 52 weeks minus roughly 12 weeks for holidays (depending on the system) — each week would contain 3 hours of courses. These 3 hours would be divided evenly across a week to make it easier for the student to establish
Of course, this model has variations, especially when students go into more depth in certain subjects, taking them as majors or at a higher level. In such circumstances, students might study fewer subjects over the course of a year (6 to 8 subjects, versus 12).
At the end of the 1800s, this credit system was endorsed by the American National Education Association. From then on, high schools would have to ensure that students followed courses for 120 hours to be awarded credit (Shedd, 2003). However, adoption was slow; it was only between 1906 and 1910, when the Carnegie Foundation made this unit of study (120 hours) a mandatory institutional condition for college professors to receive their retirement pensions, that adoption became widespread. This is why the 120 hours of study, split into periods, is called the Carnegie Unit (Carnegie Foundation, 2023).
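The arithmetic behind the unit can be made concrete with a minimal sketch. It assumes only the figures given in this section (120 hours of study, a school year of 40 weeks, periods of 40 to 60 minutes); the function and variable names are illustrative and do not come from any of the cited sources.

```python
# Illustrative arithmetic only: the figures come from the text above,
# the names are hypothetical.
CARNEGIE_HOURS = 120   # hours of study required for one credit
SCHOOL_WEEKS = 40      # 52 weeks minus roughly 12 weeks of holidays


def weekly_hours(total_hours: float = CARNEGIE_HOURS, weeks: int = SCHOOL_WEEKS) -> float:
    """Hours per week needed to reach total_hours over the school year."""
    return total_hours / weeks


def periods_per_week(period_minutes: int) -> float:
    """Number of periods of a given length needed to fill the weekly hours."""
    return weekly_hours() * 60 / period_minutes


if __name__ == "__main__":
    print(weekly_hours())  # 3.0 hours per subject per week
    for minutes in (40, 45, 60):
        print(minutes, periods_per_week(minutes))
```

Under these assumptions, one subject occupies three 60-minute periods or four 45-minute periods each week, which is roughly the pattern that the bell-driven timetable reproduces.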
2 Standardised Curriculum Design
Curriculum sequencing was standardised in order to make the course of study at high schools more reliable in the eyes of universities. Once every unit of study had to comply with minimum time requirements, admissions criteria that the relative chaos and extreme variability of learning experiences had hitherto made tenuous and subjective became much more objective.
However, in standardising study units across the curriculum, the Carnegie Unit was not just making the exposure to learning across different systems and schools comparable; it was also defining, substantively, what the experience of learning meant. The pace of learning would no longer be dictated by the needs of the student or the decisions of the teacher but by the need to get through the curriculum in a certain amount of time. This would lead to the artificial prolongation of some units and, perhaps more damaging, the contraction of others. Teachers and students rushed through some units in order to fit everything into, for example, a week designed according to administrative, rather than pedagogic, exigencies.
Learning is a complex process in which educators should use time judiciously to meet the needs of students. According to needs, pacing, and individual challenges, the time dedicated to learning should be as flexible as possible.
Gifted students typically process information more quickly than other students, whereas students who struggle to access the curriculum will need more time as information is scaffolded and repeated, chunked and reinforced. It is unhelpful to consider the process of learning, which by nature is differentiated and individualised, in terms of standardised chunks of time.
The greatest enemy of understanding is coverage. As long as you are determined to cover everything, you actually ensure that most kids are not going to understand. You’ve got to take enough time to get kids deeply involved in something so they can think about it in lots of different ways and apply it — not just at school but at home and on the street and so on. (Brandt, 1993)
For a more valid learning experience, one in which students can thrive according to their needs, educators would need to design units of study differently, allowing for more time to go deep into understanding and application.
A restructured timetable would imply a restructured assessment system, too, since there would be fewer items to score — and what would be assessed would be depth of understanding rather than coverage of knowledge.
3 Grading: Technical, Emotional, and Psychological Effects
Central to the post–nineteenth-century system of curriculum coverage and assessment is grading. The origins of this practice are debated: for some it began at Yale (Pierson, 1983); for others, at Cambridge (Postman, 1992). What is clear is that, by the end of the 1700s in some universities, professors were dividing attainment into symbols, percentages, numbers, or letters, so as to organise results into categories.
Grading is less an act of formative assessment (meaning, assessment that helps students learn by giving them qualitative feedback on what they can do to improve) and more one of summative assessment (meaning, an act of judgement at the end of a piece of work to communicate to the student what that piece of work is worth).
It is understandable to want to quantify assessment into a neat and clear system that allows evaluators and learners to situate their attainment in a straightforward fashion. However, reams of educational research point out just how damaging grades can be for learning (e.g., Black & Wiliam, 1998; Butler, 2011; Putwain, 2009).
Pulfrey, Buchs, and Butera (2011) “revealed that expectation of a grade for a task, compared with no grade, consistently induced greater adoption of performance-avoidance, but not performance-approach, goals” (p. 683). The work of Dylan Wiliam (2001; 2017) has shown how grades wash out feedback on learning, focussing students’ minds on ego and status rather than on steps for improvement.
Grades are not only considered particularly inefficient for learning but also have several negative backwash effects on wellbeing. Högberg et al. (2021), looking at the effects of the introduction of grading in Swedish schools, found “negative health consequences of accountability policies such as testing and grading” and that there are “stronger effects on girls compared to boys […] in line with studies suggesting that girls are more sensitive to performance-based self-esteem” (p. 1). Crocker et al. (2003), in studying the effects of grading on university students in the United States, found that “bad grades led to greater drops in self-esteem [which] predicted increases in depressive symptoms for students initially more depressed” (p. 507). Wang (2016) found similar outcomes in researching the effects of grading on teenagers.
And yet the ritual of grading is extremely strong in schools, anchored as a cultural norm that seems almost impossible to displace. This is not to say that experiments to move away from grading are scarce; they are abundant. In fact, as Kohn (2013) points out, research going back as far as the 1930s and 1940s (Crooks, 1933; Linder, 1940; De Zouche, 1945) had already identified the dangers and inefficacy of grading, but to little avail.
As of the writing of this book, experiments to assess students beyond and outside of grading are outnumbered massively by the global juggernaut of grading throughout the world’s high schools. I very much hope that anyone
4 Placement Tests and Cut-offs
Since grading systems are used primarily with a summative purpose (as opposed to a formative purpose), their most common application is to rank students for selection eligibility.
We can see several examples of this practice in different national systems. For example, the 11+ Test is administered to Year 6 students in some parts of the United Kingdom to determine entry into grammar schools (which are reputed to be academically rigorous). Students may only take the test once; it is essentially structured as a psychometric evaluation. In Switzerland, for students to enter the academic pathway leading to high-school certification, they must either sit examinations or obtain certain grades at the end of their middle schooling. In the United States, most universities require students to obtain certain scores on standardised admissions tests in order to be considered for admission; and in the United Kingdom, universities will set “tariffs” for entry, meaning that, to be admitted, students must achieve a certain grade at the end of high school.
Selective schools will demand that students submit either a certain grade average, a certain performance on a placement test, or a certain IQ test score in order to be considered for admission.
Schools running special education programmes or streams for gifted and talented students will often require certain IQ test scores to determine who gets into the course and who does not.
The purpose of these selective entry mechanisms is to make sure that students with a certain intellectual and/or cognitive profile are admitted. In most cases, this means that there is less pressure on the schools to raise admitted students’ achievement since students entering the system are already academically groomed, good test-takers, and high achievers. One might ask what the fundamental educative purpose of selective educational systems is: since the premise of education should be to improve learning, it would make more sense for schools to accept the lower-scoring students in order to provide what is known as “value added” to their learning.
One problem that these selective mechanisms cause is the notion of the cut-off grade. Seemingly arbitrary numbers are used to determine whether students progress to a selective institution (or section of the institution) or not. Highly selective North American universities never quote an exact SAT or ACT
Other systems are less subtle and determine very sharp cut-off points for entry. For example, in the 1920s, Terman (1926) claimed that students with an IQ of 140 or higher were “gifted”. (For a more detailed analysis of IQ cut-off points to determine giftedness, see McBee & Makel, 2019.)
Card and Giuliano (2015) show how, in 2005, an unnamed US school district, described as “one of the largest and most diverse school districts in the country” (p. 1), introduced a scheme whereby “non‐disadvantaged students scoring above 130 points on [a type of IQ] test, and [second-language learners and students receiving free or reduced lunches] scoring above 115 points were eligible for referral for IQ testing” (p. 5). Such scores would, depending on subsequent IQ scores, lead to access to a gifted programme. The paper reveals how “relatively high ability students from disadvantaged backgrounds were being overlooked under the traditional referral system” (p. 3) and how the “traditional referral system also misses some high ability non‐disadvantaged students” (p. 15).
So the consequences of performing above or below a threshold — which can mean, for example, how students answer one question worth just a few points — can be significant and have all sorts of implications for students’ future pathways, subsidies, or opportunities. Cut-offs are too narrow as criteria for major decisions on student opportunities; they result in many gifts being missed in the process. More enlightened assessment systems, such as those we explore later in this book, broaden assessment to prevent these narrow cut-off exercises. Unfortunately, they remain the exception: almost all British universities, for example, will select students based on a UCAS tariff points system with very sharp cut-offs.
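How sharply such a rule divides students can be seen in a small sketch. It uses the two thresholds quoted from Card and Giuliano (2015) and reads “above” as strictly greater than the threshold, which is an assumption; the function name, the single `disadvantaged` flag (a simplification of the two groups named in the quotation), and the sample scores are invented for illustration and do not reproduce the district's actual procedure.

```python
# Hypothetical illustration of a sharp cut-off rule. The thresholds
# (130 and 115) are those quoted from Card and Giuliano (2015); the
# names and sample scores below are invented for illustration.
GENERAL_CUTOFF = 130
DISADVANTAGED_CUTOFF = 115


def eligible_for_referral(score: int, disadvantaged: bool) -> bool:
    """Return True if the screening score is strictly above the relevant cut-off.

    Treating "above" as strictly greater is an assumption of this sketch.
    """
    cutoff = DISADVANTAGED_CUTOFF if disadvantaged else GENERAL_CUTOFF
    return score > cutoff


if __name__ == "__main__":
    # Two students one point apart land on opposite sides of the line.
    print(eligible_for_referral(131, disadvantaged=False))
    print(eligible_for_referral(130, disadvantaged=False))
    # The same score can be eligible under one threshold and not the other.
    print(eligible_for_referral(120, disadvantaged=True))
    print(eligible_for_referral(120, disadvantaged=False))
```

The rule returns a yes or a no and nothing else: a single point, which may correspond to one test item, separates a student who is referred from one who is not.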
5 Why Breaking the Checkerboard Is So Difficult
This assessment grid, from the Carnegie Unit to grading to cut-offs, is a tightly regulated and numerical checkerboard. Human potential, which is subtle, variable, culturally specific, and infinitely creative, sits uneasily on this
From Binet’s work on IQ testing through two centuries in which statistical modelling has been the dominant paradigm in the behavioural sciences, this checkerboard has hardened into its central role in education. Entire districts, national education systems, and even global testing schemes rely on it as an axiomatic playing field that determines practices and decisions.
To break up this checkerboard and create something else will require major upheaval, a coordinated effort across several simultaneous matrices. The work will be difficult, but it is not impossible and must remain a hope, so that the way human beings are viewed and evaluated changes.
References
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1), 7–74. https://doi.org/10.1080/0969595980050102
Brandt, R. (1993). On teaching for understanding: A conversation with Howard Gardner. ASCD. https://ascd.org/el/articles/on-teaching-for-understanding-a-conversation-with-howard-gardner
Butler, R. (2011). Enhancing and undermining intrinsic motivation: The effects of task‐involving and ego‐involving evaluation on interest and performance. British Journal of Educational Psychology, 58(1), 1–14. https://doi.org/10.1111/j.2044-8279.1988.tb00874.x
Card, D., & Giuliano, L. (2015). Can universal screening increase the representation of low income and minority students in gifted education? Working paper 21519. National Bureau of Economic Research.
Carnegie Foundation. (2023). What is the Carnegie Unit? https://www.carnegiefoundation.org/faqs/-carnegie-unit/
Crocker, J., Karpinski, A., Quinn, D. M., & Chase, S. K. (2003). When grades determine self-worth: Consequences of contingent self-worth for male and female engineering and psychology majors. Journal of Personality and Social Psychology, 85(3), 507–516. https://doi.org/10.1037/0022-3514.85.3.507. PMID: 14498786
Crooks, A. D. (1933). Marks and marking systems: A digest. Journal of Educational Research, 27(4), 259–272.
De Zouche, D. (1945). “The wound is mortal”: Marks, honors, unsound activities. The Clearing House, 19(6), 339–344.
Glassman, S., & Swanston, B. (2024). Get accepted: What is the average SAT score needed for college admission? Forbes Advisor (updated February 7). https://www.forbes.com/advisor/education/student-resources/average-sat-score/
Högberg, B., Lindgren, J., Johansson, K., Strandh, M., & Petersen, S. (2021). Consequences of school grading systems on adolescent health: Evidence from a Swedish school reform. Journal of Education Policy, 36(1), 84–106. https://doi.org/10.1080/02680939.2019.1686540
Kohn, A. (2013). The case against grades. Counterpoints, 451, 143–153. http://www.jstor.org/stable/42982088
McBee, M., & Makel, M. (2019). The quantitative implications of definitions of giftedness. AERA Open, 5(1). https://doi.org/10.1177/2332858419831007
Meyer, J. H. F., & Land, R. (2006). Threshold concepts and troublesome knowledge: Issues of liminality. In J. H. F. Meyer & R. Land (Eds.), Overcoming barriers to student understanding: Threshold concepts and troublesome knowledge (pp. 19–32). Routledge.
Pierson, G. (1983). C. Undergraduate studies: Yale College. A Yale book of numbers: Historical statistics of the college and university 1701–1976. Yale Office of Institutional Research.
Postman, N. (1992). Technopoly: The surrender of culture to technology. Alfred A. Knopf.
Pulfrey, C., Buchs, C., & Butera, F. (2011). Why grades engender performance-avoidance goals: The mediating role of autonomous motivation. Journal of Educational Psychology, 103(3), 683–700. https://doi.org/10.1037/a0023911
Putwain, D. W. (2009). Assessment and examination stress in Key Stage 4. British Educational Research Journal, 35(3), 391–411. https://doi.org/10.1080/01411920802044404
Shedd, J. (2003). The history of the student credit hour. New Directions for Higher Education, 122(Summer), 5–12. https://doi.org/10.1002/he.106
Silva, E. (2015). The Carnegie unit: A century-old standard in a changing education landscape. Carnegie Foundation for the Advancement of Teaching.
Terman, L. M. (Ed.). (1926). Genetic studies of genius: Mental and physical traits of a thousand gifted children (Vol. 1, 2nd ed.). Stanford University Press.
Wang, L. C. (2016). The effect of high-stakes testing on suicidal ideation of teenagers with reference-dependent preferences. Journal of Population Economics, 29(2), 345–364. https://doi.org/10.1007/s00148-015-0575-7
Wiliam, D. (2001). What is wrong with our educational assessments and what can be done about it? Education Review, 15, 57–62.
Wiliam, D. (2017). Embedded formative assessment (2nd ed.). Solution Tree Press.