Just how reliable are exam grades?

Exploding the myth that exam grades are objective

Dennis Sherwood, creativity consultant and campaigner for reliable assessments. Formerly a consultant to Ofqual.

In reply to a question posed at the Select Committee hearing on 2nd September 2020, Dame Glenys Stacey, Ofqual’s Interim Chief Regulator, made an important statement: that exam grades are “reliable to one grade either way”.

And in reply to an earlier question, Dr Michelle Meadows, Ofqual’s Executive Director for Strategy, Risk and Research, said that she “took solace … that 98% of A level, and 96% of GCSE grades are accurate plus or minus one grade”.

What do those words mean? What is their significance – especially to that candidate whose grade 3 in GCSE Maths has just slammed all those doors shut? Might the grade have been a 4?


Digging deeper

Those figures – 98% and 96% – sound so reassuring. But these statements don’t tell us what we need to know: the actual reliability of exam grades. For this, we need Figure 12 of Ofqual’s November 2018 report, Marking Consistency Metrics – An update:

For each of the 14 subjects, the heavy line in the darker blue box answers this question:

“If the scripts submitted by an entire subject cohort were fairly and independently marked twice – by an ‘ordinary examiner’ and then by a ‘senior examiner’, whose mark determines what Ofqual call the ‘definitive grade’ – for what percentage of scripts would the originally-awarded grade be confirmed?”


On average, 1 exam grade in 4 is wrong

If grades were fully reliable, then, for all subjects, the answer would be “100%”, corresponding to a ‘probability’ of 1.0 in Figure 12.

But this is not the case.

Rather, the average reliability varies by subject, from about 96% for (all varieties of) Mathematics to about 52% for Combined English Language and Literature. Weighting each subject by the size of its cohort gives an average reliability, across the subjects shown, of close to 75%*.

I know of no data for other subjects, but I have made some estimates for those not listed*, and am confident that the average reliability of all GCSE, AS and A-level grades, across all subjects, is that same number, about 75%.

Accordingly, if the 6 million scripts, as typically submitted for each summer’s exams, were to be re-marked by a senior examiner, some 4.5 million (about 75%) of the originally-awarded grades would be confirmed, and around 1.5 million (about 25%) would be changed, approximately half upwards and half downwards.

Or, in simple terms, on average, 1 exam grade in every 4 is wrong.

Counting the errors

To reconcile this with Ofqual’s acknowledgement that “98% of A-level grades and 96% of GCSE grades are accurate plus or minus one grade”, I draw on my simulation* of the exam results of 2019 A-level English Literature.

If the grades were 100% reliable, then a re-mark by a senior examiner would confirm the awards to all 40,824 candidates.

But according to Figure 12, only 58% of the entries – that’s about 23,695 candidates – would have the original grade confirmed, whatever that grade might be.

My simulation* shows that a senior examiner’s re-mark would result in a further 16,409 candidates being awarded either one grade higher (8,085) or one grade lower (8,324) than the original grade. The number of candidates given the original grade, or one higher, or one lower (that’s “accurate plus or minus one grade” or “reliable to one grade either way”) is therefore 23,695 + 16,409 = 40,104. This is 98% of the total cohort of 40,824, which matches Ofqual’s statement.

Finally, 402 candidates would be re-graded two grades higher, and 318 two grades lower, giving a total of 720 students two grades adrift. 

The totals reconcile as

Grade confirmed: 23,695 (58%)
One grade adrift: 16,409 (40%)
Sub-total: 40,104 (98%)
Two grades adrift: 720 (2%)
Grand total: 40,824 (100%)
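
For readers who want to check the arithmetic, the reconciliation can be reproduced in a few lines of Python. This is only a check on the figures quoted in this article, not the simulation itself; the variable names are illustrative.

```python
# Figures quoted in the article for 2019 A-level English Literature.
cohort = 40_824
confirmed = 23_695
one_higher, one_lower = 8_085, 8_324
two_higher, two_lower = 402, 318

one_adrift = one_higher + one_lower    # candidates one grade adrift
two_adrift = two_higher + two_lower    # candidates two grades adrift
within_one = confirmed + one_adrift    # "accurate plus or minus one grade"


def pct(count: int) -> int:
    """Share of the cohort as a whole-number percentage."""
    return round(100 * count / cohort)


# The three categories must account for every candidate.
assert within_one + two_adrift == cohort

for label, count in [
    ("Grade confirmed", confirmed),
    ("One grade adrift", one_adrift),
    ("Sub-total", within_one),
    ("Two grades adrift", two_adrift),
    ("Grand total", cohort),
]:
    print(f"{label}: {count:,} ({pct(count)}%)")
```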

Ofqual’s statements that “exam grades are reliable to one grade either way” and “98% of A-level exam grades are accurate plus or minus one grade” are true.

But they mask a deeper truth.

That, on average, 1 exam grade in every 4 is wrong.

Why are exam grades so unreliable?

The unreliability of exam grades is not primarily caused by “marking error”*; it is a consequence of Ofqual’s policy for determining grades – a policy that fails to recognise that “it is possible for two examiners to give different but appropriate marks to the same answer”. If those two “different but appropriate marks” are within the same grade width, there is no problem. But if they are on different sides of a grade boundary, there is a very big problem indeed, for the grade on the certificate results from the lottery of which examiner happened to mark the script, and on which side of the grade boundary that mark happens to lie.
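
The mechanism can be illustrated with a toy simulation. To be clear, this is not Ofqual’s method, nor the simulation referred to earlier: the grade boundaries, the range of marks and the size of the legitimate difference between examiners (here called `tolerance`) are all made-up assumptions, chosen only to show how two “different but appropriate marks” can land on opposite sides of a boundary.

```python
import random
from bisect import bisect_right

# Hypothetical grade boundaries (in marks); not taken from any real exam.
BOUNDARIES = [40, 50, 60, 70, 80]


def grade(mark: int) -> int:
    """Grade index for a mark: the number of boundaries at or below it."""
    return bisect_right(BOUNDARIES, mark)


def confirmation_rate(tolerance: int, n_scripts: int = 20_000, seed: int = 0) -> float:
    """Fraction of scripts whose grade is the same under both examiners.

    Assumption: the ordinary examiner's mark differs from the senior
    examiner's mark by a whole number drawn uniformly from
    -tolerance..+tolerance, every value being an 'appropriate' mark.
    """
    rng = random.Random(seed)
    same = 0
    for _ in range(n_scripts):
        senior = rng.randint(30, 90)  # hypothetical spread of marks
        ordinary = senior + rng.randint(-tolerance, tolerance)
        if grade(ordinary) == grade(senior):
            same += 1
    return same / n_scripts


if __name__ == "__main__":
    for tol in (0, 1, 3, 5):
        print(f"tolerance ±{tol}: {confirmation_rate(tol):.0%} of grades confirmed")
```

With a tolerance of zero, every grade is confirmed. As the tolerance grows, more scripts straddle a boundary and fewer grades are confirmed, even though no examiner in this sketch has made a “marking error”.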

How do we fix the problem?

This is not difficult to fix, as this article shows. And even if GCSEs follow the infamous algorithm into the dustbin, A-levels, in some form, are still likely to be around, so this problem needs to be resolved.

Grades should not be “reliable to one grade either way”.

They should be “reliable”. Full stop.

* Further information available from the author on request.
