Looking at the “Girls do better when their math exams are graded anonymously” paper

Last week the NYT had an article about girls and math that caused quite a stir. For example:

Girls Outscore Boys on Math Tests, Unless Teachers See Their Names

All of the mentions of the underlying paper by Victor Lavy and Edith Sand made me want to look a little deeper, so I bought a copy of the paper yesterday. It cost $5 if you are interested in getting a copy for yourself:

On The Origins of Gender Human Capital Gaps: Short and Long Term Consequences of Teachers’ Stereotypical Biases

After reading the paper, I think the headlines have gotten a little in front of what the paper actually says. For example, this statement on page 11 of the paper might be surprising given the headlines:

“The distributions of this measure by subject are presented in Figure 1. English teachers in primary school over-assess girls (mean is -0.74) and the same pattern is seen for Hebrew teachers (mean is -0.41). Math teachers’ assessment in primary school, on the other hand, is on average gender neutral (0.01).”

The claim in the paper that seemed to get the most attention was that girls did better when the graders didn’t know their names. The numbers backing up this claim are presented in Table 2 on page 32 of the paper.

With just over 4,000 boys and 4,000 girls tested, here were the results:

The “School” exam that was graded by teachers, with the overall exam average set to 0:

Boys: Mean score of 0.052 with a standard deviation of 0.985
Girls: Mean score of 0.003 with a standard deviation of 0.971

The “National” exam that was graded anonymously, with the overall exam average set to 0:

Boys: Mean score of -0.014 with a standard deviation of 1.034
Girls: Mean score of 0.014 with a standard deviation of 0.963

[post publication note – I had incorrectly typed “standard error” rather than “standard deviation” in the tables above when I published this. An accidental typo on my part. Table 2 in the paper clearly states they are presenting standard deviations not standard errors. Sorry about any confusion that caused.]

So, indeed the girls performed worse than the boys in the “school” exam and better than the boys in the “national” exam. But the difference is minuscule: both groups essentially performed identically in both exams.
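For anyone who wants to check the size of those gaps, here is a minimal Python sketch that just subtracts the mean scores quoted above. The dictionary layout and function name are my own, chosen for illustration; the only inputs are the four means from Table 2, and the results are in the same standardized-score units as the table.

```python
# Mean standardized scores as quoted above from Table 2 of the paper.
# The structure and names here are illustrative, not from the paper.
TABLE2_MEANS = {
    "school": {"boys": 0.052, "girls": 0.003},     # graded by teachers
    "national": {"boys": -0.014, "girls": 0.014},  # graded anonymously
}


def boy_girl_gap(exam):
    """Boys' mean minus girls' mean for one exam, rounded to 3 decimals."""
    means = TABLE2_MEANS[exam]
    return round(means["boys"] - means["girls"], 3)


school_gap = boy_girl_gap("school")
national_gap = boy_girl_gap("national")
# How much the boy-minus-girl gap changes between the two exams.
gap_change = round(school_gap - national_gap, 3)

print("school gap (boys - girls):  ", school_gap)
print("national gap (boys - girls):", national_gap)
print("change in gap:              ", gap_change)
```

A positive gap means the boys’ average was higher and a negative gap means the girls’ average was higher. Every number printed is a small fraction of the roughly 1.0 standard deviations reported for each group.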

Maybe a little context helps understand the numbers, so I thought of the numbers this way:

Imagine that the 4,000 boys and 4,000 girls were flipping coins 100 times each in both exams. That is 400,000 flips per group per exam, so we’d expect to see “heads” 200,000 times from each group in each exam.

In the first exam, the boys averaged (0.052) / (0.985) = 5.28% of a standard deviation above the overall mean. For 400,000 coin flips the standard deviation of the number of heads is \sqrt{400,000}/2 = 316, so the boys got about 17 more “heads” than they were expecting to get. Similarly, the girls got 316*(0.003)/(0.971), or about 1 extra head.

So, in the “school” exam, instead of the 200,000 expected heads the boys got 200,017 heads and the girls got 200,001.

In the national exam the analogous numbers are: boys – 199,996 heads and girls – 200,005 heads.
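Since I asked for someone to double check these numbers, here is a short Python sketch of the coin-flip conversion. It follows the same steps as above: divide each mean by its standard deviation, then scale by the standard deviation of the number of heads in 400,000 flips. The function and variable names are made up for this illustration, and the 4,000 students per group is the rounded figure used above, not the exact count from the paper.

```python
import math

N_STUDENTS = 4000        # rounded group size used in the analogy
FLIPS_PER_STUDENT = 100  # flips per student in the analogy


def heads_equivalent(mean, sd):
    """Convert a group's mean score into a coin-flip 'heads' count.

    The mean is first expressed as a fraction of the group's standard
    deviation, then scaled by the standard deviation of the number of
    heads in N_STUDENTS * FLIPS_PER_STUDENT fair coin flips.
    """
    total_flips = N_STUDENTS * FLIPS_PER_STUDENT
    expected_heads = total_flips / 2
    coin_sd = math.sqrt(total_flips) / 2  # about 316 for 400,000 flips
    return round(expected_heads + coin_sd * mean / sd)


# (mean, standard deviation) pairs as quoted above from Table 2.
scores = {
    ("school", "boys"): (0.052, 0.985),
    ("school", "girls"): (0.003, 0.971),
    ("national", "boys"): (-0.014, 1.034),
    ("national", "girls"): (0.014, 0.963),
}

for (exam, group), (mean, sd) in scores.items():
    print(f"{exam:8s} {group:5s} {heads_equivalent(mean, sd):,} heads")
```

Running it reproduces the four head counts given above: 200,017 and 200,001 for the school exam, 199,996 and 200,005 for the national exam.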

Assuming that I’ve understood the presentation of numbers in the paper correctly (and I’d love for someone to double check my numbers here), I’m going to have a hard time attributing the difference between those two results to teacher bias, or anything else for that matter. You wouldn’t expect a measure of any group to be EXACTLY the same on two different tests. What we saw here was that the difference was next to nothing. Also, the difference in performance between the “anonymous” grading and “non-anonymous” grading for both the boys and for the girls seems to be, well, practically 0.



3 Comments so far.
  1. Even if the result was statistically significant it doesn’t mean that it is big enough to worry about. Newspaper people ought to go on a stats course!

  2. You expect this from lesser papers, but it’s a shame that almost all journalism these days seems to be about grabbing eyeballs or clicks with quick, catchy headlines, instead of serious, nuanced or careful analysis. It only lends more ammo to the growing anti-science crowd in this country when so much science (and math) reporting is done sloppily.

  3. Hi Mike. Your analysis switched from using 0.985 as the standard error for the boys exam to using 0.985 as the standard deviation for the boys exam. The title of Table 2 in the NBER paper indicates that the table presents standard deviations, but the note at the bottom of Table 2 indicates that the table presents standard errors. I’d guess that the parenthetical values in Table 2 are standard deviations, but I’m not sure that it’s necessary to guess: the 0.077 value for math in column 7 is reported as a difference in standardized scores, so I think that the 0.077 value indicates that the “bias” is 7.7% of a standard deviation.
