Last week the NYT had an article about girls and math that caused quite a stir. For example:

Girls Outscore Boys on Math Tests, Unless Teachers See Their Names

All of the mentions of the underlying paper by Victor Lavy and Edith Sand made me want to look a little deeper, so I bought a copy of the paper yesterday. It cost $5 if you are interested in getting a copy for yourself:

On The Origins of Gender Human Capital Gaps: Short and Long Term Consequences of Teachers’ Stereotypical Biases

After reading the paper, I think the headlines have gotten a little in front of what the paper actually says. For example, this statement on page 11 of the paper might be surprising given the headlines:

“The distributions of this measure by subject are presented in Figure 1. English teachers in primary school over-assess girls (mean is -0.74) and the same pattern is seen for Hebrew teachers (mean is -0.41). Math teachers’ assessment in primary school, on the other hand, is on average gender neutral (0.01).”

The idea in the paper that seemed to get the most attention was that girls did better when the graders didn’t know their names. The numbers backing up this claim are presented in Table 2 on page 32 of the paper.

With just over 4,000 boys and 4,000 girls tested, here were the results:

The “School” exam that was graded by teachers, with the overall exam average set to 0:

Boys: Mean score of 0.052 with a standard deviation of 0.985

Girls: Mean score of 0.003 with a standard deviation of 0.971

The “National” exam that was graded anonymously, with the overall exam average set to 0:

Boys: Mean score of -0.014 with a standard deviation of 1.034

Girls: Mean score of 0.014 with a standard deviation of 0.963

[post publication note – I had incorrectly typed “standard error” rather than “standard deviation” in the tables above when I published this. An accidental typo on my part. Table 2 in the paper clearly states they are presenting standard deviations not standard errors. Sorry about any confusion that caused.]

So, indeed the girls performed worse than the boys in the “school” exam and better than the boys in the “national” exam. But the difference is minuscule: both groups essentially performed identically in both exams.

Maybe a little context helps understand the numbers, so I thought of the numbers this way:

Imagine that the 4,000 boys and 4,000 girls each flipped a coin 100 times in both exams. Each group makes 400,000 flips per exam, so we’d expect to see “heads” 200,000 times from the boys and 200,000 times from the girls in each exam.

In the first exam, the boys averaged (0.052) / (0.985) = about 5.3% of a standard deviation above the overall mean. For 400,000 fair coin flips the standard deviation of the number of heads is the square root of 400,000 × 0.5 × 0.5, or about 316, so the boys got about 17 more “heads” than they were expecting to get. Similarly, the girls got 316*(0.003)/(0.971), or about 1 extra head.

So, in the “school” exam, instead of the 200,000 expected heads the boys got 200,017 heads and the girls got 200,001.

In the national exam the analogous numbers are: boys – 199,996 heads and girls – 200,005 heads.
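For anyone who wants to double check the coin-flip arithmetic, here is a minimal Python sketch of it. It assumes exactly 4,000 students per group (the paper has just over 4,000), uses the means and standard deviations quoted above from Table 2, and the function name `heads_equivalent` is just my own label for the conversion, not anything from the paper.

```python
import math

FLIPS_PER_STUDENT = 100
STUDENTS_PER_GROUP = 4_000  # assumption: rounded down from "just over 4,000"

# (mean, standard deviation) of the standardized scores quoted from Table 2
scores = {
    ("school", "boys"): (0.052, 0.985),
    ("school", "girls"): (0.003, 0.971),
    ("national", "boys"): (-0.014, 1.034),
    ("national", "girls"): (0.014, 0.963),
}


def heads_equivalent(mean, sd):
    """Translate a group's mean score into a total number of heads.

    The mean is first expressed as a fraction of the group's standard
    deviation, then scaled by the standard deviation of the number of
    heads in the group's fair coin flips.
    """
    total_flips = FLIPS_PER_STUDENT * STUDENTS_PER_GROUP
    expected_heads = total_flips / 2
    sd_heads = math.sqrt(total_flips * 0.5 * 0.5)  # about 316
    return round(expected_heads + sd_heads * mean / sd)


results = {key: heads_equivalent(*vals) for key, vals in scores.items()}

for (exam, group), heads in results.items():
    print(f"{exam:8s} {group:5s} {heads:,}")
```

This only reproduces the analogy as I set it up. It is not a significance test, and it says nothing about the other analyses in the paper.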

Assuming that I’ve understood the presentation of numbers in the paper correctly (and I’d love for someone to double check my numbers here), I’m going to have a hard time attributing the difference between those two results to teacher bias, or anything else for that matter. You wouldn’t expect a measure of any group to be EXACTLY the same on two different tests. What we saw here was that the difference was next to nothing. Also, the difference in performance between the “anonymous” grading and “non-anonymous” grading for both the boys and for the girls seems to be, well, practically 0.