A couple of years ago I played a small part in collecting email addresses for researchers who had published climate change-related papers, papers that would end up being the object of study in the since-published Cook et al 2013 "consensus project" paper, which has gained a fair amount of well-deserved popularity. Since then I have had very little time to devote to the Skeptical Science author team (especially considering how involved I was, as a mere freshman, during my latest stint), but from time to time I like to dust off my old coat and wear it around for a couple of days while I explore some other climate-related subject.
The Cook et al paper's conclusion of a ~97% consensus in the scientific literature on the topic of anthropogenic global warming (AGW; that is, the question of whether humans are causing climate change) is important in its own right (very important, in particular, for debunking the "there is no consensus" myth). Here I'd like to comment on some of the finer details of the data collected, in particular the comparison between the ratings that the "citizen science" community at Skeptical Science gave each paper based on the wording of its abstract and the ratings that the papers' own authors gave.
One of my motivations for this post was an argument I got into with somebody on Facebook, under a post that Politifact ran on Marco Rubio's comments regarding climate change, in which Politifact referenced this study. A reasonably valid point that was brought up was that there was no good analysis of whether the responding authors were a representative sample of all the authors whose papers were in the study and, by extension, whether the sample of papers that received self-ratings was representative. It's not quite true that no cursory comparison could be made (the second columns of Tables 3 and 5 allow one), but I'll start here.
Table 3 gives the breakdown of the abstract-ratings, which were given for all of the papers:
Endorse AGW: 32.6% (3896 papers)
No AGW position: 66.4% (7930 papers)
Reject AGW: 0.7% (78 papers)
Uncertain on AGW: 0.3% (40 papers)
Table 5 gives the abstract-ratings of papers whose authors responded to give self-ratings:
Endorse AGW: 36.9% (791 papers)
No AGW position or undecided: 62.5% (1339 papers)
Reject AGW: 0.6% (12 papers)
The fractions are similar. If we want to be more rigorous about this (and I do), then ~95% confidence intervals around our sample fractions are:
Endorse AGW: (34.9%, 39.0%)
No AGW position or undecided: (60.4%, 64.6%)
Reject AGW: (0.3%, 1.0%)
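As a rough check, intervals like the ones above can be reproduced with an exact binomial calculation. The sketch below is not the original script: it assumes the Table 5 counts (791, 1339, and 12 out of 2142), puts 2.5% in each tail (the standard split for a two-sided 95% interval), and finds each bound by bisection. The function names are made up for illustration.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability mass, computed in log space to avoid overflow."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def exact_interval(k, n, alpha=0.05):
    """Two-sided exact interval for a proportion, with alpha/2 in each tail."""
    tail = alpha / 2

    def bisect(too_small):
        # Narrow [lo, hi] around the p at which the tail probability equals `tail`.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if too_small(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= k) rises to `tail`.
    lower = 0.0 if k == 0 else bisect(lambda p: 1.0 - binom_cdf(k - 1, n, p) < tail)
    # Upper bound: the p at which P(X <= k) falls to `tail`.
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) > tail)
    return lower, upper

n_responses = 2142  # papers with self-ratings (Table 5)
for label, count in [("Endorse", 791), ("Neutral", 1339), ("Reject", 12)]:
    low, high = exact_interval(count, n_responses)
    print(f"{label}: ({low:.1%}, {high:.1%})")
```

Bisection is used here only to keep the example free of third-party dependencies; a statistics library with an exact binomial interval should give the same bounds.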
[Technical note: these were calculated from binomial distributions. For a candidate probability, if the cumulative probability mass at or below the sample count was less than 2.5% of the total, then that probability was an upper bound, and so I found the least upper bound. Similarly for lower bounds: if the probability mass at or above the sample count was less than 2.5%, the candidate was a lower bound, and I found the greatest one. Perhaps these bounds should be 5%?] The figure below hopefully gives an easier-to-understand visualization of these numbers:
[Technical note 2: I got the actual data from here, and it doesn't quite match the numbers given in the paper. I'm unsure of the discrepancy.] If the sample were perfect (that is, if it reflected the population fractions), then the black dots would sit atop the circles, or at least very close. But there is a slight bias: authors of papers that were abstract-rated as endorsing AGW appear to have been more likely to respond than authors whose papers were rated as neutral. The authors of course had no way of knowing what the abstract-ratings were, so this is a response bias. The authors of papers abstract-rated as rejecting AGW responded at a rate you'd expect.
There's a bit of discrepancy in the paper on how to count papers that were "uncertain" of AGW. Table 3 tosses them in with the "rejects AGW" papers, but later on in the paper they're paired with the "no stance" papers (which I've always taken to mean "did not comment on the direct question, 'are humans the cause of climate change?'"). This doesn't change the results appreciably since the number of "uncertain" papers is so small, but it's the difference between 97.1% and 97.9%. Since the authors who were contacted were not given an option to rate their papers as "uncertain", I'll presume there's no difference between the "uncertain" papers and the "no stance" papers, so any of these percentages may be marginally lower than what I'm showing.
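To put a number on that response bias, here is a small sketch comparing response rates by abstract-rating. It uses the Table 3 and Table 5 counts quoted above rather than the raw data, and it assumes the 40 "uncertain" papers belong with the neutral group, to match Table 5's "no position or undecided" row.

```python
# Abstract-rated counts: all papers (Table 3) and papers whose authors responded (Table 5).
all_papers = {"endorse": 3896, "neutral": 7930 + 40, "reject": 78}
responded = {"endorse": 791, "neutral": 1339, "reject": 12}

overall_rate = sum(responded.values()) / sum(all_papers.values())
rates = {group: responded[group] / all_papers[group] for group in all_papers}

for group, rate in rates.items():
    print(f"{group:8s} response rate {rate:.1%} (overall {overall_rate:.1%})")
```

With so few "reject" papers, the rate for that group is noisy, so it is best read alongside its confidence interval rather than on its own.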
If the Skeptical Science contributors were very accurate in their ratings, predicting self-ratings with abstract ratings on the same side, then the bias would be corrected for by simply referring to the population parameters, and we'd have a 97.1% consensus (or 97.9% depending on how you allocate the "uncertain" papers).
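The two allocations can be worked through directly from the Table 3 counts quoted earlier. This is only a sketch of the arithmetic. With these counts the second figure comes out nearer 98.0% than 97.9%, which may reflect the small mismatch between the paper's tables and the downloaded data mentioned in the technical note.

```python
# Abstract-rated counts from Table 3.
endorse, reject, uncertain = 3896, 78, 40

# "Uncertain" counted among papers taking a position (alongside the rejections).
with_uncertain = endorse / (endorse + reject + uncertain)

# "Uncertain" treated as "no stance" and left out of the denominator.
without_uncertain = endorse / (endorse + reject)

print(f"uncertain counted with positions: {with_uncertain:.1%}")
print(f"uncertain counted as no stance:   {without_uncertain:.1%}")
```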
In reality, the authors did not self-rate according to how the abstract-ratings turned out: 22.5% of papers abstract-rated as endorsing AGW were self-rated as neutral, and 1.5% as rejecting; 55.0% of papers abstract-rated as neutral were self-rated as endorsing AGW, and 1.5% as rejecting; 30% of papers abstract-rated as rejecting AGW were self-rated as endorsing AGW, the rest as rejecting. So the contributors were not very accurate, and if we apply these ratios to a perfect sample (that is, to the population fractions), then our consensus percentage (which should be compared to the 97.9% figure above) is 96.9%.
This is an interesting result: when you correct for contributor "error" (or "bias", as you please) and for response bias from the authors, you get a smaller consensus value than by not correcting. The effect is small, but it's there. We might not have guessed this would be the case considering the high (55.0%) neutral-to-endorse rate, but because the aggregate statistic results from the dynamics of its subgroups, we can get wonky behavior like that. The contributors weren't conservative enough to get an artificially low consensus. (For another counter-intuitive example of subgroup statistics resulting in a different aggregate statistic, you could look to the UC Berkeley gender bias admissions case, often cited as an example of Simpson's paradox.)
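Here is a sketch of how that corrected figure can be reproduced: apply the abstract-to-self-rating ratios to the full population of abstract-rated papers and recompute the consensus. It assumes the Table 3 counts with "uncertain" folded into neutral, and it fills in the unstated shares as remainders (76.0% endorse-to-endorse, 43.5% neutral-to-neutral). The variable names are illustrative only.

```python
# Abstract-rated counts for all papers; "uncertain" folded into neutral (an assumption).
population = {"endorse": 3896, "neutral": 7930 + 40, "reject": 78}

# P(self-rating | abstract-rating), from the percentages quoted above.
# Shares not quoted directly are inferred as the remainder of each row.
transition = {
    "endorse": {"endorse": 0.760, "neutral": 0.225, "reject": 0.015},
    "neutral": {"endorse": 0.550, "neutral": 0.435, "reject": 0.015},
    "reject":  {"endorse": 0.300, "neutral": 0.000, "reject": 0.700},
}

# Expected number of papers per self-rating if every author had responded.
expected = {"endorse": 0.0, "neutral": 0.0, "reject": 0.0}
for abstract_rating, count in population.items():
    for self_rating, share in transition[abstract_rating].items():
        expected[self_rating] += count * share

# Consensus among papers that take a position.
consensus = expected["endorse"] / (expected["endorse"] + expected["reject"])
print(f"corrected consensus: {consensus:.1%}")
```

The neutral-to-neutral share does not affect the result, since neutral papers drop out of the consensus ratio.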
How did the "citizen scientists" do overall?
I don't think it would be out of the question to compare the abstract-rating/self-rating system with a medical test for a disease: you can get false positives, false negatives, and of course true positives and negatives. In terms of the study at hand, I'll call a false positive "abstract-rated as neutral or positive when the author self-rated negative"; a false negative will be "abstract-rated as negative when the author self-rated as positive or neutral"; true readings are when both are positive, or both are negative, or both are neutral. So, except for "true" readings, I'm only looking at whether something rejects the consensus, as per the conditions in ratings 5/6/7 in the paper.
We already know the true positive/neutral/negative rates, from the numbers above: the true positive rate was 76.0% (or 98.5% should we choose to include the neutral papers, or 98.0% if we exclude the neutral papers altogether), the true neutral rate was 55.0%, and the true negative rate was 70%.
This is different from saying that the contributors had a 76.0% (or 98.5%/98.0%) chance of abstract-rating a paper as positive given a (self-rated) positive, or a 70% chance of abstract-rating a negative given a (self-rated) negative. Rather, given an abstract-rating, these numbers tell us the probability that the corresponding self-rating is the same.
These are generally good numbers, not great but not bad either. It seems that when the contributors suggested something, they were generally accurate about it, or at least conservative about their estimates.
However, the false ratings tell a different story.
False positives: of 39 self-rated negatives, 12 were abstract-rated positive and 20 neutral: 32/39 = 82.1%
False neutrals: 1 − 0.550 = 45.0%
False negatives: of 759 self-rated neutrals and 1338 self-rated positives, 3 were abstract-rated negative: 3/2097 = 0.14%
So the contributors were good about not accidentally labeling a paper as negative when it would be self-rated as positive or neutral, with a false negative rate of 0.14%, but they were atrocious at correctly labeling papers that were self-rated as negative. Even a conservative lower bound puts the rate at 65.0%.
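The counts above also make it easy to see why the direction of the conditioning matters. The sketch below recomputes the false positive and false negative rates, then compares the chance that an abstract-rated negative was self-rated negative with the chance that a self-rated negative was abstract-rated negative. The count of 7 papers rated negative both ways is inferred here as 39 − 12 − 20 and is not stated directly in the text.

```python
# Self-rated negatives, split by how the contributors rated the abstract.
self_negative = 39
abstract_pos_self_neg = 12
abstract_neu_self_neg = 20
abstract_neg_self_neg = self_negative - abstract_pos_self_neg - abstract_neu_self_neg  # inferred

# Self-rated neutrals and positives, and how many of them were abstract-rated negative.
self_pos_or_neu = 759 + 1338
abstract_neg_self_pos_or_neu = 3

false_positive_rate = (abstract_pos_self_neg + abstract_neu_self_neg) / self_negative
false_negative_rate = abstract_neg_self_pos_or_neu / self_pos_or_neu

# Same cell of the table, conditioned in the two different directions.
abstract_negative = abstract_neg_self_neg + abstract_neg_self_pos_or_neu
p_self_neg_given_abstract_neg = abstract_neg_self_neg / abstract_negative
p_abstract_neg_given_self_neg = abstract_neg_self_neg / self_negative

print(f"false positive rate: {false_positive_rate:.1%}")
print(f"false negative rate: {false_negative_rate:.2%}")
print(f"P(self negative | abstract negative): {p_self_neg_given_abstract_neg:.1%}")
print(f"P(abstract negative | self negative): {p_abstract_neg_given_self_neg:.1%}")
```

The two conditional probabilities share a numerator but have very different denominators, which is how a 70% "true negative" figure can sit alongside a very high false positive rate.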
To illustrate this, we can look at what the abstract-ratings were for papers that were self-rated a 5, 6, or 7:
Surprisingly, among papers abstract-rated as "positive" with a 1 or 2, more were self-rated as 7 than as 5 or 6. These discrepancies have to have some sort of explanation. The study ensured that at least two contributors rated each paper, and if there was disagreement then the matter was discussed or a third person was brought in; either way, these exchanges were anonymous between the participants (and I should know, I was around at Skeptical Science when these ratings were taking place). It's highly unlikely that the contributors simply misread abstracts so badly as to rate them as endorsing AGW with quantification when the authors thought their papers rejected AGW with quantification. Were the authors confused as to which end of the scale indicated endorsement when they were self-rating? I have not yet looked at these specific papers, so I cannot say right now.
But of course, one final question I want to answer: what was the average departure from the "actual" ratings (self-ratings) for each number? First, the same as the figure above, but for the remaining self-ratings:
and second, the average abstract-rating for each self-rating, with variance bounds:
There appears to be a bias toward the endorsement side. With regard to self-ratings that are endorsements, all of the abstract-ratings were more conservative, and significantly so, so the real "bias" there is in the other direction. For papers self-rated as neutral, there is a slight bias toward the endorsement side, but the variance interval encompasses an average abstract-rating of 4 (and from the previous figure we can see that the overwhelming majority of self-rated 4s were previously abstract-rated as 4s as well). For the self-ratings rejecting AGW, we saw above how these are heavily skewed toward the endorsement side, potentially due to contributor bias toward endorsement (or even contributor bias toward neutrality, a "true conservatism"), potentially due to misunderstanding of the scale by authors, and so on.
This does not seem to be a very drastic source of error, and an argument could just as easily be made (perhaps most easily be made) that the contributors to the Cook et al study were erring toward the mean (toward neutrality). If any bias is unacceptable, though, the results from the self-ratings still hold, even after they are adjusted downward to correct for response bias and the bias in ratings. The main reason the 5/6/7 self-ratings in the above figure can be so far from their expected values (assuming perfect predictive power on the part of the contributors) while the result is still not very different is that there really are very few papers that reject anthropogenic global warming. I myself will probably stick to a 96.9% consensus figure, though that is not very different from the 97.1%, or 97.2%, figure from the paper.
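For anyone wanting to redo the per-rating averages discussed above, a minimal sketch follows. The rating pairs in it are made-up placeholders, not the study's data, and using the population variance as the spread measure is an assumption, since how the bounds in the figure were drawn isn't spelled out.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical (self_rating, abstract_rating) pairs on the 1-7 scale; NOT the study's data.
pairs = [(1, 1), (1, 2), (1, 3), (4, 4), (4, 4), (4, 3), (7, 2), (7, 6), (7, 7)]

# Group the abstract-ratings by the self-rating the authors gave.
by_self_rating = defaultdict(list)
for self_rating, abstract_rating in pairs:
    by_self_rating[self_rating].append(abstract_rating)

# Mean and variance of the abstract-ratings for each self-rating.
summary = {
    self_rating: (mean(ratings), pvariance(ratings))
    for self_rating, ratings in sorted(by_self_rating.items())
}

for self_rating, (avg, var) in summary.items():
    print(f"self-rating {self_rating}: mean abstract-rating {avg:.2f}, variance {var:.2f}")
```

A mean above the self-rating means the contributors rated those abstracts further toward rejection than the authors did, and a mean below it means they rated them further toward endorsement.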