On September 18, 2017, the results of a national poll of American college undergraduates were published on the website of the Brookings Institution. The results were commented on by the researcher who initiated the study, John Villasenor—a professor of electrical engineering at the University of California, Los Angeles, and a nonresident senior fellow at Brookings. The survey, conducted in August, received financial support from the Charles Koch Foundation. Its central topic was college students’ knowledge of, and attitudes toward, the First Amendment. As its name suggests, the amendment is the first addition to the Constitution of the United States of America, and it deals, among other things, with the issue of “freedom of speech.” Specifically, the amendment prohibits Congress from passing any law that would curtail “freedom of speech.” The poll in question explored this issue with college students in the U.S. who are American citizens.
The poll generated controversy not only for its substantive findings (“A chilling study shows how hostile college students are toward free speech”—Washington Post), but also for its methodology (“‘Junk science’: experts cast doubt on widely cited college free speech survey”—The Guardian). In this comment, I will concentrate on the latter: what is considered legitimate knowledge and what is not?
The author of the Guardian piece (09/22/2017) spoke to several polling experts. One of them, Cliff Zukin, a former president of the American Association for Public Opinion Research (AAPOR, 2005-6), was reported as saying that the professor’s survey was “malpractice,” and “junk science.” [AAPOR describes itself as “the leading professional organization of public opinion and survey research professionals in the U.S., with members from academia, media, government, the non-profit sector and private industry.” Disclosure: I am a member of this organization.] Zukin opined that the Brookings poll should never have been reported in the press. He added, somewhat nonsensically, “If it’s not a probability sample, it’s not a sample of anyone [emphasis added], it’s just 1,500 college students who happen to respond.” Another past president of AAPOR, Michael Traugott (1999-2000), was also interviewed. He stated, more diplomatically, that the poll was “an interesting piece of data.” But he, too, doubted its validity: “Whether it represents the proportion of all college students who believe this is unknown.” The current president of AAPOR, Timothy Johnson (2017-8), was contacted as well. He is reported as saying that the survey was “really not appropriate.” Finally, a vice-president at Ipsos, a multinational commercial polling firm, was asked what he thought. In his view, the professor “overstate[d] the quality of his survey.” How did Villasenor go about doing that? By providing a “margin of error,” said the Ipsos man.
In search of the poll’s methodology
So what do we know about the way this survey was conducted? Not much. But this is not unusual for polls. Villasenor’s methodology section is minimalist, to say the least, especially when it comes to the way his sample was selected. The poll was conducted online and 1,500 students responded. How did he find these students? We are not told. How many were eligible? Again, we are not told. How many eligible students were contacted? No answer. The field period started August 17 and ended August 31, 2017. Professor Villasenor tells us that he hired a polling firm to do the data collection. Which one? He does not say. He reports that the data collected were weighted with respect to gender. Indeed, his sample was about 70 percent female (N=1,040), whereas we are told that women represent 57 percent of the college population. Note, however, that Villasenor’s target population is college students who are American citizens, not college students in general. Since (I am guessing) students in the U.S. are overwhelmingly American citizens, the difference is probably not critical. Let us say that this poll does not meet the minimum standards of disclosure recommended by AAPOR or by the National Council on Public Polls. Is that it? Just about. He does tell his readers one more thing—something that has been described, after the fact, as a “caveat.” I quote: “To the extent that the demographics of the survey respondents (after weighting for gender) are probabilistically representative of the broader U.S. college undergraduate population, it is possible to estimate the margin of error…” What does that mean? It is a roundabout way of telling us that his sample is not a probability sample; but if you’d like to assume that it is, you can go ahead and compute a margin of sampling (an important word he omits) error. Is this assumption warranted? The author gives us no evidence to that effect.
Bad poll v. good poll
This is the crux of what the experts quoted by The Guardian do not like about the poll. For probability fundamentalists, if a poll is not based on a probability (i.e. random) sample, the findings are not worth the paper they are printed on: the results cannot be generalized to the population the poll purports to be studying. A probability survey is one in which each element in the target population (e.g. U.S. college students who are American citizens) has a known probability greater than zero of being selected into the sample. This is what allows pollsters to make statements about the population of interest based on the sample. From the little we are told about this poll, the sample is most likely composed of self-selected college students who are American citizens. Self-selection contravenes the classical statistical theory of random sampling—think of letters sent by constituents to a member of Congress about some issue. As such, calculating a margin of sampling error is an exercise in futility.
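To make the stakes concrete: under simple random sampling, the textbook 95 percent margin of sampling error is ±1.96·√(p(1−p)/n), which is largest at p = 0.5. The sketch below is my own illustration of that standard formula, not a reconstruction of Villasenor’s calculation, and it is meaningful only if the probability-sampling assumption holds.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Worst-case 95% margin of sampling error for a simple random sample.

    Valid ONLY under probability sampling; for a self-selected sample
    the quantity has no defined meaning.
    """
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(1500)
print(f"n=1,500: +/- {moe * 100:.1f} percentage points")  # about +/- 2.5
```

For n = 1,500 this comes to about ±2.5 percentage points, which is precisely why reporting such a figure for a self-selected sample strikes the critics as window dressing.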
The Guardian article ends with a positive counter-example, i.e. a “good” poll. It mentions a 2016 Gallup survey of 3,000 students asking similar questions but coming up with very different answers. The story appeared to be attributing the dissimilarities to differences in methodologies. The newspaper states that the students “had been selected in a carefully randomized process from a nationally representative group of colleges.” (More about that later.)
But some came to the rescue of the Brookings Institution’s poll. One of them was the Washington Post columnist (09/28/2017) whose first piece commented on the results under the “chilling story” headline. In her second column, entitled “Free speech and ‘good’ vs. ‘bad’ polls,” she referred to the rebukes of the poll as “disingenuous, confused or both.” She added: they “don’t render a poll ‘junk science’.” She points out correctly that a lot of “major surveys are now conducted online and use ‘non-probability’ samples.”
We actually learn more from her about the methodology Villasenor used for his survey than we did in the original report on the poll! According to this Post column, the professor contacted the Rand Survey Research Group. They advised him, we are told, on sampling methods and put him in touch with a commercial polling house (Opinion Access Corporation—OAC) that conducts online polls. This firm had in its database members of the population of interest: “college students (subsequently narrowed to college students at four-year schools only).” One wonders why the good professor couldn’t have told us that in the first place. (Note also that the Post columnist says nothing about the citizenship criterion.) Although we know a little bit more about the methodology of the Brookings’ poll, many questions remain. For instance, how does OAC recruit its panel? How many “eligible” students did it contact for the poll? Again, we are in the dark.
Another defender of the Brookings poll is a blogger for the website “reason.com” (“Is That Poll That Found College Students Don’t Value Free Speech Really ‘Junk Science’? Not So Fast”—9/30/2017). In her view, just because the poll is based on an opt-in panel is no reason to “disregard the findings.” Like her Post colleague, she argues that many reputable firms rely on this methodology. She writes: “These days, lots of well-respected outfits are doing sophisticated work outside the confines of traditional probability polls.” And she adds: “it’s a stretch to claim that any poll that uses an opt-in panel is necessarily junk”.
Controversy over Sampling
Ever since their first appearance in the mid-1990s, Internet polls have been controversial. I think it is fair to say, though, that the controversy is dying down; the community of sample survey practitioners has had to face the facts of life, grudgingly for some: Internet polls are widely used and are here to stay—at least for the time being.
The dispute around the Brookings poll is just the latest installment in the long-running debate over what constitutes legitimate knowledge when the source of that knowledge is a sample survey. (For an analysis of another recent flare-up see “Using online panels for election polls”.) For decades in America, the hegemonic creed that held sway over the community of sample survey practitioners was probability sampling. It was believed to be the only way to obtain reliable and valid (all other things being equal) knowledge from a poll or survey. If one happened not to practice the credo, one did so surreptitiously, being very careful not to advertise this breach—one did not wish to be labeled a deviant. Probability sampling was the “gold standard” and still is.
Nowadays, this ideal is unattainable for most researchers (or so they say), especially those in the commercial sector. With the secular decline of response rates, probability sampling is in jeopardy. As a result, non-probability sampling advocates have been emboldened. Not too long ago questioning the orthodoxy of probability sampling would simply have been inconceivable: anybody who had the audacity to suggest that there was merit in non-probability samples would have received a severe tongue-lashing from the guardians of the faith. But with the rise of the Internet and response rates in the single digits, the non-probability school feels it can attack the legitimacy of probability-based sample surveys with impunity. They argue that polls with such low response rates cannot claim to be probability-based even though the original mechanism used to select the elements of the sample was random. The reason is that high rates of non-response destroy the random (probability) quality of the sampling process. What practitioners end up with is a self-selected sample. In addition, and more importantly, it is often assumed that a high rate of non-response is associated with large non-response bias. The latter means that there is a wide gap between those who answered the polls and those who did not on the issue of interest—this is what happened, as far as anyone can tell, in the infamous Literary Digest presidential poll of 1936: respondents favored the Republican candidate (Alf Landon) and non-respondents supported FDR, the incumbent president and ultimate winner. Of course, non-response bias can only be determined empirically, but as consumers of polls, we would be wise to take the results of a low-response poll with a large grain of salt, unless the polling house tells us that it has taken measures (e.g. non-response follow-up) to assess how different non-respondents are from respondents.
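The mechanics of non-response bias can be shown with a few lines of arithmetic. The numbers below are invented for illustration, loosely echoing the Literary Digest fiasco (its roughly 24 percent return rate is documented; the candidate shares here are hypothetical): the poll sees only the respondents, while the population value is a mix of respondents and non-respondents.

```python
# Hypothetical illustration of non-response bias. The shares are invented
# for the example, loosely echoing the 1936 Literary Digest episode.
response_rate = 0.24        # ballots returned / ballots mailed (documented order of magnitude)
p_respondents = 0.43        # share of respondents backing the incumbent (hypothetical)
p_nonrespondents = 0.68     # share of non-respondents backing the incumbent (hypothetical)

# What is true in the population: a mix weighted by response status.
true_support = (response_rate * p_respondents
                + (1 - response_rate) * p_nonrespondents)
# What the poll actually reports: respondents only.
poll_estimate = p_respondents

print(f"true support:  {true_support:.0%}")   # 62%
print(f"poll estimate: {poll_estimate:.0%}")  # 43%
```

With a wide respondent/non-respondent gap and a low response rate, the poll misses the true value by nearly twenty points, no matter how many ballots came back.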
I mentioned earlier that the Guardian story gave as an example of a “carefully randomized” poll, a 2016 survey of U.S. college students conducted by the Gallup organization (Gallup, hereafter). Gallup claimed that the poll results “are based on telephone interviews with a random sample of 3,072 U.S. college students, aged 18 to 24, who are currently enrolled as full-time students at four-year colleges” (p. 32). How did it reach this final sample? It started by selecting a random sample of 240 four-year colleges. All of these colleges were contacted but only 32 agreed to participate in the survey—that’s an eighty-seven percent refusal rate. Does that make them “a nationally representative group of colleges” as the Guardian states? From these colleges, Gallup selected a random sample of 54,806 students to whom emails were sent asking them to fill out a short Internet survey, which would determine their eligibility for a telephone interview. Thirteen percent (6,928) completed the web survey, of which ninety-eight percent (6,814) were eligible and provided a telephone number. Gallup reports that the response rate to the telephone survey was 49 percent. Finally, it states that the “combined response rate for the Web recruit and telephone surveys was 6%” (.13 × .49 × 100). Of course, this response rate does not include the fact that only 32 colleges out of the originally selected 240 decided to participate in the study. But this is a moot point given the already tiny response rate. (Note, however, how much more information we are provided, regarding how this poll was conducted, compared to the Brookings poll.) So is Gallup justified in calling this sample “random”?
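Gallup’s own figures can be chained together to reproduce the reported 6 percent combined rate, and to show what happens if the college-level participation (which the combined rate omits) is folded in:

```python
# All inputs are the figures Gallup reports for its 2016 college survey.
web_completion = 6928 / 54806      # students who completed the web screener
phone_response = 0.49              # reported telephone-survey response rate

combined = web_completion * phone_response
print(f"combined response rate: {combined:.0%}")   # 6%

# Folding in the college stage (32 of the 240 selected colleges took part):
college_participation = 32 / 240
overall = college_participation * combined
print(f"with college stage:     {overall:.1%}")    # under 1%
```

Including the college stage drops the effective rate to well under one percent, which is the author’s point about the moot refusal rate.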
This Gallup poll is exactly the type that online pollsters would put forward as an example of a survey that is probability in name only, but in reality is simply based on a self-selected sample—just like those online polls that use an opt-in panel to conduct their research. The online samplers’ point of view is presented in the reason.com piece. The author quotes the head of the election polling unit of the online company SurveyMonkey, who is reported as saying: “We believe we can offer something of similar quality, at a very different price point [compared to traditional probability sampling], and with more speed.” [Emphasis added.] The same story mentions another online polling house, YouGov, described as “best in class when it comes to this type of online panel research.” Indeed, a different institute, the Cato Institute, used the services of this company to conduct a poll on topics similar to those studied by Brookings and Gallup. And like Villasenor’s survey, the Cato Institute methodology page (74) reports a “margin of error”. AAPOR, in a 2013 report on non-probability sampling, stated: “margin of sampling error in surveys has an accepted meaning and that this measure is not appropriate for non-probability samples” (p. 82). It added: “We believe that users of non-probability samples should be encouraged to report measures of the precision of their estimates, but suggest that, to avoid confusion, the set of terms be distinct from those currently used in probability sample surveys” (p. 82). Well, so much for that. I suppose that non-probability samplers would argue that if Gallup, with a response rate of less than 6 percent, can give a margin of sampling error, why can’t they?
Some history and some…sociology
It is not the first time in the history of modern polling that the community of sample survey practitioners has been divided over a methodological issue, specifically over sampling, and more precisely, over the worth of probability versus non-probability sampling. In his landmark 1934 paper, Jerzy Neyman (1894-1981) demonstrated the beneficial value of probability sampling and condemned non-probability sampling as inadequate. Up until then, both had been deemed legitimate. The legitimacy of probability sampling was derived from the fact that it rested on a solid mathematical statistics foundation. Statisticians with the U.S. federal government were quick to adopt and expand on Neyman’s ideas. According to historians Duncan and Shelton: “By about the time the United States entered World War II, probability sampling had taken root in the Federal Government” (p. 323). Academic research centers followed suit—sometime later. Not so for the commercial pollsters (Crossley, Gallup, and Roper). From the start (1935), and for many years thereafter, they relied on the non-probability technique of quota samples. In fact, it took almost a decade for the rift over which of the two methodologies was “better” to come out into the open. During that time the pollsters were never questioned about their sampling preference. In December 1944 (NY Times, 12/30/1944, p. 6) one of the first salvos directed against quota sampling came from a technical committee appointed by Congress to look into the methodology of polls. Referring back to the recent presidential election polls, the committee stated: “The quota-sampling method used, and on which principal dependence was placed, does not provide insurance that the sample drawn is a completely representative cross-section of the population eligible to vote, even with an adequate size of sample. In general, the major defects of the quota-sample method lie, first, in the method of fixing quotas, and, second, in the method of selection of respondents to interview” (p. 1294, Hearings before the Committee to Investigate Campaign Expenditures, House of Representatives, 78th Congress, 2nd Session, on H. Res. 551, Part 12, Thursday, December 28, 1944). The line of demarcation was clearly drawn.
Despite this warning, the pollsters persisted in their “misguided” ways. Between 1944 and 1948, the debate between the two camps heated up. Probability samplers were busy attacking the legitimacy of quotas and promoting their brand of sampling as “the best that statistical science has to offer,” (p. 26) as statisticians Philip Hauser and Morris Hansen put it. At their most strident, the probability samplers characterized the pollsters’ methodology as “rule of thumb” and their polls as “more like straw votes than scientific instruments” (p. 557). Although faced with the ascendancy of probability samplers, pollsters and their allies fought back and refused to be stripped of their legitimacy: “Current attempts of some academicians,” Hadley Cantril countered, “to set up themselves and their work as ‘scientific’ while labeling Crossley, Gallup and Roper as rule-of-thumb operators is not, in my judgment, either justified or statesmanlike” (p. 23). In addition to this “condemning the condemners” line of attack, quota samplers used two other approaches to question the putative superiority of the probability norm: 1) there was no empirical proof that it was better than quotas; 2) it was far too expensive and too slow to implement for the pollsters’ purposes. The latter justification was a way of arguing that their work was done in a different setting from the one in which federal workers and academicians, who made up most of the probability samplers, practiced. In other words, what the pollsters were implying was that the advancing probability norm did not apply to their case. Their sampling was indeed “scientific” (see Gallup’s testimony at the 1944 Hearings, pp. 1238, 1253) despite the fact that it was performed within commercial constraints—which included providing their subscribers (the press) with timely results. They did not want to convey the impression that business considerations took priority over scientific ones, but rather that they had to deploy their “science” within a much more demanding environment than did other sample survey practitioners.
For the 1948 presidential election, the pollsters essentially used the same approach they had relied on in previous elections, but this time with disastrous consequences: they all predicted, wrongly, that the incumbent president, Harry Truman, would lose to the Republican challenger, Thomas Dewey. The righteous indignation expressed by some probability samplers at the 1948 polling failure was not simply a result of what they saw as a norm violation (using quota sampling instead of probability sampling); they also feared it would affect the image and status of social science in general, and the sample survey in particular. Fortunately, the authoritative Social Science Research Council (SSRC) stepped in quickly to defuse the apparent crisis, and to prevent the battle over sampling from taking center stage and degrading the image of social science at a time when many natural scientists and politicians saw it as mere “political ideology”—not science (see forthcoming). Of course, the SSRC report on the polling failure chastised the pollsters for not using more up-to-date sampling methods (p. 601), but it allocated much more space to other issues that affected the forecast (e.g., last-minute shift, identifying likely voters), effectively diluting the conflict between the two schools of sampling. In fact, some observers felt that the SSRC report showed “a tendency to let down the polling organizations easily” (p. 134). Be that as it may, it is clear that after the 1948 failure, the probability norm had gained a position of dominance. Although it would take years for it to become the pollsters’ modus operandi, as early as 1949 they felt they had to pay their respects to probability sampling. For instance, Gallup announced at the annual conference of AAPOR that year that his organization was “designing a national probability sample” (pp. 765-6).
Ever since the 1930s, non-probability sampling (whatever its form) has always been second best—at least in America. In years past that methodology was labeled “rule-of-thumb” and even “primitive”; today some call it “junk science” and “malpractice”. But in reality, for decades, non-probability sampling has coexisted side by side with its counterpart: probability sampling. The latter had been elevated, shortly after Neyman’s paper, to the status of the only legitimate norm of practice and was endowed with much prestige. So non-probability sampling persisted in a state of what sociologists call a “patterned evasion of norms”: the discredited practice is allowed to thrive because most turn a blind eye as long as the violation is not too flagrant, i.e. if it does not call attention to itself. When it does, the dominant norm must be reaffirmed so that all are reminded of what is legitimate and what is not. This is what happened in 1948, although in a subdued way, with the SSRC report, and this is how I would interpret the reception the Hite Report received in 1988 from the officialdom of sample survey practitioners. At the AAPOR annual conference, a panel, including Shere Hite, was convened, which essentially condemned her methodology. (For an informative and entertaining description of this drama see chapter 1 of David W. Moore’s The Superpollsters, entitled “The Sins of Shere Hite.”) Her research was very controversial and got a lot of publicity. Her samples were self-selected, i.e. non-probability, but she had the audacity to claim that hers was a “scientific” study. She was told otherwise.
We can see the similarity between these historical examples and the reported reactions of Zukin and his like-minded colleagues to the Brookings poll. Perhaps theirs is a voice crying in the wilderness, an anachronistic one. It might have carried some weight back in 1988, but today? The non-probability samplers would argue that there is no such thing as a true probability sample. Only a few, mostly in government, have the luxury of obtaining one. Moreover, the online poll practitioners would say that they have at their disposal an array of sophisticated statistical tools that allow them to adjust their non-probability samples in such a way that their performance is just as good (or as bad) as that of a so-called probability sample. To back up their statement they could point to empirical studies and election results. With the rise of the Internet and the decline of response rates, non-probability sampling’s worth and status have improved. For practitioners, mostly those in the commercial sector, and for their academic acolytes, probability sampling, given the current environment, fails on two counts: cost and speed. (The quota samplers of the 40s said the same.) We must not forget that, for most sample surveyors, polling is a commercial enterprise: providing their clients with timely information at a “price point” that makes business sense is a stronger imperative than trying to fulfill a norm that is, for all intents and purposes, unachievable.
So, after all of this, who are we to believe? The Brookings poll that’s been characterized as “junk science” by some, while others tell us not to “disregard [its] findings”? Or the Gallup poll that has been described as “carefully randomized”? Or the Cato Institute poll that was conducted by a polling house that has been called “best in class” when it comes to non-probability samples?
Most poll consumers, I venture to guess, perhaps wrongly, are unaware of the lingering controversy about sampling. We read stories that purport to dispense knowledge because, it is assumed, they are based on solid evidence. Take the “chilling study” column in the Washington Post. The writer reports on the Brookings study and develops her argument without once discussing the methodology of the poll. Why should I, as her reader, question its validity? If it were no good, she wouldn’t be writing about it. I might disagree with her conclusions, but polling methodology does not even cross my mind. If I happen to stumble upon the “junk science” Guardian piece, I might start having some doubts about the Post column. But then again, if, by chance, I read the reason.com article, my doubts might be dispelled. Or, I might throw my hands up in the air and decide that I can’t believe any poll!
I would think that pollsters might be concerned about the effect this lack of consensus could have on their field’s image (prestige)—especially when it is bandied about in the open. It does not promote confidence in polls as valid purveyors of knowledge. If probability fundamentalists insist on taking a conflictual approach towards non-probability pollsters, as exemplified earlier in the Guardian piece, they are likely to cause confusion, or worse, among the attentive public. Or they can compromise and accept that there are other, possibly legitimate, ways of conducting a poll. Yet they can still derive satisfaction, if only symbolic, from the certain knowledge that theirs is the “gold standard”—however elusive it may be. They would do well to remember that we live in an age in which the practice of probability sampling is highly compromised as a result of low response rates. Ever since the 1940s, probability fundamentalists have acted as if the probability norm had been codified as law. But it never has been. As far as I know, neither AAPOR nor the American Statistical Association states in its ethical guidelines of professional practice that probability sampling is the norm to follow. Of course, professionals in any field of activity will react if some egregious norm violation has taken place. Did the Brookings poll rise to that level?
In March of 2016, the American Statistical Association (ASA), “the world’s largest professional association of statisticians” (5),1 took an unprecedented step: it issued a statement (“ASA Statement on Statistical Significance and P-values”), published online under the auspices of one of its publications, The American Statistician (TAS), on the “proper use and interpretation” (7) of a certain statistical measure – the “p-value”. For those of you who escaped, in high school and/or college, the blissful world of statistics, “p-value” is short for “probability value”, a numerical index practitioners rely on to reach a conclusion based on the data they’re analyzing. Generally speaking, we use probabilities as a measure of uncertainty: the probability of coming up Heads on the flip of a fair coin is said to be 0.5 or 50%; if you decide to play the lotto (SuperLotto Plus) in my home state of California, there is roughly one chance in 40 million that your number will come up; in other words, you’re almost certain to lose – but, still, there is a chance, however infinitesimal, that you will win. (NOTE: This is not an endorsement of games of chance.)
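The lottery figure is easy to verify. Assuming the SuperLotto Plus format in use at the time (five numbers drawn from 1 to 47 plus one Mega number from 1 to 27; treat that format as an assumption of this sketch), the jackpot odds are a one-line counting exercise:

```python
import math

# Assumed SuperLotto Plus format: choose 5 of 47, plus 1 Mega number of 27.
jackpot_combinations = math.comb(47, 5) * 27
print(f"1 in {jackpot_combinations:,}")  # 1 in 41,416,353

# For comparison, the fair-coin example from the text:
p_heads = 1 / 2
```

That works out to roughly one chance in 41 million, consistent with the “one chance in 40 million” figure above.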
The pronouncement, whose target audience are “researchers, practitioners and science writers who are not primarily statisticians” (3), was unheard of because never before in its long history (the association was founded in 1839) had the ASA told practitioners how to use any statistical technique or methodology. (In the interest of full disclosure I should let the reader know that I am a member of the ASA.)
How did this all come about? An introduction to the ASA statement, written by the association’s executive director, Ron Wasserstein, and the editor of TAS, Nicole Lazar, provides some background. Its purpose is to explain what led the association’s board of directors to make the statement and to describe the process that led to its publication. Wasserstein and Lazar identified two areas of concern that “stimulated” (1) the board’s response: i) a recent, “highly visible” (1), and ongoing discussion within scientific journals on the questionable use of statistical methods in the process of scientific discovery; ii) the reproducibility and replicability “crisis” (2) in science.
Regarding the first issue, Wasserstein and Lazar quote several sources that talk about statistics and its “flimsy foundation” (1), its “numerous deep flaws” (1), etc., on one side; and on the other, the defenders of statistical methods who claim that the problem is not statistics, but that a lot of data analysis is done by people who are not “properly trained [my emphasis] to perform” (1) it. The second area that spurred the ASA board to action is described thus: “The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions.” (2) A failure of reproducibility or replicability means that nobody else can come up with the “scientific conclusions” the original researchers presented. For example, some readers may remember the “cold fusion” episode, back in 1989. Here was a case where two University of Utah researchers made a claim to scientific discovery, but no one else in their field was able to replicate their findings.2 The “reproducibility and replicability crisis” has been brewing for a few years, but it seems to have come to a head in August 2015 with an article in the journal Science which found that although 97% of the original studies scrutinized (all in the field of psychology) reported “statistically significant” results, only 36% of the replications did (p. 944).3 In the ASA’s view, this creates “much confusion and even doubt about the validity of science” (2), and the “misunderstanding or misuse of statistical inference” (2) is partly responsible for this situation.
The authors tell us that “the Board envisioned that the ASA statement on p-values and statistical significance would shed light on an aspect of our field that is too often misunderstood and misused in the broader research community, and, in the process, provide the community a service.” (3) At the Board’s behest, Wasserstein assembled a “group of experts representing a wide variety of points of view” (3) to complete this task. Wasserstein and Lazar report that the “statement development process was lengthier and more controversial than anticipated.” (4) They also assure us that “nothing in the ASA statement is new. Statisticians and others have been sounding the alarm about these matters for decades, to little avail.” (5) They expressed the hope that the statement “would open a fresh discussion and draw renewed and vigorous attention to changing the practice of science with regards to the use of statistical inference.” (5)
The ASA’s message
Given this array of dire circumstances, e.g., some researchers throwing the “p-value” into the “dustbin of history” (more on that later), one could expect a statement full of vim and vigor, breathing fire and brimstone, and mounting a vigorous defense of one of the cornerstones of the “science of statistics”. But no. Instead, we are regaled with a pronouncement couched in very mild-language whose ambition is to clarify “several widely agreed upon principles underlying the proper use and interpretation of the p-value” (6-7). So, for example, it tells us that “the p-value can be a useful statistical measure”: hardly a ringing endorsement, but neither is it a recommendation to discard it. The ASA statement is divided into five sections: Introduction; What is a p-value?; Principles; Other Approaches; and Conclusion. When Wasserstein and Lazar in their Introduction (1-6) state that “[n]othing in the ASA statement is new” (5), they are not kidding. The contents of the Principles section of the statement, reads, for the most part, like something students taking their first introductory course in statistics would find in a widely relied upon textbook like David S. Moore’s The Basic Practice of Statistics (New York: W.H. Freeman, 1995; now in its seventh edition): for example, failing to reject the null hypothesis (H0) does not mean you have proved it to be true; rejecting the H0 does not mean it is false or that your research hypothesis (symbolized as H1) is true; statistical significance (i.e. rejecting the H0 and concluding in favor of H1) is not necessarily the same as substantive or clinical importance; etc. The beginner in statistics is, of course, entering a new cultural realm; like any other practice such as learning to be a chef, a crane operator, an automotive technician, or a brain surgeon, it is a process of socialization, i.e. it is a process that inculcates the rules and behavior that are considered appropriate within that culture. 
Thus, all these prescriptions are rituals used to induct you into the scientific culture of inferential statistics: the student learns the norms that define the “proper use and interpretation of the p-value” (8).
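The first of those textbook principles – failing to reject H0 does not prove it true – is easy to demonstrate with a short simulation. Here is a minimal sketch in Python, with invented numbers: a coin whose true heads probability is 0.52, so that H0: p = 0.5 is in fact false.

```python
import math
import random

def two_sided_p(heads, n, p0=0.5):
    """Two-sided p-value for H0: heads probability = p0 (normal approximation)."""
    se = math.sqrt(p0 * (1 - p0) / n)
    z = abs(heads / n - p0) / se
    return math.erfc(z / math.sqrt(2))

random.seed(42)
n, true_p, runs = 100, 0.52, 2000
rejections = 0
for _ in range(runs):
    heads = sum(random.random() < true_p for _ in range(n))
    if two_sided_p(heads, n) < 0.05:
        rejections += 1

power = rejections / runs
# Most runs yield p > 0.05 even though H0 is false: a nonsignificant
# result here reflects low power, not the truth of the null hypothesis.
print(f"share of runs rejecting H0: {power:.3f}")
```

With a true effect this small and only 100 tosses, the test rejects in well under half the runs; reading those nonsignificant p-values as proof that the coin is fair would be exactly the mistake the Principles section warns against.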
The statement claims that “misuses and misconceptions concerning p-values” are “prevalent” (11) in “the broader research community” (3) among those “who are not primarily statisticians”. It also states that “some statisticians prefer to supplement or even replace p-values with other approaches” (11), thereby encouraging “the broader research community” to do the same. One of the “other approaches” mentioned is “confidence intervals” (11). I would wager that there are just as many “misuses and misconceptions” (perhaps more “misconception” than “misuse”) of “confidence intervals” in “the broader research community” as there are concerning p-values. To take just one example, consider a scholarly book first published in 2012, a compendium of articles on a specific topic. (The book will remain nameless, as will the author, and the quote has been modified to ensure anonymity. I do not wish to embarrass anybody; after all, errare humanum est, and I have done plenty of that myself, so I am hardly in a position to cast the first stone.) The article in question, suitably altered, states: “Swedish public approval for paternity leave is 67% ±3 percentage points. (…) [I]n repeated samples, we would expect the true level of public support for paternity leave to fall between 64 percent and 70 percent in 95 out of 100 samples.” Clearly, this illustrates a misinterpretation of the concept of “confidence interval”, but the editors of the book did not catch it in time. However, the second edition, published four years later, corrects the mistake: “if the survey were repeated many times, 95 percent of the samples of this size would be expected to produce a margin of error that captures the true percentage of Swedes supporting paternity leave.” Therefore, advocating the use of confidence intervals in lieu of p-values does not seem to be much of a solution.
Obviously, supplementing the p-value with a confidence interval would not satisfy, one would think, those who advocate its abandonment.
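The corrected wording in that second edition can itself be checked by simulation: the “95 percent” attaches to the interval-generating procedure, not to a fixed true value wandering in and out of one interval. A minimal sketch, with invented numbers and the “true” level of support fixed at 67%:

```python
import math
import random

random.seed(0)
true_p, n, runs = 0.67, 1000, 2000
covered = 0
for _ in range(runs):
    # Draw a fresh sample of n respondents; the true value never moves.
    hits = sum(random.random() < true_p for _ in range(n))
    phat = hits / n
    half = 1.96 * math.sqrt(phat * (1 - phat) / n)  # 95% margin of error
    if phat - half <= true_p <= phat + half:
        covered += 1

coverage = covered / runs
# Roughly 95% of the intervals capture the fixed true proportion --
# which is what the confidence level actually promises.
print(f"coverage: {coverage:.3f}")
```

It is the intervals that vary from sample to sample; about 95 out of 100 of them will trap the one fixed truth.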
The ASA statement is by no means condoning the banishment of the p-value – nor is the ASA likely to do so in the future. This methodology has been with us for nearly a century and has been used, correctly or not, in multitudes of studies in a variety of disciplines that all harbor the science label. It is an elaborate scheme that has been the centerpiece of statistical practice, and it is based on the work of heavy hitters like Ronald Fisher (1890-1962), Jerzy Neyman (1894-1981) and Egon Pearson (1895-1980).
Other “other approaches” (11) mentioned in the statement: Bayesian statistics. This is a methodology that lost out to what is often referred to as the frequentist school (Fisher, and Neyman-Pearson) back in the 1930s and 40s. These are the two major schools of inferential statistics. “Lost out” does not mean the Bayesian approach is without its aficionados: in fact, it has been used by a substantial minority in the statistical community starting in the 1950s. But it has always been treated as a second-class citizen in the world of statistics: most introductory textbooks and beyond teach the frequentist creed (null hypothesis testing and the p-value), and (just about) all the commercial software packages are programmed along that same doctrine. That is why you are more likely to be assigned as an introductory textbook the one mentioned earlier by David S. Moore, or one by Mario Triola (Elementary Statistics), rather than one by Donald A. Berry (Statistics: A Bayesian Perspective). But Bayesianism is not without controversy either…
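For the reader who has never seen the Bayesian alternative in action, here is a minimal, self-contained sketch (hypothetical data, flat prior): instead of testing a null hypothesis and reporting a p-value, the Bayesian updates a prior distribution into a posterior for the unknown proportion.

```python
# Conjugate Beta-Binomial update: prior Beta(a, b), then observe
# k successes in n trials; the posterior is Beta(a + k, b + n - k).
def beta_binomial_posterior(k, n, a=1.0, b=1.0):
    """Return the (a, b) parameters of the posterior Beta distribution."""
    return a + k, b + (n - k)

# Hypothetical data: 60 successes in 100 trials, flat Beta(1, 1) prior.
a_post, b_post = beta_binomial_posterior(60, 100)
posterior_mean = a_post / (a_post + b_post)
print(f"posterior: Beta({a_post:.0f}, {b_post:.0f}), mean {posterior_mean:.3f}")
```

The output is a full distribution for the parameter, from which one can read off a posterior mean (here about 0.598) or a credible interval – no null hypothesis required.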
Statistical testing by means of the null hypothesis (NHST, hereafter) and its resulting p-value is one of the cornerstones of knowledge production in many sciences. Back in 2001, a prominent statistician could write: “hypothesis testing has become the most widely used statistical tool in scientific research” (David Salsburg, The Lady Tasting Tea, p.114). Controversy about this approach is by no means new – it is almost as old as the methodology itself, and it has been going on ever since – but that hasn’t stopped the majority of statistics users from relying on it (“that’s what we’re taught; so we do what we’re taught”). So why did the ASA feel that it had to come out with a statement at this time in the history of its discipline? In other words, why this sudden urge on the part of the ASA to intervene on a topic that has been contentious for decades? I’d like to suggest a few items that might help make sense of the ASA’s action, and the tone and contents of its statement. I’m sure there are many more, but these come to mind immediately and seem to me to be important.
First, the dominance of the frequentist school is being challenged; not so much by means of debates between the two camps (those have been going on for years), but by a sort of critical-mass effect favoring the Bayesian school of thought: i.e. an increasing number of practitioners are adopting that approach. It seems to me that the frequentist creed has lost the hegemonic position it occupied for so long in the field of statistics. As David Salsburg states in his very informative history of statistics in the 20th century: “By the end of the twentieth century, [Bayesian statistics] had reached such a level of acceptability that over half the articles that appear in journals like Annals of Statistics and Biometrika now make use of Bayesian methods” (op. cit., 129-130).
The message from the ASA statement is: there is more to the practice of statistics than NHST. This is reflected in the composition of the panel of experts put together by the ASA. At first glance, and as best as I can judge, it appears to me that there is a good mix of different schools of statistical practice, including Bayesians and frequentists, which may also explain the mild language I referred to earlier. The “wide variety of points of view” (3) represented by the group of experts ensured that the ASA statement would be reflective of that diversity, thus requiring compromise (“The statement development process was lengthier and more controversial than anticipated” (4)). As the introduction by Wasserstein and Lazar reports, the statement went through “multiple drafts” (4). The ASA statement clearly shows that nobody got the upper hand: there’s NHST of course, but there are also “other approaches” just as important.
I have mentioned earlier that the contents of the Principles section of the ASA statement reads like something you would find in an introductory statistics textbook. Did the ASA believe that the reaffirmation of these principles would put an end to the “misunderstanding or misuse of statistical inference” (2); that “the proper use and interpretation of the p-value” (8) would blossom as a result? After decades of the inculcation of thousands upon thousands of students in a variety of scientific disciplines into the ways of NHST, the misuse and misinterpretation of the p-value persist. It is not the reiteration of rules that folks have been taught in their first class of statistics (and repeated thereafter) that is going to change that situation. So what’s the point? I would say that the ASA statement is primarily a symbolic act. Of course, misusers and misinterpreters of statistical inference are not going to see the light suddenly after reading the Principles section of the statement. What the ASA is doing is asserting its jurisdiction over the field of statistics. It is telling those “who are not primarily statisticians” (3) that it alone has the authority to determine what qualifies as the “proper” use of statistical tools. It is not for non-statisticians to decide what in statistics should or should not be discarded. As the Introduction to the ASA statement says: “Though there was disagreement on exactly what the statement should say, there was high agreement that the ASA should be speaking out about these matters” (5). In other words, despite the diversity and discord within the ASA, it stands united when it perceives that the discipline itself is being challenged. Certainly one of the functions of a professional organization is to uphold and maintain the good reputation of its area of activity.
Although the ASA statement assigns responsibility for the “crisis”, it conveniently eschews the issue of ultimate cause. It piously and wishfully states that “the scientific community could benefit from a formal statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value” (7-8). In the Introduction to the statement, the authors bemoan that “our field (…) is too often misunderstood and misused in the broader research community” (3). Thus the blame is placed squarely, although indirectly, on those “who are not primarily statisticians” (3) – who are, I would submit, the vast majority of statistics users. I would also venture to guess that the advent of the personal computer and the development of off-the-shelf commercial statistical software have made access to statistical tools relatively easy and, as a result, have increased substantially the community of those “who are not primarily statisticians” – thereby making misuse and misinterpretation that much more likely. Statistics is a unique field in that its products are used mostly by non-statisticians. Anytime researchers, whatever their discipline, have collected quantitative data they are most likely to make use of the tools supplied by statistics. As Neyman said, back in 1955, statistics is the “servant of all sciences.” In effect, the ASA statement is deflecting responsibility away from its own discipline and telling non-statisticians: “There is nothing wrong with our tools. We know how to use them properly. It is you, non-statisticians, who misunderstand and misuse our products.”
Let’s look more closely at the way the ASA statement frames this controversy. To counter those in the “broader scientific community” (3) who question the very usefulness of statistics (“it’s got more flaws than I’ve had hot dinners” – a Rumpolian version of Tom Siegfried’s assertion for the reader who is not Facebook-savvy)4, it reiterates the key role statistics has in the production of scientific knowledge; but, more importantly, it identifies the individual user, who is not primarily a statistician, as a causal agent in this crisis. What are the implications? First, as already mentioned, it locates the problems outside the discipline of statistics: a) there is nothing wrong with the corpus of statistical knowledge; b) statisticians know how to use the tools of their trade, and know their limits. Some who are not primarily statisticians do not. Perhaps another delicately hidden message is that “the broader research community” (3) should rely more heavily on statisticians instead of trying to go it alone and making a mess of it.
The ASA statement tells us: “Statisticians and others have been sounding the alarm about these matters for decades, to little avail.” (5) So, are these statistics users deviants persisting in their deviant ways? After all, they don’t follow the statistical rules, and as a result give statistics a bad name and impede scientific progress. No. There is a difference between breaching an ethical or moral norm and failing to follow technical prescriptions. Although the latter might give rise to chastisement and charges of incompetence, it does not bring about the indignation, the moral indignation, that the former would. This also explains the tone of the ASA statement: no “fire and brimstone”. Oftentimes, violations of ethical norms call for punishment. For example, if a researcher falsifies his data and is discovered, punitive action is likely to follow: his published paper will be retracted, he might be degraded (loss of an academic title, e.g. “PhD”), he might lose his job, and he might even be sued in a court of law. Not so for the misuse or misinterpretation of a technical rule. It might cause the user some embarrassment, and she might be reprimanded; and if this misuse or misinterpretation happens to be published, it will be corrected, often quietly (i.e. without a corrigendum or erratum), at the first opportunity by a vigilant editor, as in the confidence interval example given earlier. Contrary to the moral violator or deviant, the violator of a technical norm is not seen as somebody who does so willfully. The latter will likely, even gladly, be willing to mend his or her ways; not the former, who will have to be coerced into submission. The ASA has no enforcement power over honest “misuses and misinterpretations”, especially if these are perpetrated by non-members.
Perhaps, an additional implicit message from the ASA statement is that researchers outside the statistical community are not being trained in the proper use of the methodologies provided by the field of statistics. In other words, this has to do with showing the technical norms-violators the correct path: “changing the practice of science with regards to the use of statistical inference.” (5)
One important ingredient for a “problem” to get noticed (or, perhaps more precisely, for an issue to be elevated to the status of “problem” or “crisis”) and, eventually, acted upon, is for it to be covered by the mainstream media.5 The ASA statement says as much when it refers to the “highly visible [my emphasis] discussions over the last few years” (1). What it does not tell us is that until recently the controversy over NHST was confined within specific disciplines like psychology (a big consumer of statistics), sociology, epidemiology, etc., and, lest we forget, statistics. In other words, the disputes regarding that topic were happening largely behind closed doors, so to speak. But in this latest round, the controversy spilled into the general scientific media – it made it into journals that are the flagships of the scientific community: Science and Nature. As long as the NHST controversy was limited to the pages of the American Journal of Epidemiology, the Journal of Experimental Education, Quality & Quantity, or the American Sociologist, not to mention the pages of statistical journals, “nobody” really paid attention to it. But when it got splashed all over the pages of the very prestigious mainstream scientific publications mentioned, it could no longer be ignored. It became incumbent upon the ASA to intervene: it could not sit back and let the discipline be bandied about.
And then, there is data science – that topic in itself deserves a post. Data science is encroaching upon the territory traditionally occupied by the discipline of statistics. Its emergence is a rather recent phenomenon. It took many statisticians by surprise. Isn’t statistics the science of data? Statistics is often defined as the study of the methods for collecting, processing, analyzing, and interpreting quantitative data. Or more succinctly, as David Moore puts it: “Statistics is the science of learning from data.” In 2013, the president of the ASA asked in an editorial in the association’s monthly magazine: “Aren’t we data science?”6 Somehow some folks outside the field of statistics (e.g. computer science) discovered an area of data that they believed statistics could not deal with: big data! As big data and data science were commanding the attention of the Obama administration, major institutions (e.g. National Science Foundation, National Institutes of Health), and the media, traditional statisticians were being bypassed, ignored, and felt they were being left behind. Data science appeared to portray itself as an independent field, not as a specialty within the discipline of statistics, like biostatistics, for example. It seemed to be questioning the authority, and hence the legitimacy, of statistics. This threat to statistics’ customary bailiwick was taken seriously by the ASA leadership, and it responded quickly (although some ASA members would argue “not quick enough”) based on the principle “if you can’t beat them join them.” Well, not quite, it was more like: let’s try to co-opt them into our fold (“bridging the ‘disconnect’”).7 Thus, ASA members suddenly saw the expression “data science” pop up all over the place.
For example, the ASA journal that started publication in 2008 (before the “data science” craze) under the title Statistical Analysis and Data Mining has now been given (as of the beginning of this year) the subtitle “The ASA Data Science Journal”. Our sisters in statistics who attended a “Women in Statistics” conference two years ago will now (in 2016) attend a conference called “Women in Statistics and Data Science”, and may well wonder if there is not redundancy in this new title. Did the women in statistics forget to invite the women in data science back in 2014? Or did the women in statistics assume that “Women in Statistics” was an all-inclusive title (i.e. by definition women in statistics were doing data science, what else?)?
The reader may well ask: what does all this have to do with the price of fish? Or, more appropriately, what does the emergence of data science and big data have to do with the ASA’s statement on p-values? I hope the reader will forgive me for the platitude but everything happens in a context. What I am suggesting is that the brouhaha about data science and big data is part of the wider context in which the field of statistics has had to withstand some serious probing. For example, two economists, back in 2008, wrote an entire book arguing that “Statistical significance is not a scientific test” (p.4).8 In 2015, the editors of one social science journal, Basic and Applied Social Psychology (BASP), let their readers and potential contributors know that both NHST and confidence intervals would be banned from their publication.9 In their 2016 editorial, the editors of the same periodical lament the fact that “many researchers continue to believe that p remains useful for various other reasons” (p.1).10 In other words, and in direct contradiction of the ASA statement, which would come out a month later, the BASP editors don’t believe the p-value to be a “useful statistical measure”. As a result of this worrying environment, the ASA has had to step in and defend its turf. Its statement is a declaration in defense of the integrity of the discipline and an affirmation of the organization’s jurisdiction over matters statistical.
In summary, I see the ASA statement as primarily a symbolic act that came about as a result of the hostility against statistics expressed within the scientific community in the past few years. More precisely, it is a symbolic act under the guise of being instrumental. The instrumental guise part of the statement consists in declaring that by reiterating principles widely accepted in the statistical community “the conduct or interpretation of quantitative science” could be improved. (8) The statement seeks to defend the integrity and the value of the discipline, and reaffirms its central role in the production of scientific knowledge; it reasserts the ASA’s authority over matters statistical; it establishes a clear boundary between statisticians and non-statisticians and asserts that the latter misuse the tools provided by the field of statistics, and, consequently, that they, not statistics, are one of the causes for the replicability and reproducibility crisis.
1 All numbers in parentheses refer to the pages of the document published in TAS: “ASA Statement on P-Values and Statistical Significance” (Ronald L. Wasserstein & Nicole A. Lazar (2016): “The ASA’s statement on p-values: context, process, and purpose”, The American Statistician). It can be accessed freely at the following page: http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108. To learn more about the ASA go to http://www.amstat.org/ASA/about/home.aspx.
2 The “reproducibility” crisis refers to the inability to redo the original data analysis despite being given the data and the analytic procedures followed by the original researchers.
3 Brian Nosek, “Estimating the reproducibility of psychological science”, Science, 28 August 2015, http://science.sciencemag.org/content/349/6251/aac4716.
4 Siegfried, T. (2014), “To make science better, watch out for statistical flaws,” ScienceNews, available at https://www.sciencenews.org/blog/context/make-science-better-watch-out-statistical-flaws.
5 By “mainstream media” I mean, as will become clear shortly, that of the scientific community. Although the controversy was given some press in the mainstream lay media.
6 Marie Davidian, in Amstat News, July 2013, pp. 3-5.
7 “The ASA and Big Data”, Nathaniel Schenker, Marie Davidian, and Robert Rodriguez, in Amstat News, June, 2013, p. 4.
8 Stephen T. Ziliak and Deirdre Nansen McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor, MI: University of Michigan Press, 2008.
9 David Trafimow & Michael Marks (2015) Editorial, Basic and Applied Social Psychology, 37:1, 1-2.
10 David Trafimow & Michael Marks (2016) Editorial, Basic and Applied Social Psychology, 38:1, 1-2.
Although the events I relate in this post took place more than a year ago, the topic of the controversy (the use of opt-in online panels for election polling purposes) is still very much current, especially at this time of electoral contests, when we are likely to see both successes and blunders (recall the recent 2015 UK parliamentary elections).
On July 25, 2014, the New York Times (NYT), and its polling partner CBS News (CBS), made an announcement that “rocked the polling world” (Washington Post, 07/31/14). The news organizations reported that they had retained YouGov to conduct their polls for the upcoming midterm November elections. The remarkable part was that the polling house is one that bases its polls on an Internet panel, meaning folks who volunteer to take a survey from time to time. This represented a departure from NYT/CBS’s traditional approach: in the past they relied on polls that used telephones and random-digit-dialing (RDD) to reach respondents. RDD is held as the “gold standard” when it comes to polling and survey research by telephone because it conforms to the statistical theory of probability (or random) sampling. In contrast, Internet panels do not fit this theory because panel members are not selected at random; they select themselves into the panel.
The NYT indicated that their polls would be based on “an online panel of more than 100,000 respondents nationwide” (NYT, 07/27/14). It attributed its choice to work with YouGov to the fact that “declining response rates may be complicating the ability of telephone polls to capitalize on the advantages of random sampling” (id.). In the same article, it acknowledged both the limitations of working with an online panel (“only the 81 percent of Americans (…) use the Internet”), and YouGov’s less than perfect estimates in the 2012 election (it “underestimated President Obama’s share of the Hispanic vote in 2012”). However, YouGov’s results, it affirmed, “are broadly consistent with previous data on the campaign” (id.). It also cited the serious problem that has plagued telephone sampling: “Only 9 percent of sampled households responded to traditional telephone polls in 2012, down from 21 percent in 2006 and 36 percent in 1997, according to the Pew Research Center” (id.).
In a piece dated 10/05/2014, the NYT stated, perhaps to placate those who criticized their use of the YouGov panel, “The YouGov online surveys are being used to supplement, not replace, the Times’s traditional telephone polls.” It went on to explain that the NYT/CBS “political and social surveys are conducted using random digit dialing probability sampling,” and that the “YouGov data is used for The Upshot election forecasting model in key congressional races and Senate battleground states.”
The event described above was considered “a very big deal in the survey world” by the Pew Research Center’s director of survey research, Scott Keeter 1. Days after the NYT/CBS revelation, the American Association for Public Opinion Research (AAPOR) issued a statement (08/01), signed by its then president, Michael Link, expressing its “concerns” regarding the use of “opt-in Internet” surveys 2. AAPOR is a professional organization that brings together polling and survey research practitioners who work in the private sector, in government, and in academia. (In the interest of full disclosure the reader should know that I am a member of this organization.) As such, one of its responsibilities is to police what is done in the polling industry. AAPOR chastised NYT/CBS: first, for using an Internet panel to report on an electoral contest, because this method of selecting a sample has “little grounding in theory”; second, for a lack of “transparency” regarding how the news organizations arrived at the results they published. As for this last point, the statement read: “While little information about the methodology accompanied the story, a high level overview of the methodology was posted subsequently on the polling vendor’s [i.e. YouGov] website. Unfortunately, due perhaps in part to the novelty of the approach used, many of the details required to honestly assess the methodology remain undisclosed.”
AAPOR rebuked the NYT for abandoning its high standards in matters of polling, and only telling its readers that “the old standards were undergoing review”. It also insisted that “standards need to be in place at all times.” In addition, it criticized the Times for publishing a story (NYT 05/20/2014) that reported on a study whose respondents were recruited by means of ads on Facebook. It warned that “using information from polls which are not conducted with scientific rigor in effect sets a new–lower–standard for the types of information that other news outlets may now seek to report.”
While acknowledging that the “world of polling and opinion research is indeed in the midst of significant change”, in so far as data collection, it warned that “the use of any new methods [should] be conducted within a strong framework of transparency, full disclosure and explicit standards.”
Reactions to the Reaction
Many individuals had their say about AAPOR’s statement. I will concentrate on two of the more notable (and accessible) ones – in my view. Although, predictably, there were two types of reactions to AAPOR’s announcement, for and against, the ones presented here are of the negative variety.
One response (08/05) came from a longtime member of the organization, Reg Baker, on his personal blog: The Survey Geek 3. He has been part of AAPOR’s leadership, having been, among other positions, a member of its executive council. The title of his post says it all: “AAPOR gets it wrong.” What did it get wrong?
He writes: “We have well over a decade of experience showing that with appropriate adjustments these polls are just as reliable as those relying on probability sampling, which also require adjustment.” He adds: “There is a substantial literature stretching back to the 2000 elections showing that with the proper adjustments polls using online panels can be every bit as accurate as those using standard RDD samples.” Presumably Baker’s remark was in response to AAPOR stating: “we are witnessing some of the potential dangers of rushing to embrace new approaches without an adequate understanding of the limits of these nascent methodologies.” So what Baker is saying is that AAPOR is wrong on two counts: online polling is not new, and we do have “an adequate understanding of [its] limits.”
AAPOR is also wrong, Baker believes, when it says that YouGov did not provide sufficient details regarding its methodology. On the contrary, Baker asserts: “The details of YouGov’s methodology have been widely shared, including at AAPOR conferences and in peer-reviewed journals.”
He says he agrees (partially) with AAPOR on one point. The NYT, he opines, did “an exceptionally poor job of describing [the decision to use online panels] and disclosing the details of the methodologies they are now willing to accept and the specific information they will routinely publish about them. Shame on them.” But he faults AAPOR for not providing practitioners with “a full set of standards for reporting on results from online research,” despite the fact that this methodology has been around for nearly two decades and is widely used by researchers around the world. One should note that Baker was chair of a 2010 AAPOR task force on opt-in online panels. One might ask: would that not have been a good opportunity to devise “a full set of standards for reporting on results from online research”? But AAPOR’s Executive Council made it very clear that it was not in the task force’s mandate to do so. Nevertheless, the task force did give one recommendation regarding the reporting of survey results based on the opt-in methodology: that surveys based on opt-in or other self-selected samples should not report a “margin of error” as this is not appropriate for non-probability samples.
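The task force’s recommendation makes sense once one writes down what the conventional margin of error actually computes. A sketch, using the standard formula for a simple random sample (the 1,500-respondent figure here is purely illustrative):

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Conventional 95% margin of error for a sample proportion.
    The formula presupposes a probability sample: it quantifies random
    sampling variability only, and says nothing about the bias
    introduced when respondents select themselves into a panel."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Worst case (p_hat = 0.5) for an illustrative sample of 1,500 respondents:
moe = margin_of_error(0.5, 1500)
print(f"±{100 * moe:.1f} percentage points")
```

Attaching such a ±2.5-point figure to an opt-in sample borrows the authority of probability sampling without its assumptions – which is precisely why the task force advised against reporting it.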
A more strident reaction, at least in its second formulation, came from a Columbia University professor of political science and statistics. At first the good professor, Andrew Gelman is his name, in a blog called “The Monkey Cage” (?), a regular feature in the Washington Post, provided a response, with his colleague David Rothschild of Microsoft, in the best tradition of polite academic dialog 4. The authors’ post, “Modern polling needs innovation, not traditionalism”, was a model of moderation and reasonableness. In it, they gave AAPOR an emphatic reverential bow, calling it “a justly well-respected organization”, and warned their readers that they were not “disinterested observers” since they collaborate with YouGov on a number of projects. They found AAPOR’s statement, although “undoubtedly well-intentioned”, “so disturbing”. Why? Because, the authors believe, AAPOR’s “rigid faith in technology and theories or ‘standards’ determined in the 1930s” is “holding back our understanding of public opinion” and “putting the industry and research at risk of being unprepared for the end of landline phones and other changes to existing ‘standards’.” Like Baker, the authors point out that YouGov’s methodology has been widely discussed in professional meetings and in peer-reviewed journals. In their view, the theory behind YouGov’s methodology is “well-founded” and “based on the general principles of adjusting for known differences between sample and population.” They add: “If anything, people on the cutting edge of research are not hiding anything; on the contrary, we are fighting hard to overcome entrenched methods by being even more diligent and transparent.”
Although not generally known, academics are human too. And, as any other member of the species, they are prone to the occasional bile-spilling. This is what happened in Gelman’s second formulation of his response to the AAPOR missive, posted (08/06) on his personal blog, which rejoices under the name of “Statistical Modeling, Causal Inference, and Social Science”. The article is titled (hold on to your hats) “President of American Association of Buggy-Whip Manufacturers takes a strong stand against internal combustion engine, argues that the so-called ‘automobile’ has ‘little grounding in theory’ and that ‘results can vary widely based on the particular fuel that is used’” 5. The professor directed his ire against Michael Link. He accuses Link of having an “anti-innovation” attitude, of “making things up” to support “his” position, of “talking out of his ass” (no, I’m not making this up; go check for yourself), and of “aggressive methodological conservatism” – apparently, the latter must emit some putrid odor since it seems to have occasioned (twice) a desperate search for the vomit bag – as he reports that it “just makes me want to barf” (no, I’m still not making this up). (Fortunately, our somewhat indisposed professor did make a few substantive points – I will come to that in a moment.) In a blog post later in the year (12/09: “Buggy-whip update”), he tells his readers that six days after the posting just mentioned, he sent a personal email to Link asking him to explain “his” (i.e. AAPOR’s) statement of August 1 6. He received no response. Somewhat miffed, the professor writes: “I get frustrated when people don’t respond to my queries.” Tell me about it! Now, it seems to me that it doesn’t take a very sophisticated statistical model to predict that the probability of receiving a response given Gelman’s August 6 post is much closer to zero (0=no response) than to one (1=response).
Now to the substance. Gelman makes the point that there really is no difference between a “probability” sample that has a response rate of 10% and an opt-in Internet panel – both are self-selected samples. In either case, in order to estimate what it is you are trying to estimate (e.g. the percentage a political candidate will receive), you “have to do some adjustment to correct for known differences between sample and population,” and in the process “make assumptions”. The methodology is “not new”, he says, and “a lot of research” has been done on these issues. Regarding the latter, he mentions the work of Roderick Little, an expert in the statistics of “missing data”.
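Gelman’s “adjustment to correct for known differences between sample and population” can be made concrete with a toy sketch. The example below (all category shares and responses are invented for illustration) post-stratifies a self-selected sample on a single demographic margin; real-world adjustments, such as YouGov’s, involve many variables and model-based methods, so this shows only the general principle, not any pollster’s actual procedure.

```python
# A minimal sketch of post-stratification weighting on one known margin.
# All numbers below are invented for illustration.

# Known population shares for an age-group margin (e.g. from a census).
population_share = {"18-29": 0.21, "30-49": 0.33, "50-64": 0.26, "65+": 0.20}

# Self-selected respondents: (age_group, supports_candidate: 1 = yes, 0 = no).
respondents = [
    ("18-29", 1), ("18-29", 1), ("18-29", 0),
    ("30-49", 1), ("30-49", 0),
    ("50-64", 0), ("50-64", 0),
    ("65+", 0),
]

n = len(respondents)
sample_share = {g: sum(1 for a, _ in respondents if a == g) / n
                for g in population_share}

# Weight each respondent so the weighted sample matches the population margin.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

raw = sum(y for _, y in respondents) / n
adjusted = (sum(weights[g] * y for g, y in respondents)
            / sum(weights[g] for g, _ in respondents))

print(f"raw estimate: {raw:.3f}, weighted estimate: {adjusted:.3f}")
```

Here the young (who lean yes) are overrepresented in the self-selected sample, so the weighted estimate comes out lower than the raw one. Whether weights computed this way remove the selection bias depends entirely on the assumption that, within each cell, respondents resemble non-respondents, which is Gelman’s point about having to “make assumptions”.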
A Sociological View
This controversy illustrates several themes of the sociology of science, in our case, social science:
AAPOR, as the guardian of agreed-upon standards for the conduct of polls and survey research, is duty bound (as one AAPOR member put it, it would have been irresponsible of AAPOR not to have said something) to intervene when any of its norms have been violated – whether the violator is a member of the association or not. In its view, the NYT/CBS organization had done just that when it decided to base its election forecasts on polling data that came from an opt-in Internet panel, i.e. from a non-random sample. Generally, in the past, these types of samples have been considered un-scientific. In contrast, probability (aka random) samples have been recognized, if not adopted, as the “gold standard” of sampling since the late 1940s – at least in the United States. In other words, to borrow from sociologist Thomas Gieryn, it has been the task of AAPOR to demarcate science (probability sampling) from non-science (non-probability samples) in the field of polling and survey research 7. These norms, for example, have forced news network organizations to warn their viewers, when reporting the results of a call-in poll (aka 1-800-poll or “junk” poll), that the numbers on their screens were obtained from a “non-scientific” survey.
What is considered science in the polling world has changed over the years. In the 1930s and 40s (1935-1948), the new pollsters (Crossley, Gallup, and Roper) promoted a distinctly non-probability methodology (quota sampling) as science – and (before 1948) nobody really challenged them on this the way AAPOR is now challenging the NYT/CBS organization 8. Until very recently, were you to use a quota sample or any other non-random sampling methodology for your study, you were liable to get your wrists slapped (figuratively, of course) – at least, and I repeat, in the US 9. The Hite Report is a good example 10. Thus, science is what those who are empowered to say what it is say it is. And science varies depending on the era you live in and the part of the world you reside in.
The AAPOR statement is an opportunity for the association to assert its authority. It is “the leading association of public opinion and survey researchers”, and as such its credentials cannot be doubted. The statement is also the occasion to reiterate the basic tenets of the faith: “a fundamental belief in a scientific approach”; “objective standards”; polls, conducted according to “standards of quality”, “mirror reality” (that is, social reality); etc. AAPOR’s basic ruling in the August 1, 2014 release is that Internet opt-in panels are NOT quite ready for the big league – pre-election polling; they’re still wet behind the ears. The time to extend the boundaries of what is considered scientific in polling is not now, because “these new approaches and methodologies” still require “rigorous empirical testing”, etc. In other words, AAPOR re-emphasized the demarcation line between legitimate polls or surveys that provide reliable knowledge about social reality (e.g. public opinion), and “polls which are not conducted with scientific rigor” whose results are “highly questionable, if not outright incorrect”. It also stated that it is not opposed to the idea of widening the boundaries, indeed it “encourage[s] assessment of [these methodologies’] viability for measure and insight”, but this must be done “within a strong framework of transparency, full disclosure and explicit standards.”
Transparency: a device to unmask illegitimate, non-scientific polls
One thing readers of the AAPOR statement might have noticed is the heavy emphasis that has been given to “transparency”. Transparency is the act of unveiling all the steps that were taken to generate the final poll results that are published: from the sampling design, question wording, and data collection mode (e.g. telephone, web), to weighting and other forms of “adjusting” the raw data. AAPOR launched its Transparency Initiative (TI) in 2014. Now, as anybody who has studied the history of polls in this country will tell you, “transparency” is not one of the pollsters’ most conspicuous virtues. Back in the 1940s, one reporter complained that he had made “several informal attempts (…) to check facts and figures” regarding Gallup polls, but that “all ended in failure” 11 (p.737). Two decades later things seemed to have improved a bit. Trying to get answers about the “Gallup system of processing” polling data, a New Yorker columnist had this to say: “By calling members of the Gallup staff, and by writing to Dr. Gallup, I was able to get answers – reluctant and incomplete, but still answers – to some of my questions about the process” 12 (p.174). Nevertheless, in the following decade, some folks were still not satisfied with the pollsters’ transparency, and not just anybody: a member of Congress drafted an unsuccessful bill under the name “Truth-in-Polling Act”.
The AAPOR release states that the organization “has for decades worked to encourage disclosure of methods.” Be that as it may… but between encouragement and actual disclosure, there is a wide gap. Take, as an example, a recent (January 2016) poll by the Harvard School of Public Health in collaboration with STAT, an organization that reports news in the health and medical field 13. One page of the 15-page report is dedicated to the poll’s methodology. Although it provides a fair amount of detail (sample size, type of sampling, dates during which the poll took place, mode of interviewing), one would be hard pressed to find any information on the response rate – even though it warns the reader that non-response bias can be part of the total error of the survey. Now I am not trying to pick on the folks who did this particular survey (I just happened to receive the results in my inbox as I was writing this), or the polling house (SSRS) that actually conducted the data collection, and is a member of AAPOR’s Transparency Initiative; I am sure they are all fine upstanding researchers. I am merely illustrating that the ideal of “full disclosure” that AAPOR promotes is yet to be realized – as some gentleman from China has said, I am told, “the future is bright but the road is tortuous”.
So, one may ask, why this hard push about transparency? Answer: the Internet (at least one answer). Thanks to the advent of this technology just about anyone can do a survey or poll nowadays. Throw a few questions together (you know how to ask questions, don’t you Steve?), spend a few bucks (or loonies, if you’re in Canada) to use SurveyMonkey or some other web-survey platform, put an ad on craigslist (or elsewhere) to recruit your participants; when you’re done download the results into Excel, et voilà, you’ve got yourself a study. (Disclaimer: I want the reader to know that I am neither promoting nor endorsing the companies mentioned. I am just describing what I have witnessed during the course of my professional career.) Because this technology is so ubiquitous and seemingly user-friendly, it endangers the monopoly the polling profession has over the production of knowledge about society, in general, and public opinion, in particular. It threatens the profession in that it creates the appearance that to conduct a survey or poll no longer requires “expert” knowledge – just like the museum visitor standing in front of a Jackson Pollock painting and exclaiming “My six-year-old could’ve done that!”
The promotion of transparency is, in part, a demarcation maneuver (again to borrow from Gieryn). It is a means for AAPOR to assert its authority and to reiterate what is and what is not legitimate when it comes to polling. Those who have joined the Transparency Initiative are recognized as worthy (i.e. scientific) polling practitioners; it is akin to the labels consumers find on product packaging in the supermarket. The “Transparency Initiative” label tells consumers that the product from a particular organization is fit for consumption; by extension, the products of those who have not joined the TI should be viewed as suspect (non-scientific).
One last thing about “transparency”: there is no such thing as “full disclosure” – at least from commercial polling houses. Polling is a business, big business, not mere idle curiosity. These companies can always invoke proprietary rights to avoid revealing how the results they publish have been produced. Thus, the gruesome details remain hidden from the public eye 14.
Redrawing the boundary: what should be considered science?
One of the themes in Gelman’s response to AAPOR is the contested nature of the boundary between what constitutes sound (i.e. scientific) polling practice and what does not. As we saw, AAPOR is firmly attached to the principle of probability (aka random) sampling. As I said before, this has been the central credo of the polling profession for decades. Gelman wants the boundary to be extended; he wants to push the demarcation line so that it will include non-probability samples. In reality, the line’s location has already been renegotiated since, nowadays, probability samples with response rates of 10% or less are still considered scientific 15. Gelman’s argument is that there is no difference between these types of samples and samples that recruit their respondents off the Internet (à la YouGov): they are both self-selected samples. Gelman writes: “the ‘grounding in theory’ that allows you to make claims about the nonrespondents in a traditional survey, also allows you to make claims about the people not reached in an internet survey.” In both cases, after the poll’s raw results are in, the analyst will have “to do some adjustment to correct for known differences between sample and population.”
In fact, Gelman is intimating that AAPOR seems to be unaware of this shift of the demarcation line, namely, that methodologies such as the one used by YouGov are definitely inside the scientific corral. In his view, they have become a legitimate part of the polling culture. Both he and Baker state that data obtained from Internet opt-in panel polls have a solid pedigree: they have passed muster. How? By the traditional, tried-and-true means of establishing one’s claim to scientificity, or scientific worth: the peer-review system and presentations at conferences attended by one’s peers. Baker writes: “There is a substantial literature stretching back to the 2000 elections showing that with the proper adjustments polls using online panels can be every bit as accurate as those using standard RDD samples.” He could have added “or inaccurate” to his statement for the sake of completeness.
Just as AAPOR relies on “transparency” to question the scientific credentials of polling houses that rely on non-probability samples, like YouGov, Gelman, clearly wedded to the transparency norm, underscores the fact that YouGov’s chief scientist “has detailed the methodology at length and subjected the methodology and results to public transparency that rivals the best practices of major polling companies.” In addition, that same individual has written “academic papers (…) published in the top peer review journals.” He adds: “If anything, people on the cutting edge of research are not hiding anything; on the contrary, we are fighting hard to overcome entrenched methods by being even more diligent and transparent.” So there you are. Who could doubt the scientific worth of the new polling techniques? They have been peer-reviewed and they are as transparent as Baccarat crystal. Clearly they have proven, so Gelman believes, their scientificity, and therefore their legitimacy. So what’s the beef?
In his diatribe of August 6, Gelman adopts a rhetoric that has a quasi-moralistic tone: he paints AAPOR as a force opposing progress. The title of his post could not be more explicit: AAPOR is stuck in the past, still relying on the horse-drawn “buggy” to get around, whereas he and his acolytes are the forces of progress, gliding in the most up-to-date mode of transportation, the automobile, propelled by the internal combustion engine. Who could argue against progress? Who would support obscurantism? AAPOR, apparently. Thus, Gelman’s whiggish attitude seems to want to locate this venerable institution beyond the pale – in that hellish zone of non-science. But really what he wants AAPOR to do is to recognize the scientific character, the legitimacy and respectability, of the new polling methodologies. It is time, he proclaims, to expand the scientific territory, to push back the boundaries, for the de jure to catch up with the de facto.
Resolution? Plus ça change… or “déjà vu all over again” (Berra)
The debate around the NYT/CBS announcement boils down to this: are polling samples based on Internet opt-in panels ready for prime time or not? AAPOR says no, Gelman and like-minded researchers say yes. How is this controversy going to be resolved? If the issue appears to be unsettled, it is only in the sphere of the de jure (an official acknowledgment from AAPOR), because, on the ground, in the de facto world, it has been resolved: pollsters have “voted with their feet”. Internet opt-in panels have been in use in the commercial polling world for nearly two decades. Powerful economic interests are at stake here: all the corporate polling organizations that have sprouted as a result of the advent of the Web. And it is not some statistical theory (probability sampling), however prestigious, especially when its application is doubtful and cumbersome, that is going to stand in the way of business: clients expect actionable results, while the polling house expects to be profitable – and so do corporate clients. Besides, as some believe (Gelman and others), plenty of tools have been developed to mitigate the limitations of self-selected (opt-in) samples, and their scientific character cannot be impugned: they can “mirror” reality just as well (or as badly) as the next probability sample.
The issue is how this is going to worm itself into AAPOR’s code of professional practice. In fact, non-probability samples have already carved themselves a bit of territory within the AAPOR canon. The current AAPOR Code of Ethics (November 2015 update) states: “Disclosure requirements for non-probability samples are different because the precision of estimates from such samples is a model-based measure (rather than the average deviation from the population value over all possible samples). Reports of non-probability samples will only provide measures of precision if they are accompanied by a detailed description of how the underlying model was specified, its assumptions validated and the measure(s) calculated. To avoid confusion, it is best to avoid using the term ‘margin of error’ or ‘margin of sampling error’ in conjunction with non-probability samples” 16. So the non-probability sample, anathema as it was in the not too distant past, has got its foot in the door – and then some. Does that mean the controversy is over? Apparently so. Of course, a lot of folks are not too crazy about non-probability samples; their probability counterparts are so much neater – if only those darn people cooperated, the blissful days of the 70%+ response rate would be back. But what can you do, if you’re not the Federal government? The show must go on, as thespians say. Hence the online opt-in panel. Thus, non-probability sampling and probability sampling, now both harboring the science label, seem destined to live side-by-side in peaceful coexistence for the foreseeable future.
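Since the AAPOR code reserves the term “margin of error” for probability samples, it is worth recalling what that number actually is. The sketch below computes the textbook 95% margin of error for a simple random sample (the n = 1,000 figure is purely illustrative); AAPOR’s point is precisely that this design-based formula has no justification when respondents select themselves into the sample.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Classical 95% margin of error for a proportion p estimated from a
    simple random sample of size n -- the quantity the AAPOR code says
    should not be attached to non-probability (opt-in) samples."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst case (p = 0.5) for an illustrative n = 1,000 probability sample:
moe = margin_of_error(0.5, 1000)
print(f"+/- {100 * moe:.1f} percentage points")
```

For an opt-in panel of the same size the arithmetic still runs, which is exactly the trap: the output looks like sampling precision but is, in AAPOR’s terms, a model-based measure whose validity rests on the adjustment model, not on the sample design.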
The polling profession has come full circle: it started its modern career (ca. 1935) using non-probability samples (quotas), and now it has gone back to its roots by relying on opt-in online panels. And both claim to be scientific. Another feature they have in common is their dependence on very large samples, much larger than is required if one uses probability sampling. In the ‘30s, Gallup used “vote poll” samples in the one hundred to two hundred thousand range 17. This was considered progress compared to the mass mailing (10 million) done by the most prestigious poll of that era: the Literary Digest poll. The scientific pollsters (Crossley, Gallup, and Roper) considered the Digest’s approach to be wasteful, among other things. Nowadays, online polling organizations also rely on samples in the tens of thousands to make their forecasts.
Scientific practice, here social scientific practice, seems to be ruled, in part, by the Humpty-Dumpty philosophy: “Science means just what I choose it to mean–neither more nor less.” (The reader will forgive, I hope, the poetic license, once again.) Moreover, what constitutes science depends on the circumstances. As I said, the quota sampling used in the 30s and 40s by Crossley, Gallup, and Roper was considered scientific, and labeled as such, even though probability sampling was known and, in 1934, had been demonstrated by a Polish mathematician-statistician, Jerzy Neyman, to be superior to any other form of sampling. The pollsters did not adopt probability sampling until well after their disastrous prediction of a Dewey victory over Truman in the 1948 presidential election. Folks in federal agencies, such as the Department of Agriculture, quickly adopted Neyman’s approach, and he was invited to lecture the staff on the issue of probability sampling. So, in effect, two forms of “scientific” sampling, although apparently polar opposites, one a probability methodology, the other a non-probability practice, coexisted for a number of years. Why does that sound familiar?
But let’s come back to our world. Whose “science” is winning? Link’s or Gelman’s? But is there a contest in the first place? I think not – in spite of the appearances: the bile spilled, the moral high ground (e.g. innovation vs. “methodological conservatism”), the abandonment of standards, etc. Gelman and like-minded data analysts are going about their business. As Gelman puts it, addressing AAPOR: “How bout [sic] you do your job and I do mine.” Indeed, no one in one’s right mind is going to strip probability sampling of its scientific legitimacy. But it is its practical implementation these days that makes it problematic for many survey researchers, thus their reliance on the opt-in methodology thanks to the rise of the Internet. This difficulty in applying probability sampling is reminiscent, if the reader allows me to go down memory lane once again, of the assessment made by the pollsters of the 30s and 40s. Gallup wrote: “Although random sampling can be highly accurate in the case of homogeneous populations, and is in many cases the simplest sampling method, there are times when it cannot be used successfully. Sometimes the statistical universe is heterogeneous–that is, it is composed of a number of dissimilar elements which are not evenly distributed throughout the whole. In addition, the universe is sometimes so widely distributed or so inaccessible that it is not feasible to set up a random sampling procedure which will guarantee that each unit has an equally good chance of being included in the sample” 18. Thus, they chose to use quota sampling. Gallup and his fellow pollsters were not the only ones in those days to think that way. As eminent a statistician as Samuel Wilks could state: “In the case of large-scale polls, which are made on a state-wide or nation-wide basis, it is clear that it would be impossible, or at any rate highly impractical to draw a random sample from the population under consideration” 19. 
Just like today, the pollsters of yesteryear found it very difficult to implement probability sampling, so they relied on a non-probability methodology to select respondents to their polls.
I have tried to illustrate the back-and-forth way the science label has been attached to and then taken away from non-probability sampling depending on the circumstances. During the early era of modern polling (1935-1948) in America, pre-election and issue polls were characterized by a distinctly non-probability methodology, the quota sample, which, nevertheless, was branded as scientific by the pollsters of that time. The circumstances then were that probability sampling was not a viable method for the pollsters in those days. During the golden era of random-digit-dialing (RDD) telephone surveys, any form of non-probability sampling was frowned upon and considered distinctly non-scientific. Respondents in non-probability samples only represent themselves, we were told sternly. The circumstances then were that polls were blessed with relatively high response rates (70%+). Then, just in time, came the Internet or Worldwide Web era, and non-probability samples were back in business. The circumstances then were that traditional RDD surveys were (are) plagued with appallingly low response rates, making it increasingly costly, and thus impractical, to implement this methodology. That tension is present in the world of polling and survey research seems clear enough. On one side is the AAPOR statement; the association appears reluctant to confer the science label on opt-in Internet polls. On the other, there are those who rely squarely on that technology and believe in its scientificity. Nowadays, we live in an era, and not for the first time in the history of polling, in which two seemingly opposite sampling methodologies are used by practitioners. Both technologies have been labeled as science and both are riding into the sunset, perhaps not hand-in-hand but definitely side-by-side, towards new successes and failures for the foreseeable future.
Does the Spanish philosopher George Santayana’s adage, “Those who cannot remember the past are condemned to repeat it”, apply to the polling industry or not? Or does it matter?
7 Thomas F. Gieryn: “Boundary-Work and the Demarcation of Science from Non-Science: Strains and Interests in Professional Ideologies of Scientists”, American Sociological Review, Vol. 48, No. 6 (Dec., 1983), pp. 781-795.
8 They were criticized by a few statisticians during the course of a congressional hearing in December 1944: Hearings, Committee to Investigate Campaign Expenditures, House of Representatives, Seventy-Eighth Congress, Second Session, on H. Res. 551. See for example p. 1294: “The quota-sampling method used, and on which principal dependence was placed, does not provide insurance that the sample drawn is a completely representative cross-section of the population eligible to vote, even with an adequate size of sample.” But to no avail.
9 In other countries, France for example, polling organizations have been using the quota methodology with, presumably, as much success and failure as their American counterparts using probability sampling. To paraphrase, with some poetic license, one of their 17th century compatriots: science on this side of the Atlantic, non-science on the other side.
10 This is a fascinating case and a real treasure trove for the sociologist of (social) scientific knowledge, and merits a post in itself – I will work on it. I mention it here because it was roundly criticized by AAPOR, among others, for the lack of randomness of the samples and the very low response rate to its questionnaires – in other words, not much different than one of today’s Internet or telephone surveys.
11 Benjamin Ginzburg, “Dr. Gallup on the mat”, The Nation, December 16, 1944, pp. 159, 737-739.
12 Joseph Alsop, “Dissection of a Poll”, The New Yorker, September 24, 1960, pp. 170-174, 177-184.
13 http://www.statnews.com/2016/02/11/stat-harvard-poll-gene-editing/ and https://cdn1.sph.harvard.edu/wp-content/uploads/sites/94/2016/01/STAT-Harvard-Poll-Jan-2016-Genetic-Technology.pdf (p.10 for the methodology; retrieved Thu 2/11/2016).
14 Academic survey research centers don’t escape the bottom line either: they will be closed down if they don’t meet certain financial standards. Knowledge production is good but not at any cost.
15 Pollsters and survey researchers have always had to struggle with low response rates: in other words, low response rates are nothing new. Contrary to what a recent article in the New Yorker claims [http://www.newyorker.com/magazine/2015/11/16/politics-and-the-new-machine] (a claim later picked up by the Guardian [http://www.theguardian.com/us-news/datablog/2016/jan/27/dont-trust-the-polls-the-systemic-issues-that-make-voter-surveys-unreliable]), response rates in the 1930s in America were not in the 90s. The most prestigious poll during that era was conducted by the Literary Digest (a weekly magazine similar to today’s Time): the highest response rate it achieved was about 24% in 1930 and 1936. When the new pollsters (Crossley, Gallup and Roper) emerged in 1935, they used quotas as their sampling methodology, from which a response rate cannot be computed. However, Gallup did use mail-in ballots, in addition to in-person interviews, for his pre-election polls of 1936. Two researchers assessing Gallup’s ballot returns wrote: “As a rule less than one-fifth of the mailed ballots are returned and these tend to come from selected groups. (…) The [Gallup] Institute found that the largest response (about 40 per cent) came from people listed in Who’s Who. Eighteen per cent of the people in telephone lists, 15 per cent of the registered voters in poor areas, and 11 per cent of people on relief returned their ballots” – a far cry from 90% (Daniel Katz & Hadley Cantril, “Public Opinion Polls”, Sociometry, Vol. 1, No. 1/2, Jul.-Oct., 1937, p. 160).
17 “POLL: Dr. Gallup to Take the National Pulse and Temperature”, News-Week, October 26, 1935, p.24. Gallup was less than transparent when it came to revealing the exact size of his samples.
18 George Gallup and Saul Forbes Rae, The Pulse of Democracy: The Public-Opinion Poll and How It Works, 1940, Simon & Schuster, New York, p.59.
19 Samuel S. Wilks, “Representative Sampling and Poll Reliability”, The Public Opinion Quarterly, Vol. 4, No. 2 (Jun., 1940), p. 262.