In March of 2016, the American Statistical Association (ASA), “the world’s largest professional association of statisticians” (5) 1, took an unprecedented step: it issued a statement (“ASA Statement on Statistical Significance and P-values”), which was published online under the auspices of one of its publications, The American Statistician (TAS), on the “proper use and interpretation” (7) of a certain statistical measure – the “p-value”. For those of you who escaped, in high school and/or college, the blissful world of statistics, “p-value” is short for “probability value”, which is a numerical index practitioners rely on to reach a conclusion based on the data they’re analyzing. Generally speaking, we use probabilities as a measure of uncertainty: the probability of coming up Heads on the flip of a fair coin is said to be 0.5 or 50%; if you decide to play the lotto (SuperLotto Plus) in my home state of California, there is roughly one chance in 40 million that your number will come up; in other words, you’re nearly certain to lose – but, still, there is a chance, however infinitesimal, that you will win. (NOTE: This is not an endorsement of games of chance.)
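For the reader who would like to see the coin example pushed one step further, here is a minimal sketch of what a p-value is in that setting. The numbers (100 flips, 60 or 52 Heads) and the function name are mine, chosen purely for illustration, and the "exact two-sided binomial test" used here is only one of several conventions. The p-value is the probability, assuming the coin is fair, of a result at least as unusual as the one observed.

```python
from math import comb

def binomial_two_sided_p_value(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """Exact two-sided p-value for a coin-flip experiment.

    Sums the probability, under the null hypothesis that P(Heads) = p_heads,
    of every outcome that is no more likely than the observed one.
    """
    def pmf(k: int) -> float:
        # Binomial probability of exactly k Heads in `flips` flips
        return comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)

    observed = pmf(heads)
    # The small tolerance guards against floating-point ties
    return sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))

# Hypothetical experiments: 100 flips of a coin we suspect might be biased
p_60 = binomial_two_sided_p_value(60, 100)  # 60 Heads: fairly unusual for a fair coin
p_52 = binomial_two_sided_p_value(52, 100)  # 52 Heads: entirely unremarkable
print(f"60 Heads out of 100: p = {p_60:.3f}")
print(f"52 Heads out of 100: p = {p_52:.3f}")
```

A small p-value says only that the data would be surprising if the coin were fair; it is not the probability that the coin is fair.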
The pronouncement, whose target audience is “researchers, practitioners and science writers who are not primarily statisticians” (3), was unheard of because never before in its long history (the association was founded in 1839) had the ASA told practitioners how to use any statistical technique or methodology. (In the interest of full disclosure I should let the reader know that I am a member of the ASA.)
How did this all come about? An introduction to the ASA statement, written by the association’s executive director, Ron Wasserstein, and the editor of TAS, Nicole Lazar, provides some background information. Its purpose is to explain what led the association’s board of directors to make the statement and the process that led to its publication. Wasserstein and Lazar identify two areas of concern that “stimulated” (1) the board’s response: i) a recent, “highly visible” (1), and ongoing discussion within scientific journals on the questionable use of statistical methods in the process of scientific discoveries; ii) the reproducibility and replicability “crisis” (2) in science.
Regarding the first issue, Wasserstein and Lazar quote several sources that talk about statistics and its “flimsy foundation” (1), its “numerous deep flaws” (1), etc., on one side; and on the other, the defenders of statistical methods who claim that the problem is not statistics, but that a lot of data analysis is done by people who are not “properly trained [my emphasis] to perform” (1) it. The second area that spurred the ASA board to action is described thus: “The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions.” (2) A “reproducibility and replicability” crisis means that nobody else can come up with the “scientific conclusions” the original researchers presented. For example, some readers may remember the “cold fusion” episode, back in 1989. Well, here was a case where two University of Utah researchers made a claim to scientific discovery, but no one else in their field was able to replicate their findings.2 The “reproducibility and replicability crisis” has been brewing for a few years, but seems to have come to a head in August 2015 with an article in the journal Science which found that although 97% of the original studies that were scrutinized, albeit all in the field of psychology, reported “statistically significant” results, only 36% of the replications did (p.944).3 In the ASA’s view, this creates “much confusion and even doubt about the validity of science” (2), and the “misunderstanding or misuse of statistical inference” (2) is partly responsible for this situation.
The authors tell us that “the Board envisioned that the ASA statement on p-values and statistical significance would shed light on an aspect of our field that is too often misunderstood and misused in the broader research community, and, in the process, provide the community a service.” (3) At the Board’s behest, Wasserstein assembled a “group of experts representing a wide variety of points of view” (3) to complete this task. Wasserstein and Lazar report that the “statement development process was lengthier and more controversial than anticipated.” (4) They also assure us that “nothing in the ASA statement is new. Statisticians and others have been sounding the alarm about these matters for decades, to little avail.” (5) They expressed the hope that the statement “would open a fresh discussion and draw renewed and vigorous attention to changing the practice of science with regards to the use of statistical inference.” (5)
The ASA’s message
Given this array of dire circumstances, e.g., some researchers throwing the “p-value” into the “dustbin of history” (more on that later), one could expect a statement full of vim and vigor, breathing fire and brimstone, and mounting a vigorous defense of one of the cornerstones of the “science of statistics”. But no. Instead, we are regaled with a pronouncement couched in very mild language whose ambition is to clarify “several widely agreed upon principles underlying the proper use and interpretation of the p-value” (6-7). So, for example, it tells us that “the p-value can be a useful statistical measure”: hardly a ringing endorsement, but neither is it a recommendation to discard it. The ASA statement is divided into five sections: Introduction; What is a p-value?; Principles; Other Approaches; and Conclusion. When Wasserstein and Lazar in their Introduction (1-6) state that “[n]othing in the ASA statement is new” (5), they are not kidding. The contents of the Principles section of the statement read, for the most part, like something students taking their first introductory course in statistics would find in a widely relied upon textbook like David S. Moore’s The Basic Practice of Statistics (New York: W.H. Freeman, 1995; now in its seventh edition): for example, failing to reject the null hypothesis (H0) does not mean you have proved it to be true; rejecting the H0 does not mean it is false or that your research hypothesis (symbolized as H1) is true; statistical significance (i.e. rejecting the H0 and concluding in favor of H1) is not necessarily the same as substantive or clinical importance; etc. The beginner in statistics is, of course, entering a new cultural realm; like any other practice such as learning to be a chef, a crane operator, an automotive technician, or a brain surgeon, it is a process of socialization, i.e. it is a process that inculcates the rules and behavior that are considered appropriate within that culture. 
Thus, all these prescriptions are rituals used to induct you into the scientific culture of inferential statistics: the student learns the norms that define the “proper use and interpretation of the p-value” (8).
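One of these textbook prescriptions, that statistical significance is not the same as substantive importance, lends itself to a small numerical illustration. What follows is only a sketch under assumptions of my own: two hypothetical studies with made-up numbers, a known standard deviation, and a simple two-sample z-test (a real analysis would more likely use a t-test). None of it comes from the ASA statement.

```python
from math import erfc, sqrt

def two_sided_z_p_value(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference between two group means,
    assuming equal group sizes and a known common standard deviation."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return erfc(abs(z) / sqrt(2))

# Hypothetical study A: a trivial effect (0.02 standard deviations), enormous sample
tiny_effect_p = two_sided_z_p_value(mean_diff=0.02, sd=1.0, n_per_group=1_000_000)

# Hypothetical study B: a sizeable effect (0.5 standard deviations), very small sample
large_effect_p = two_sided_z_p_value(mean_diff=0.5, sd=1.0, n_per_group=10)

print(f"Trivial effect, huge sample:   p = {tiny_effect_p:.3g}")
print(f"Sizeable effect, small sample: p = {large_effect_p:.3g}")
```

Under these assumptions the trivial effect comes out “statistically significant” and the sizeable one does not. The p-value mixes the size of an effect with the size of the sample, which is why it cannot by itself tell us whether a result matters.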
The statement claims that “misuses and misconceptions concerning p-values” are “prevalent” (11) in “the broader research community” (3) among those “who are not primarily statisticians”. It also states that “some statisticians prefer to supplement or even replace p-values with other approaches” (11), thereby encouraging “the broader research community” to do the same. One of the “other approaches” mentioned is “confidence intervals” (11). I would wager that there are just as many “misuses and misconceptions” (perhaps more “misconception” than “misuse”) of “confidence intervals” among “the broader research community” as there are concerning p-values. To take just one example: a scholarly book first published in 2012, which is a compendium of articles on a specific topic. (The book will remain nameless, as will the author, and the quote has been modified to ensure anonymity. I do not wish to embarrass anybody; after all, errare humanum est, and I’ve done plenty of that myself, so I am hardly in a position to cast the first stone.) The article in question, suitably altered, states: “Swedish public approval for paternity leave is 67% ±3 percentage points. (…) [I]n repeated samples, we would expect the true level of public support for paternity leave to fall between 64 percent and 70 percent in 95 out of 100 samples.” Clearly, this illustrates a misinterpretation of the concept of “confidence interval” (the true level of support is a fixed quantity; it is the interval that varies from sample to sample), but the editors of the book did not catch it in time. However, the second edition, published four years later, corrects the mistake: “if the survey were repeated many times, 95 percent of the samples of this size would be expected to produce a margin of error that captures the true percentage of Swedes supporting paternity leave.” Therefore, advocating the use of confidence intervals in lieu of p-values does not seem to be much of a solution. 
Obviously, supplementing the p-value with a confidence interval would not satisfy, one would think, those who advocate its abandonment.
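The correct reading of a confidence interval is easier to see in a simulation than in a sentence. Here is a minimal sketch, with assumptions that are mine and not the book’s: a true level of support of 67% (in real life we never know it), samples of 1,000 respondents, and the usual textbook interval (estimate ± 1.96 standard errors). Each simulated survey produces its own interval, and we count how often that interval captures the fixed true value.

```python
import random
from math import sqrt

def coverage_of_95_percent_intervals(true_p: float = 0.67, n: int = 1000,
                                     repetitions: int = 2000, seed: int = 1) -> float:
    """Fraction of simulated surveys whose 95% interval contains true_p."""
    rng = random.Random(seed)  # fixed seed so that the run can be repeated
    hits = 0
    for _ in range(repetitions):
        # One simulated survey: each respondent approves with probability true_p
        supporters = sum(rng.random() < true_p for _ in range(n))
        p_hat = supporters / n
        margin = 1.96 * sqrt(p_hat * (1 - p_hat) / n)  # textbook margin of error
        # The true value never moves; the interval changes with every sample
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / repetitions

coverage = coverage_of_95_percent_intervals()
print(f"Share of intervals capturing the true value: {coverage:.3f}")
```

The share should come out close to 95%. That is all the “95%” refers to: the long-run success rate of the procedure, not the behavior of the true value, which stays put throughout.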
The ASA statement is by no means condoning the banishment of the p-value – nor is the ASA likely to do so in the future. This methodology has been with us for nearly a century and has been used, correctly or not, in multitudes of studies in a variety of disciplines that all harbor the science label. It is an elaborate scheme that has been the centerpiece of statistical practice and is based on the work of heavy hitters like Ronald Fisher (1890-1962), Jerzy Neyman (1894-1981) and Egon Pearson (1895-1980).
Another of the “other approaches” (11) mentioned in the statement is Bayesian statistics. This is a methodology that lost out to what is often referred to as the frequentist school (Fisher, and Neyman-Pearson) back in the 1930s and 40s. These are the two major schools of inferential statistics. “Lost out” does not mean the Bayesian approach is without its aficionados: in fact, it has been used by a substantial minority in the statistical community starting in the 1950s. But it has always been treated as a second-class citizen in the world of statistics: most introductory textbooks and beyond teach the frequentist creed (null hypothesis testing and the p-value), and (just about) all the commercial software packages are programmed along that same doctrine. It is why you are more likely to be assigned as an introductory textbook the one mentioned earlier by David S. Moore, or one by Mario Triola (Elementary Statistics), rather than one by Donald A. Berry (Statistics: A Bayesian Perspective). But Bayesianism is not without controversy either…
Statistical testing by means of the null hypothesis (NHST, hereafter) and its resulting p-value is one of the cornerstones of knowledge production in many sciences. Back in 2001, a prominent statistician could write: “hypothesis testing has become the most widely used statistical tool in scientific research” (David Salsburg, The Lady Tasting Tea, p.114). Controversy about this approach is by no means new: it is almost as old as the methodology itself, and it has been going on ever since. That hasn’t stopped the majority of statistics users from relying on this approach (“that’s what we’re taught; so we do what we’re taught”). So why did the ASA feel that it had to come out with a statement at this time in the history of its discipline? In other words, why this sudden urge on the part of the ASA to intervene on a topic that has been contentious for decades? I’d like to suggest a few items that might help make sense of the ASA’s action, and the tone and contents of its statement. I’m sure there are many more, but these come to mind immediately and seem to me to be important.
First, the dominance of the frequentist school is being challenged; not so much by means of debates between the two camps (those have been going on for years), but by a sort of critical-mass effect favoring the Bayesian school of thought: i.e. an increasing number of practitioners are adopting that approach. It seems to me that the frequentist creed has lost the hegemonic position it occupied for so long in the field of statistics. As David Salsburg states in his very informative history of statistics in the 20th century: “By the end of the twentieth century, [Bayesian statistics] had reached such a level of acceptability that over half the articles that appear in journals like Annals of Statistics and Biometrika now make use of Bayesian methods” (op. cit., 129-130).
The message from the ASA statement is: there is more to the practice of statistics than NHST. This is reflected in the composition of the panel of experts put together by the ASA. At first glance, and as best as I can judge, it appears to me that there is a good mix of different schools of statistical practice, including Bayesians and frequentists, which may also explain the mild language I referred to earlier. The “wide variety of points of view” (3) represented by the group of experts ensured that the ASA statement would be reflective of that diversity, thus requiring compromise (“The statement development process was lengthier and more controversial than anticipated” (4)). As the introduction by Wasserstein and Lazar reports, the statement went through “multiple drafts” (4). The ASA statement clearly shows that nobody got the upper hand: there’s NHST of course, but there are also “other approaches” just as important.
I have mentioned earlier that the contents of the Principles section of the ASA statement read like something you would find in an introductory statistics textbook. Did the ASA believe that the reaffirmation of these principles would put an end to the “misunderstanding or misuse of statistical inference” (2); that “the proper use and interpretation of the p-value” (8) would blossom as a result? After decades of initiating thousands upon thousands of students in a variety of scientific disciplines into the ways of NHST, the misuse and misinterpretation of the p-value persist. It is not the reiteration of rules that folks have been taught in their first class of statistics (and repeated thereafter) that is going to change that situation. So what’s the point? I would say that the ASA statement is primarily a symbolic act. Of course, misusers and misinterpreters of statistical inference are not going to see the light suddenly after reading the Principles section of the statement. What the ASA is doing is asserting its jurisdiction over the field of statistics. It is telling those “who are not primarily statisticians” (3) that it alone has the authority to determine what qualifies as the “proper” use of statistical tools. It is not for non-statisticians to decide what in statistics should or should not be discarded. As the Introduction to the ASA statement says: “Though there was disagreement on exactly what the statement should say, there was high agreement that the ASA should be speaking out about these matters” (5). In other words, despite the diversity and discord within the ASA, it stands united when it perceives that the discipline itself is being challenged. Certainly one of the functions of a professional organization is to uphold and maintain the good reputation of its area of activity.
Although the ASA statement assigns responsibility for the “crisis”, it conveniently sidesteps the issue of ultimate cause. It piously and wishfully states “that the scientific community could benefit from a formal statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value” (7-8). In the Introduction to the statement, the authors bemoan that “our field (…) is too often misunderstood and misused in the broader research community” (3). Thus the blame is placed squarely, although indirectly, on those “who are not primarily statisticians” (3) – who are, I would submit, the vast majority of statistics users. I would also venture to guess that the advent of the personal computer and the development of off-the-shelf commercial statistical software have made access to statistical tools relatively easy, and, as a result, have substantially increased the community of those “who are not primarily statisticians” – thereby making misuse and misinterpretation that much more likely. Statistics is a unique field in that its products are used mostly by non-statisticians. Anytime researchers, whatever their discipline, have collected quantitative data they are most likely to make use of the tools supplied by statistics. As Neyman said, back in 1955, statistics is the “servant of all sciences.” In effect, the ASA statement is deflecting responsibility away from its own discipline and telling non-statisticians: “There is nothing wrong with our tools. We know how to use them properly. It is you, non-statisticians, who misunderstand and misuse our products.”
Let’s look more closely at the way the ASA statement frames this controversy. To counter those in the “broader research community” (3) who question the very usefulness of statistics (“it’s got more flaws than I’ve had hot dinners” – a Rumpolian version of Tom Siegfried’s assertion for the reader who is not Facebook-savvy)4, it reiterates the key role statistics has in the production of scientific knowledge; but, more importantly, it identifies the individual user, who is not primarily a statistician, as a causal agent in this crisis. What are the implications? First, as already mentioned, it locates the problems outside the discipline of statistics: a) there is nothing wrong with the corpus of statistical knowledge; b) statisticians know how to use the tools of their trade, and know their limits. Some who are not primarily statisticians do not. Perhaps another delicately hidden message is that “the broader research community” (3) should rely more heavily on statisticians instead of trying to go it alone and making a mess of it.
The ASA statement tells us: “Statisticians and others have been sounding the alarm about these matters for decades, to little avail.” (5) So, are these statistics users deviants persisting in their deviant ways? After all, they don’t follow the statistical rules, and as a result they give statistics a bad name and impede scientific progress. No. There is a difference between breaching an ethical or moral norm and failing to follow technical prescriptions. Although the latter might give rise to chastisement and charges of incompetence, it does not bring about the moral indignation that the former would. This also explains the tone of the ASA statement: no “fire and brimstone”. Oftentimes, violations of ethical norms call for punishment. For example, if a researcher falsifies his data and is discovered, punitive action is likely to follow: his published paper will be retracted, he might be stripped of an academic title (e.g. “PhD”), he might lose his job, and he might even be sued in a court of law. Not so for the misuse or misinterpretation of a technical rule. It might cause the user some embarrassment, and she might be reprimanded; and if this misuse or misinterpretation happens to be published, it will be corrected, often quietly (i.e. without a corrigendum or erratum), at the first opportunity by a vigilant editor, as in the confidence interval example given earlier. Unlike the moral violator or deviant, the violator of a technical norm is not seen as somebody who does so willfully. The technical violator will likely be glad to mend his or her ways; the moral violator will have to be coerced into submission. The ASA has no enforcement power over honest “misuses and misinterpretations”, especially if these are perpetrated by non-members.
Perhaps, an additional implicit message from the ASA statement is that researchers outside the statistical community are not being trained in the proper use of the methodologies provided by the field of statistics. In other words, this has to do with showing the technical norms-violators the correct path: “changing the practice of science with regards to the use of statistical inference.” (5)
One important ingredient for a “problem” to get noticed (or, perhaps more precisely, for an issue to be elevated to the status of “problem” or “crisis”) and, eventually, acted upon, is for it to be covered by the mainstream media.5 The ASA statement says as much when it refers to the “highly visible [my emphasis] discussions over the last few years” (1). What it does not tell us is that until recently the controversy over NHST was confined within specific disciplines like psychology (a big consumer of statistics), sociology, epidemiology, etc., and, lest we forget, statistics. In other words, the disputes regarding that topic were happening largely behind closed doors, so to speak. But in this latest round, the controversy spilled into the general scientific media – it made it into journals that are the flagships of the scientific community: Science and Nature. As long as the NHST controversy was limited to the pages of the American Journal of Epidemiology, the Journal of Experimental Education, Quality & Quantity, or the American Sociologist, not to mention the pages of statistical journals, “nobody” really paid attention to it. But when it got splashed all over the pages of the very prestigious mainstream scientific publications mentioned, it could no longer be ignored. It became incumbent upon the ASA to intervene: it could not sit back and let the discipline be bandied about.
And then, there is data science – that topic in itself deserves a post. Data science is encroaching upon the territory traditionally occupied by the discipline of statistics. Its emergence is a rather recent phenomenon. It took many statisticians by surprise. Isn’t statistics the science of data? Statistics is often defined as the study of the methods for collecting, processing, analyzing, and interpreting quantitative data. Or more succinctly, as David Moore puts it: “Statistics is the science of learning from data.” In 2013, the president of the ASA asked in an editorial in the association’s monthly magazine: “Aren’t we data science?”6 Somehow some folks outside the field of statistics (e.g. computer science) discovered an area of data that they believed statistics could not deal with: big data! As big data and data science were commanding the attention of the Obama administration, major institutions (e.g. National Science Foundation, National Institutes of Health), and the media, traditional statisticians were being bypassed and ignored, and felt they were being left behind. Data science appeared to portray itself as an independent field, not as a specialty within the discipline of statistics, like biostatistics, for example. It seemed to be questioning the authority, and hence the legitimacy, of statistics. This threat to statistics’ customary bailiwick was taken seriously by the ASA leadership, and it responded quickly (although some ASA members would argue “not quick enough”) based on the principle “if you can’t beat them, join them.” Well, not quite; it was more like: let’s try to co-opt them into our fold (“bridging the ‘disconnect’”).7 Thus, ASA members, suddenly, saw the expression “data science” pop up all over the place. 
For example, the ASA journal that started publication in 2008 (before the “data science” craze) under the title Statistical Analysis and Data Mining has now been given (as of the beginning of this year) the subtitle “The ASA Data Science Journal”. Our sisters in statistics who attended a “Women in Statistics” conference two years ago will now (in 2016) attend a conference called “Women in Statistics and Data Science”, and may well wonder if there is not redundancy in this new title. Did the women in statistics forget to invite the women in data science back in 2014? Or did the women in statistics assume that “Women in Statistics” was an all-inclusive title (i.e. by definition women in statistics were doing data science, what else?)?
The reader may well ask: what does all this have to do with the price of fish? Or, more appropriately, what does the emergence of data science and big data have to do with the ASA’s statement on p-values? I hope the reader will forgive me for the platitude, but everything happens in a context. What I am suggesting is that the brouhaha about data science and big data is part of the wider context in which the field of statistics has had to withstand some serious probing. For example, two economists, back in 2008, wrote an entire book arguing that “Statistical significance is not a scientific test” (p.4).8 In 2015, the editors of one social science journal, Basic and Applied Social Psychology (BASP), let their readers and potential contributors know that both NHST and confidence intervals would be banned from their publication.9 In their 2016 editorial, the editors of the same periodical lament the fact that “many researchers continue to believe that p remains useful for various other reasons” (p.1).10 In other words, and in direct contradiction of the ASA statement, which would come out a month later, the BASP editors don’t believe the p-value to be a “useful statistical measure”. As a result of this worrying environment, the ASA has had to step in and defend its turf. Its statement is a declaration in defense of the integrity of the discipline and an affirmation of the organization’s jurisdiction over matters statistical.
In summary, I see the ASA statement as primarily a symbolic act that came about as a result of the hostility against statistics expressed within the scientific community in the past few years. More precisely, it is a symbolic act under the guise of being instrumental. The instrumental guise part of the statement consists in declaring that by reiterating principles widely accepted in the statistical community “the conduct or interpretation of quantitative science” could be improved. (8) The statement seeks to defend the integrity and the value of the discipline, and reaffirms its central role in the production of scientific knowledge; it reasserts the ASA’s authority over matters statistical; it establishes a clear boundary between statisticians and non-statisticians and asserts that the latter misuse the tools provided by the field of statistics, and, consequently, that they, not statistics, are one of the causes for the replicability and reproducibility crisis.
1 All numbers in parentheses refer to the pages of the document published in TAS: “ASA Statement on P-Values and Statistical Significance” (Ronald L. Wasserstein & Nicole A. Lazar (2016): “The ASA’s statement on p-values: context, process, and purpose”, The American Statistician). It can be accessed freely at the following page: http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108. To learn more about the ASA go to http://www.amstat.org/ASA/about/home.aspx.
2 “Reproducibility” refers to the ability to redo the original data analysis when given the data and the analytic procedures followed by the original researchers; a failure of reproducibility is the inability to do so.
3 Open Science Collaboration (corresponding author: Brian Nosek), “Estimating the reproducibility of psychological science”, Science, 28 August 2015, http://science.sciencemag.org/content/349/6251/aac4716.
4 Siegfried, T. (2014), “To make science better, watch out for statistical flaws,” ScienceNews, available at https://www.sciencenews.org/blog/context/make-science-better-watch-out-statistical-flaws.
5 By “mainstream media” I mean, as will become clear shortly, that of the scientific community, although the controversy was also given some press in the mainstream lay media.
6 Marie Davidian, in Amstat News, July 2013, pp. 3-5.
7 “The ASA and Big Data”, Nathaniel Schenker, Marie Davidian, and Robert Rodriguez, in Amstat News, June, 2013, p. 4.
8 Deirdre Nansen McCloskey and Steve Ziliak. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives; Ann Arbor, MI: University of Michigan Press, 2008.
9 David Trafimow & Michael Marks (2015) Editorial, Basic and Applied Social Psychology, 37:1, 1-2.
10 David Trafimow & Michael Marks (2016) Editorial, Basic and Applied Social Psychology, 38:1, 1-2.