Masks, Rorschach Science, and Pascal's Wager
Marc Green
Statistics, because they are numbers, appear to us to be cold, hard facts. It seems that they represent facts given to us by nature and it's just a matter of finding them. But it's important to remember that people gather statistics. People choose what to count, how to go about counting, which of the resulting numbers they will share with us, and which words they will use to describe and interpret those numbers. Statistics are not facts. They are interpretations.(Levitin, 2016)1
Were Shakespeare alive today, he probably would have written, "To mask or not to mask. That is the question." A recently published study commonly called DANMASK-19
2 supposedly provides new evidence on this question. It is a randomized controlled trial that examined the effect of wearing a mask on the chance of contracting Covid-19. DANMASK-19 provides a case study showing the complexities and difficulties of real-world research. It demonstrates why answering apparently straightforward questions ("Is this food/drug/activity good for me or bad for me?") is so difficult. It also shows why studies so often conflict and why they so often fail to replicate.
However, DANMASK-19, and the controversy surrounding it, also says much more about the role of science in society and even about society itself. It highlights the subjective nature of research interpretation. DANMASK-19 has something for everyone. It has been used by both pro-maskers and anti-maskers as supporting their preconceived beliefs. This is "Rorschach Science," where the science "consumer" projects his own beliefs and ideology on to the results, often never having read the paper. It further reveals how science has morphed into propaganda in an ideological battle.
Before starting, I want to clarify that this article is not aimed at arguing for or against mask wearing. To have an informed opinion, I would have to plow through a mountain of studies in order to review and evaluate the research. Even then, I suspect that the research might well prove equivocal for reasons exemplified by the DANMASK-19 study.
So I'll leave this task to those less easily bored or more ideologically motivated.
[Note Added: Fortunately, I don't have to review the research literature because Cochrane Review has. Their findings are published as Physical Interventions to Interrupt or Reduce the Spread of Respiratory Viruses, which is available online.]
Background
Before examining the study itself, it's worth taking an overview of epidemiological methods to see why DANMASK-19 was considered to be potentially important research. Historically, there are two primary methodologies for determining the effect of some variable on a population. One is "case-control," where the researcher mines already-existing data. For example, a researcher might look at the populations of two cities, one where laws required masks and another where mask-wearing was not required. The researcher would then run a statistical test to determine whether the two cities differed in Covid-19 infection rate. The standard comparison method is "null hypothesis significance testing" (NHST), a statistical technique meant to determine whether any difference between the control and the test groups is "real" and not due to chance.
However, case-control studies are prone to a number of problems. They are backward looking, so their ability to predict the future is always in question. They are also vulnerable to a wide array of sampling biases. At best, they provide evidence of correlation but not of causation. The biggest problem is often the existence of confounding factors. The result is not due to the variable being tested but to some other uncontrolled variable. For example, one study found that people who drank more red wine were healthier. We should all drink more red wine, right? Unfortunately, another study examined cash register receipts and found that people who drink more red wine also bought more fruits and vegetables. The important variable was probably not red wine but rather general diet, which was not controlled in the original research.
The problems of case-control studies are well known, so an alternative methodology, the "randomized controlled trial" (RCT), is considered the gold standard. Like a laboratory experiment, it is a forward-looking method. The researchers define a population and randomly assign each newly recruited subject to either a test or a control group. The random assignment should theoretically balance confounding factors across the two groups.
3 After a time period, the test and control groups are compared on some outcome measure using NHST. DANMASK-19 is noteworthy because it is the first large RCT study of mask effectiveness. As will become clear, however, the RCT method in practice is never quite as simple as it sounds.
The Method
I first briefly outline the method, skipping over some details. The interested reader can find the study online for more in-depth information. However, none of the details affect anything said below.
Subject selection is a key factor in the success of an RCT. DANMASK-19 issued a public call and recruited about 6,000 subjects through the media. This immediately raises questions about how well the study group represents the general population. Not everyone wants to be in such a study, especially given the requirement that they perform Covid-19 tests on themselves at the beginning and end of the test period. The authors themselves commented that "Participants may have been more cautious and focused on hygiene than the general population."
The masks used in the study also limit any conclusions. The test subjects wore "high quality surgical masks" with a filtration rate of 98 percent. These are far better than those worn by the typical real-world mask wearer. The question then arises as to whether the DANMASK-19 results would generalize to wearers of more typical masks. It might be expected that there would be more Covid-19 infection in people who wear the popular, low-quality blue paper masks and the like. The authors attempted to minimize the issue by citing studies that found no difference between N95 and surgical masks in the probability that healthcare workers would contract influenza. The relevance of these studies to Covid-19 and to the types of masks used by the general population is certainly not clear. The subjects were also told to change to a new mask after eight hours outdoors on any day. Few, if any, in the general population follow such a regimen. Lastly, subjects were given no instructions on handling the masks, an important factor in Covid-19 transmission. Ideally, the mask wearer would avoid touching the outside of the mask, where virus could be present, before touching the face.
Subjects did not always follow instructions. Unlike lab rats, their behavior could not be monitored and their adherence to the study protocol could not be enforced. The test subjects were instructed to wear the mask whenever outdoors. However, they varied in their adherence to this protocol. All subjects were further instructed how to perform several tests for Covid-19 infection. The most important tests were at the beginning of the study (to screen out those already infected) and at the end of the test period (to measure infection rate). The subjects also varied in their degree of self-testing. In both groups, some subjects failed to perform the required initial screen while 19 percent never submitted a completed test at the end of the study. To determine values for this 19 percent, the researchers resorted to "imputation," a method that fills in missing data values by extrapolating from the known to the unknown subjects. Imputation requires the assumption that the known and unknown subjects did not differ in any important way. The accuracy of this assumption is unknown and was not validated by the researchers. Bottom line: the researchers made up 19 percent of their data
4.
Each subject was supposed to use several different methods to test for Covid-19. The accuracy of the self-tests is uncertain. Statistical issues also introduced further uncertainties into the results. The test for antibodies used a method with 82.5 percent sensitivity and 99.5 percent specificity. Sensitivity is the likelihood that a person with Covid-19 would test positive, while specificity is the likelihood that a person without Covid-19 would test negative. The complement of specificity, here 0.5 percent, is the likelihood that a person without Covid-19 would test positive, i.e., be a false alarm. As a result, the probability of a positive test accurately showing a Covid-19 case is only 77 percent.
5 However, the confidence interval for specificity reached down to 98.7 percent. In this case, the probability of a positive test being true is 56 percent! Moreover, the PCR test, the gold standard, revealed no infections at all in either test subject group.
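The 77 and 56 percent figures follow from Bayes' rule. The short Python sketch below reproduces the arithmetic under the same 2 percent base rate assumed in endnote 5. The function and variable names are my own, chosen only for illustration.

```python
def positive_predictive_value(sensitivity, specificity, base_rate):
    """Probability that a positive test reflects a true infection (Bayes' rule)."""
    true_positives = sensitivity * base_rate
    false_positives = (1.0 - specificity) * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)


# Assumption: a 2 percent base rate, as in endnote 5.
BASE_RATE = 0.02

# Point estimate of specificity (99.5 percent).
ppv_point = positive_predictive_value(0.825, 0.995, BASE_RATE)

# Lower end of the specificity confidence interval (98.7 percent).
ppv_low = positive_predictive_value(0.825, 0.987, BASE_RATE)

print(f"{ppv_point:.3f}")  # 0.771
print(f"{ppv_low:.3f}")    # 0.564
```

Because so few people are infected, even a small false alarm rate produces a number of false positives comparable to the number of true positives. Raising `BASE_RATE` raises both values.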
The test period was theoretically 30 days long. However, Covid-19 can have a 14-day incubation period, so the test missed all infections that began in the second half of the period. The researchers toss this off, saying that there is no reason that the data from these later infections would differ from those that were earlier and were recorded. This doesn't change the fact that their actual test period was only 16 days, a very short window.
Lastly, the methodology had a fundamental problem - it wasn't blinded. The subjects knew which group they were in and that may have affected their behavior and the results. Of course, there was no way to avoid this problem since subjects obviously knew whether they were wearing a mask. This does not lessen the problem. It is easy to imagine that risk compensation occurred; those wearing a mask may have paid less heed to social distancing and other safety measures than the non-mask group. The researchers were well aware of this problem and mention it as another limitation of the study.
In sum, the research methodology was very loosely controlled. This was not due to researcher incompetence but is part of epidemiological research's DNA. The real world is messy and performing tightly controlled research is difficult. It is one reason why many apparently simple questions about whether a food/drug/activity is good or bad for you are so difficult to answer definitively. Worse, it is a major source of the "irreproducibility crisis" that has spread across the world of scientific research.
6
The Results
The "primary outcome"
7 was that 42 participants (1.8 percent) in the mask group and 53 (2.1 percent) in the control group reported positive tests. When subjected to NHST, the benefit of wearing a mask failed to reach statistical significance. However, the authors phrased the outcome somewhat differently, saying that the mask recommendation "did not reduce the SARS-CoV-2 infection rate among wearers by more than 50%." Apparently, even wearing a high-quality mask that was frequently changed failed the statistical test.
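To give a feel for what "failed to reach statistical significance" means here, the sketch below runs a textbook two-proportion z-test on the reported counts. This is not necessarily the analysis the authors used. The group sizes are not given in this article, so the values 2392 and 2470 are assumptions chosen to be consistent with the reported percentages and should be checked against the published paper.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


# Infection counts are from the text; group sizes are ASSUMED (see lead-in).
MASK_INFECTED, MASK_N = 42, 2392
CONTROL_INFECTED, CONTROL_N = 53, 2470

z, p_value = two_proportion_z_test(MASK_INFECTED, MASK_N, CONTROL_INFECTED, CONTROL_N)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

Under these assumptions the p-value lands well above the conventional 0.05 cutoff, so a difference of this size could easily arise by chance.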
There is little doubt that the researchers were disappointed by the result. No one performs a big, expensive research study with the aim of finding a nonsignificant result. Since the researchers failed to find a significant difference in the primary analysis, they performed "post hoc (not preplanned)" analyses. This is typical of researchers who are desperate to rescue a failed study. If you can't get a significant result on the whole population, then break it down into smaller pieces, perform multiple tests and hopefully find something, anything, significant, e.g., compare only subjects of similar age or gender. This technique of applying NHST post hoc is called "P-hacking," which the scientific world rightfully looks upon with scorn because it makes obtaining a significant result by chance more likely. In any event, all of these "not preplanned" tests failed the significance test. When you can't even P-hack your way to significance, your study is really a dud. However, full marks to the authors for being up front about their P-hacking. Most researchers try to hide it.
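Why do multiple tests make a chance "significant" result more likely? A rough sketch, assuming the tests are independent and each uses the conventional 0.05 cutoff (both assumptions are mine, for illustration only), shows how fast the odds of at least one false alarm grow with the number of tests.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is actually true."""
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (1, 5, 10, 20):
    print(f"{k:2d} tests: {familywise_error_rate(k):.2f}")
```

With 20 independent looks at the data, the chance of at least one spurious "significant" finding is closer to two in three than to one in twenty. Real subgroup tests are usually correlated, so the true figure would differ, but the direction of the effect is the same.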
The Interpretation(s)
DANMASK-19 failed to find a significant effect of mask wearing on contracting COVID-19. Therefore, the study proved that masks are ineffective. Mark one up for the anti-maskers!
Oh, wait.
The study could not prove any such thing. No study could. Even a first year statistics student knows that you can never prove the null hypothesis, i.e., that there is no difference. You can only fail to find support for H1. Moreover, a single research study is virtually never definitive in epidemiological research. Science is more about the accumulation of evidence than the results of any one study. One of the many unfortunate aspects of NHST is that it forces an arbitrary dichotomization (significant/not significant) on to the data.
However, there is an even bigger and more basic issue. The major putative benefit of wearing a mask is "source control," the prevention of a person from spreading the disease to others. It is not to prevent the wearer from contracting the disease, yet this is exactly what the study measured. It says absolutely nothing about source control, the presumed primary public health benefit of wearing a mask. The authors were well aware of this and commented, "The findings, however, should not be used to conclude that a recommendation for everyone to wear masks in the community would not be effective in reducing SARS-CoV-2 infections, because the trial did not test the role of masks in source control of SARS-CoV-2 infection." So let me get this straight: DANMASK-19 was testing the wrong outcome measure and the negative results did not alter the researchers' views on mask wearing. This can only leave the reader bewildered as to the purpose of the research.
Even the authors' support for masks failed to prevent some pro-maskers from criticizing the research. Some said that the study did not refute the notion that wearing a mask is beneficial. Instead the study was inconclusive despite its failure to find a significant effect. They attacked the research methodology on many of the points described above. They also noted that the study's statistical power only allowed an 80 percent chance of detecting a 50 percent infection reduction. Only a relatively large effect would be statistically significant.
8
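The 80 percent power figure can be roughly reproduced with a normal approximation. The sketch below is mine, not the authors' calculation. It assumes a 2 percent infection rate in the control group (the base rate used in endnote 5) and about 2,400 subjects per group, a round figure that is not stated in this article.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))


def approx_power(p_control, relative_reduction, n_per_group, z_crit=1.96):
    """Approximate power of a two-sided two-proportion test at the 0.05 level.

    Uses the normal approximation and ignores the negligible chance of a
    significant result in the wrong direction.
    """
    p_treated = p_control * (1 - relative_reduction)
    se = math.sqrt(
        p_control * (1 - p_control) / n_per_group
        + p_treated * (1 - p_treated) / n_per_group
    )
    return normal_cdf(abs(p_control - p_treated) / se - z_crit)


# ASSUMED inputs: 2 percent control infection rate, 2,400 subjects per group.
power_50 = approx_power(0.02, 0.50, 2400)  # a 50 percent reduction
power_20 = approx_power(0.02, 0.20, 2400)  # a more modest 20 percent reduction
print(f"{power_50:.2f} {power_20:.2f}")
```

Under these assumptions the power to detect a 50 percent reduction comes out near 80 percent, while the power to detect a 20 percent reduction is below one in four. A real but modest benefit would most likely have gone undetected.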
Other pro-masker scientists went even further, claiming that the study actually demonstrated mask effectiveness.
9 Although the difference was not statistically significant, they noted that 2.1 percent of the control group contracted the infection compared to a lower rate of 1.8 percent for the mask group. This difference suggests a benefit to mask wearers. However, this view ignored another finding buried in the data. Subjects who claimed to wear the mask "exactly as instructed" had an infection rate of 2.0 percent. This is higher than the 1.8 percent of the mask wearers in general, many of whom wore the mask intermittently (and almost the same as the non-mask group). The difference is small so it may just be variability. At the very least, however, the lack of benefit for the more diligent mask wearers is evidence against mask effectiveness as measured in the study.
The pro-masker criticisms have some justification, but their attitude raises some questions. Yes, the study was loosely controlled, but this is typical of large-scale clinical trials. The 80 percent statistical power was actually quite good for such a study. Yes, there is plenty to criticize about NHST, and yes, the study was not designed to measure source control. The DANMASK-19 researchers played by the rules of the NHST game, however, and the pro-maskers, not liking the results, decided to change the game. They decried the outcome by claiming that the sample was too small and/or by ignoring NHST and accepting a mask benefit that wasn't statistically significant.
The dispute over DANMASK-19 raises several larger questions that go far beyond the study itself. You have to wonder whether there would have been the same attack on DANMASK-19 if the results had shown masks being effective. People seldom accept evidence that conflicts with prior belief.
10 Is it hypocritical to accept NHST-vetted research only if it agrees with preconceived notions? Why bother performing research if it can only support prior belief? Should any possible health measure be enacted even if it has not been vetted by NHST? In short, should we use "Pascal's wager" instead of science to make health and safety decisions?
Mathematician Blaise Pascal (1623-1662) sought a rational method to determine whether he should believe in god. He reasoned that the cost of saying "yes" (believing there is a god when there really isn't) and being wrong is small while the cost of saying "no" (believing there isn't a god when there really is) and being wrong is huge - the eternal fires of damnation. Pascal argued that given uncertainty the rational choice is to wager on "yes" because of the great disparity in the consequences of a wrong "yes" (false alarm) compared to a wrong "no" (miss).
11 Likewise, Pascal would say that even with the equivocal evidence the rational choice is to wear a mask: the cost of wearing one unnecessarily is small but the cost of not wearing one could be great. There is no need for NHST or science at all! Just look at the relative outcomes.
This seems like intelligent advice, but only at first glance. Accepting Pascal's wager creates the proverbial slippery slope. It could be used to justify any rule or regulation in the name of public safety. The result would be implementation of useless and costly public safety measures that strangle society in regulations. In our increasingly safety-obsessed culture, this threat is all too real. If you don't believe me, just try opening a bag of peanuts at a daycare center.
However, none of this explains why DANMASK-19 has become so controversial. In fact, the vitriolic debate started even before publication. Preliminary results made public on the internet created a social media storm. The study was initially turned down by two journals. Some suggested that the pro-mask lobby may have tried to prevent its publication due to its findings rather than to its science.
12
The turmoil makes clear that DANMASK-19 has revealed fundamental issues about science and its role in society. The study has pitted two ideological groups against one another. Wearing a mask is a minor inconvenience to some, but to others, the cost is symbolic of something greater. One side sees it as the latest example of a continuing trend - a powerful elite of "experts" creating more rules that control individuals and restrict freedom. This faction seeks support by interpreting DANMASK-19 as saying that masks are ineffective, which proves that the motivation is control, not safety. Science is not really the issue. The "experts," in turn, are angry at having their authority and power questioned while others are upset at resistance to their ultimate goal of collectivist uniformity. They seek support for the view that masks are effective, again not for scientific reasons. Instead, they want to label those who oppose masks as anti-science and hence irrational and to prove their moral superiority. Given the reproducibility crisis and the fuzzy controls of epidemiological research, however, they are on very thin ice chastising people for being skeptical of science.
13 While science is the best known method for knowing the world, it is far from objective, perfect or immutable, although it is generally portrayed that way. The Levitin (2016) quote above captures the reality. Even the data themselves, to some extent, are already interpretations. This problem is greatest in biological science because it is the most complex and deals with the messiest world. This leaves the most room for biases to slant the data and conclusions.
In sum, both sides are more interested in being right than in the science per se. They accept science that strengthens their beliefs and attack or re-interpret science that does not. Science has become a Rorschach test where research consumers simply project their prior beliefs on to the inkblots.
14 The controversy surrounding DANMASK-19 shows the degree to which science has degenerated into a tool for propaganda in the wider culture war.
Summary
DANMASK-19 shows the difficulty of answering apparently straightforward questions about health and safety. Despite its flaws, DANMASK-19 is hardly unique in being loosely controlled, as that is the nature of real-world research. The interpretation of results varies over an amazingly wide range. DANMASK-19 says more about society than it does about whether to wear a mask. Like most epidemiological studies, it was so loosely controlled that it provided the inkblots for special interests to project their interpretations on to the data. The dispute over DANMASK-19 is not about science but rather about values.
If there is a lesson from DANMASK-19 it is this: don't just accept anyone else's interpretation of the data. Avoid the temptation to simply accept interpretations (and data) if they agree with your preexisting beliefs or ideological stance. Read the study very closely and judge for yourself. If you aren't trained to critique research studies, then just be skeptical and don't rely on any one study. If the study is controversial, ask the question "Cui bono?"
Endnotes
1Levitin, D. (2016). A Field Guide to Lies and Statistics: A Neuroscientist on How to Make Sense of a Complex World.
2Bundgaard, H., Bundgaard, J. S., Raaschou-Pedersen, D. E. T., von Buchwald, C., Todsen, T., Norsk, J. B., ... & Iversen, K. (2021). Effectiveness of adding a mask recommendation to other public health measures to prevent SARS-CoV-2 infection in Danish mask wearers: a randomized controlled trial. Annals of Internal Medicine, 174(3), 335-343.
3This is an oversimplification. Some studies use more sophisticated methods such as pairing each test subject with a matching control.
4The researchers said that the imputed data did not change the results.
5This is calculated by assuming a 2 percent base rate. Out of a thousand people selected at random, .02*1000 or 20 would have Covid-19. The test detects .825*20=16.5 of these cases. The other 980 test positive at a rate of 1-.995=.005, so there are .005*980=4.9 false alarms. A positive test then has a probability of 16.5/(16.5+4.9)=0.771 of being a true positive. However, a higher assumed base rate would produce better accuracy.
6It is not only real-world research that frequently fails to replicate. Lab studies are sometimes little better. For an overview, see the video "Is Most Published Research Wrong?". To learn about the extent of the crisis in biomedical science, see Harris, R. (2017). Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions. For a similar discussion focused more on psychological sciences, see Ritchie, S. (2020). Science Fictions: How Fraud, Bias, Negligence, and Hype Undermine the Search for Truth. And strange as it may sound, even mathematics has a replication problem. See A Replication Crisis in Mathematics?
7The subject needed only show positive on any of the tests to be considered infected.
8The study's confidence interval was large, going from a 46 percent reduction to a 23 percent increase in infection among mask wearers.
9For an example, see "Covid-19: controversial trial may actually show that masks protect the wearer."
10For example, despite absence of any compelling evidence of a correlation between salt intake and hypertension in the normal population, many "authorities" such as the CDC continue to recommend absurdly low salt diets.
11Pascal's wager is immediately recognizable as a precursor to modern Signal Detection Theory.
12See The Curious Case of the Danish Mask Study. Perhaps even worse, the authors may have had trouble getting the study published because it was a negative result, which journals generally do not like to publish. This results in "publication bias," which causes an overestimation of the strength of positive findings because negative results are suppressed.
13Indeed, self-proclaimed skeptics like Michael Shermer are skeptical about everything but science.
14Studies better controlled than DANMASK-19 afford less opportunity for projecting prior belief. But there are few, if any, studies that don't begin to lose shape under close scrutiny. The amount of scrutiny given a study is another form of bias.