Building Intuition Regarding the Statistical Behavior of Mass Medical Testing Programs

The quality of medical decision-making and public health planning alike depends directly upon understanding the accuracy of medical tests, especially during a pandemic. But the statistical concepts and measures used to assess test accuracy can be confusing. Why is there not one single definitive measure of test accuracy? How much should individuals worry about spreading COVID-19 if their test results are negative? What do sensitivity, specificity, false positive results, false negative results, and positive predictive value mean relative to each other? In this tutorial, we clarify the meaning of these terms in intuitive ways via visual illustrations, and explain how these terms are all connected to one another through Bayes’ theorem. We show how to use the relationships in that theorem to assess personal risk when large numbers of people are being tested. We illustrate as well the extent to which the accuracy of large numbers of tests depends on the proportion of those tested who have the disease. Overall, we aim to heighten a general intuition regarding the performance of mass medical testing campaigns. Here, toward that end, we review different ways to measure the accuracy of diagnostic tests with reference to pandemic-specific examples.


Introduction: The Vocabulary of Outbreaks
The COVID-19 pandemic caused a massive spike in broad-scale, government-implemented diagnostic testing of large numbers of people. Researchers and officials have used the data generated by such testing programs to highlight two important numbers: namely, those showing the incidence and prevalence of the disease. The incidence describes how many new cases occurred in a recent time period, for example, the past week. The prevalence describes how many people are currently infected. Prevalence may be larger than incidence because it includes not only new cases from the past week but also individuals who were infected earlier and have not yet recovered.

Important though these two numbers are, however, they are only part of a larger story that can be difficult for researchers, health workers, and the general public to fully understand. Broad-scale testing programs are challenging to plan and implement, and their results can be challenging to interpret. To begin with, the task of making sense of test reports at the individual, regional, national, and international levels requires connecting different types of data, different types of tests, and different assessments of test performance. Beyond that, the terminology used in this context can present obstacles. For example, press reports detailing the results of broad-scale testing programs often rely on words such as 'sensitivity' and 'specificity,' which connote 'accuracy' in everyday usage. But in epidemiologic reports that describe the performance of testing programs, sensitivity and specificity have different, specialized meanings. Hence, sorting through statements regarding test performance in varied media can be difficult, especially when seemingly counterintuitive results are quoted without much explanation. Two recent examples vividly demonstrate this challenge:

"You might think any test with 95 percent sensitivity and 95 percent specificity would be highly accurate. But while these would be great grades on an organic chemistry final, the ability of such a test to render a reliable result is extremely poor: 50 percent of the positive results would not be true positives, Dr. Osterholm said. (You'll have to take my word for this--explaining the statistics would require half a column!)" (Brody, 2020)

"The CDC explains why testing can be wrong so often. A lot has to do with how common the virus is in the population being tested. 'For example, in a population where the prevalence is 5%, a test with 90% sensitivity and 95% specificity will yield a positive predictive value of 49%. In other words, less than half of those testing positive will truly have antibodies,' the CDC said. ... Alternatively, the same test in a population with an antibody prevalence exceeding 52% will yield a positive predictive value greater than 95%, meaning that less than one in 20 people testing positive will have a false positive test result." (CNN Wire, 2020)

Both press reports make true but confusing statements, and implicitly prompt questions. How can a test with two measures of performance at or above 90% be wrong more than half the time? Constraints on space and scope in popular media rarely allow writers and readers to develop insight or build intuition into why these seemingly counterintuitive associations occur. It is no wonder that nonexperts may end up confused. Indeed, even statisticians and epidemiologists familiar with terms such as sensitivity and specificity often struggle to explain them to colleagues and friends.
Here, we aim to provide definitions and examples to illustrate the relationships between epidemiologic terms such as sensitivity, specificity, and prevalence. These relationships are important for the design and assessment of testing strategies, programs, and protocols. Building intuition around how and why certain strategies work well (or not) is critical for monitoring the ongoing pandemic, communicating results on current status, and targeting vaccine rollout to areas of high incidence and prevalence. Misunderstanding or miscommunicating testing results can result in inefficiencies at best and increased mortality at worst.
In considering the examples below, we suggest keeping two key questions in mind regarding any reported percentage: 'What question does that percentage answer?' and 'From what population is the percentage taken?' We consider first the impact of decisions regarding who is tested on the numbers of cases reported. Next, we consider the main qualities of an 'accurate' test (that is, sensitivity and specificity), and explain how and why these qualities are related.

Patterns in Data Reflect Patterns of Disease and Patterns of Testing
Through daily reports on the number of individuals testing positive for COVID-19, public health surveillance provides critical information regarding which absolute counts of infection, hospitalization, and death are increasing, which are leveling off, and which are declining. Surveillance data typically come from a variety of sources, including testing, hospital records, and mortality records. The regional and temporal patterns observed in combined surveillance data pertain not only to infection but also to varied modes of reporting and testing, such as what types of data are included, which individuals are tested, how quickly regions and hospitals report data, and whether the separate data elements cover similar time periods.
When an individual is tested for an infectious disease, the test result typically determines whether a person has or does not have evidence of infection. Some diagnostic tests report whether the individual is currently infected, and others test for antibodies or other clues that the individual was infected in the past. When we add up results for individuals at the regional or national levels, it is important to consider the patterns of testing and of the testing results. Early in the pandemic, when resources were scarce, testing focused on individuals with symptoms, those with known contacts with infected individuals, and frontline health care workers.
Patterns of tested individuals within a given city/region/state/country reflected local decisions and priorities for testing: specifically, which tests were most important for making critical decisions at that point in time in that region? For example, when local tests were limited and the policy goal was to identify the number of severe cases that then currently required or would soon require hospitalization, the testing of asymptomatic individuals may have been viewed as less of a priority than the rapid identification of severe cases and contact tracing.
In this scenario of limited testing, the (appropriate) focus of tests on those with severe symptoms can result in selection bias if the proportion of positive tests among those tested is used as a simple estimate of the proportion of the population currently infected. As an example, suppose a clinic has 10 remaining test kits and 20 people waiting to be tested, seven of whom currently have a high fever and are short of breath (symptoms suggestive of COVID-19). For simplicity, suppose the other 13 patients are not infected. In order to prioritize treatment for the sickest individuals, we would assign seven of the 10 test kits to the seven symptomatic individuals and then test three of the remaining 13. If seven of the 10 tests were positive for COVID-19, the percent testing positive would be 70% (seven positives out of 10 tests); however, only seven out of 20 people in line (35%) were actually infected. In this example, the individuals with severe symptoms were (again, appropriately at the time) more likely to be tested and more likely to test positive, yielding a proportion of positive tests much higher than the proportion of infected individuals. Here, as elsewhere, accurate understanding and reporting of the proportions of positive tests would require reporting the number of positives out of the number tested, along with the priorities by which individuals were chosen for testing.
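The arithmetic of this clinic example can be made concrete in a few lines of code. The following is a minimal sketch in Python; for simplicity it assumes a perfectly accurate test, so every infected person tests positive.

```python
# Clinic example: 20 people in line, 7 infected and symptomatic, 13 uninfected,
# and only 10 test kits. Assuming a perfectly accurate test for simplicity.

people_in_line = 20
infected = 7        # symptomatic individuals, prioritized for testing
test_kits = 10

tested_infected = infected                       # 7 kits go to the symptomatic
tested_uninfected = test_kits - tested_infected  # 3 kits go to the uninfected

positivity_among_tested = tested_infected / test_kits   # 7/10
true_proportion_infected = infected / people_in_line    # 7/20

print(f"Positivity among tested:  {positivity_among_tested:.0%}")   # 70%
print(f"True proportion infected: {true_proportion_infected:.0%}")  # 35%
```

The gap between 70% and 35% is entirely a product of who was chosen for testing, not of any inaccuracy in the test itself.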
As tests became more available over the course of the pandemic, policy goals shifted to assessing what proportion of the regional population was infected at a given time (the current prevalence of the disease, including both new and ongoing infections). Generally speaking, to avoid selection bias, the most accurate estimates of current prevalence involve testing a random sample of the population containing both infected and uninfected individuals. The proportion of sampled individuals who are infected then accurately reflects the proportion of infected individuals in the population (the prevalence). Very few efforts were put in place to develop national- or state-level sampling-based estimates of infection prevalence. Instead, as tests became more widely available, the goal shifted from prioritized testing of individuals in a resource-limited environment to testing everyone in particular settings (e.g., on university campuses, in the state of Georgia, in the U.S. military). Some public health researchers extended these ideas and made even broader proposals for the mass application of rapid tests. At the national or international level, these changing goals led to a global map of information influenced by different patterns of testing in different areas. In order to facilitate accurate interpretation of changing local prevalence values over time, analysts, decision makers, and the interested public need to understand these shifting priorities for testing and data collection over the course of the pandemic.
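To make the random-sampling idea concrete, here is a minimal sketch of prevalence estimation from a simple random sample. The sample size and positive count below are hypothetical, and the interval is a standard Wald approximation rather than anything prescribed by the text.

```python
# Prevalence estimation from a simple random sample: the sample positivity is
# an unbiased estimate of prevalence, and a binomial confidence interval
# quantifies its precision. Numbers here are hypothetical.
import math

n_sampled = 1000   # hypothetical random sample from the population
n_positive = 42    # hypothetical number testing positive

p_hat = n_positive / n_sampled                    # estimated prevalence
se = math.sqrt(p_hat * (1 - p_hat) / n_sampled)   # standard error
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se     # approximate 95% interval

print(f"Estimated prevalence: {p_hat:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

Because every person is equally likely to be sampled, no reweighting for testing priorities is needed; this is exactly the selection bias that prioritized testing introduces.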

What Are the Qualities of an 'Accurate' Test?
Diagnostic tests are not perfect. Some uninfected people will have positive test results.
Such results are referred to as 'false positive results' or 'false positives.' In addition, some infected people will have negative test results, referred to as 'false negatives.' In this context, we can observe one of four outcomes. Two of these are correct outcomes: a true positive (an infected person correctly tests positive) and a true negative (an uninfected person correctly tests negative). The other two possibilities reflect different types of incorrect outcomes: a false positive (an uninfected person tests positive) and a false negative (an infected person tests negative). An 'accurate' test will have a high proportion of correct outcomes and low proportions of each type of incorrect outcome. But in practice, a low proportion of false positives does not necessarily mean a low proportion of false negatives. As an extreme example, a test that simply declared everyone negative would produce no false positives at all, yet it would miss every infected individual. Bayes' theorem is often presented as a 'gotcha' homework problem in introductory biostatistics and epidemiology classes to show students how some very common intuitions about probability are actually very wrong. Here, rather than present the theorem first as an equation to explain associations between different measures of test accuracy, we will instead begin by illustrating the associations through examples and then summarize them with the theorem.
Probability abounds with confusing examples and seeming 'paradoxes,' which is why gambling is a profitable venture for the house. Many of these paradoxes result from a difference between what we want to happen (for example, 'I think this slot machine is due for a payout') and what the randomization mechanism driving outcomes actually generates (each spin is equally likely to pay out). The setting of mass testing is no different. We want a test with good properties in one arena to be good in another, and as a result, we often attempt to attribute good results in one subset of performance measures to the rest.

Illustrations With COVID-19 Testing
To better understand different measures of test performance and their interrelationships, we consider the example of two broad types of testing for COVID-19. The first type of test identifies an active infection by extracting RNA, converting it into complementary DNA, and assessing whether that DNA (but no other) can be amplified using a technique called real-time polymerase chain reaction. A second type of test seeks biological markers of immunity due to prior exposure rather than an active infection by assessing the presence of antibodies to the virus. The latter test (often referred to as the 'antibody test') led some federal and state government officials to propose that mass antibody testing could identify those who have been infected and are presumably immune and safe to return to work (Mukherjee, 2020; see also the CDC guidance on antibody testing). In short, the RNA test determines whether the individual is currently infected, and the antibody test determines whether an individual has been infected in the past. Additional rapid testing approaches continue to be under active development. For ease of description in the sections below, we focus on the antibody test, although the concepts apply to tests of current infection as well.
The first antibody test approved for emergency use by the Food and Drug Administration (FDA) was made by Cellex. Cellex's test reports a sensitivity of 93.8% and a specificity of 95.6%. We use these values to define and interpret these two summaries of test performance.
In the definitions below, we consider several different proportions of different subpopulations. The key to keeping the definitions straight is tracking which subpopulations go with which measures of performance. We note that this is easier said than done.
To begin, the sensitivity of the antibody test is the probability that a person tests positive given that they have COVID-19 antibodies, which can be written as a conditional probability: sensitivity = Pr[T+ | D+], where we read 'T+ | D+' as 'an individual tests positive (T+) given that they indeed are disease positive (D+),' that is, they have antibodies. Its complement, Pr[T- | D+] = 1 - sensitivity, represents the proportion of tested individuals with antibodies who falsely test negative; these individuals would falsely presume that they are still susceptible and take unnecessary care not to become infected (e.g., stay out of work when in fact they could safely return, if long-term immunity holds).
Next, the specificity is the probability that a person tests negative (T-) given that they are disease negative (D-), that is, they do not have antibodies: specificity = Pr[T- | D-]. Its complement, Pr[T+ | D-] = 1 - specificity, represents the proportion of tested individuals without antibodies who nonetheless receive a (false) positive result. To clarify these relationships, we summarize these terms, their definitions, and probability notation in Table 1, where we read Pr[A|B] as 'the probability of event A given that event B holds.'

Table 1. Measures of test performance, their definitions, and probability notation.

  Sensitivity                  Probability that a person with antibodies tests positive      Pr[T+ | D+]
  Specificity                  Probability that a person without antibodies tests negative   Pr[T- | D-]
  Prevalence                   Proportion of those tested who have antibodies                Pr[D+]
  Positive predictive value    Probability that a person testing positive has antibodies     Pr[D+ | T+]

The final quantity we need is the prevalence: the proportion of individuals who currently have antibodies. In a mass testing situation, the prevalence is defined to be the proportion among those tested. This may be different from the proportion of the entire population with antibodies due to testing priorities and strategies. This clarification is necessary since, as discussed above, COVID-19 tests are allocated based on testing priorities shaped by local resources, availability, and access.
The prevalence of a disease in the tested population affects the associations between sensitivity, specificity, and positive predictive value. While we typically think of sensitivity and specificity as properties of a particular test, in a large-scale mass testing setting their relationships to each other change with the background prevalence of disease. These relationships change because prevalence changes the relative sizes of the subpopulations defining each measure of test performance.
To illustrate this connection, note that in Figure 1 we earlier assumed a prevalence of 50% in those tested (we had 10 red heads with antibodies and 10 blue heads without antibodies). Now, consider Figure 2, where we apply the same test (80% sensitivity and 90% specificity) to a tested population with a prevalence of 16.7% (10 red heads with antibodies and 50 blue heads without antibodies). As before, we first arrange people by disease status (people with antibodies together and people without antibodies together), then we have the same people rearrange themselves by test result (positive tests together and negative tests together).
When we rearrange by test result, we find that the positive predictive value has changed from 88.9% in Figure 1 to 61.5% (only 8 out of 13 positive tests occur in people with antibodies). As we test more people without antibodies, the number of false positive test results begins to increase. The percentage of positive tests among individuals without antibodies remains the same, but since we are testing many more people without antibodies, the number of such false positive results increases. If we were to test even more people without antibodies (say, 100, 1,000, or 1,000,000) but still tested 10 people with antibodies (that is, if the prevalence of antibodies in the tested population decreases), at some point we could (and would) observe more false positives (positive test results in people without antibodies) than true positives (positive test results in people with antibodies)!

Figures 1 and 2 suggest that there is a specific relationship linking sensitivity, specificity, positive predictive value, and prevalence of the disease within the tested population. The figures suggest that the key to understanding the relationship among these quantities depends on how we group people into subpopulations, the relative sizes of these subpopulations, and the proportions of incorrect tests within each subpopulation. We next use three separate but related descriptions, ranging from conceptual to mathematical, to illustrate these relationships in more detail. Each example provides a bit more insight into the interrelationships and their relevance to better design, implementation, and understanding of the results of large-scale diagnostic testing.
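The head-counting in Figures 1 and 2 can be reproduced directly from expected counts of true and false positives. The sketch below does so in Python; the function name is ours, not from the original figures.

```python
# Positive predictive value from expected counts: true positives divided by
# all positives (true plus false).

def ppv_from_counts(with_antibodies, without_antibodies, sensitivity, specificity):
    true_positives = sensitivity * with_antibodies
    false_positives = (1 - specificity) * without_antibodies
    return true_positives / (true_positives + false_positives)

# Figure 1: 10 with antibodies, 10 without (50% prevalence among tested)
print(f"{ppv_from_counts(10, 10, 0.80, 0.90):.1%}")   # 88.9% (8 of 9 positives)

# Figure 2: 10 with antibodies, 50 without (16.7% prevalence among tested)
print(f"{ppv_from_counts(10, 50, 0.80, 0.90):.1%}")   # 61.5% (8 of 13 positives)
```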

Illustration 2: Tracking Possible Outcomes
As a next step, we move away from the simplified settings of Figures 1 and 2. Our goal is to track all of the possible outcomes of testing that could happen and arrange these in a 'tree' or flowchart of possible events (Figure 3). The branches on the flowchart provide more detail on subgrouping than we showed in Figures 1 and 2. With a prevalence of 1% among those tested, a sensitivity of 93.8%, and a specificity of 95.6%, the positive predictive value, Pr[D+ | T+], is only 0.18 (Figure 3C). Similarly, the false discovery rate, Pr[D- | T+], is 0.82 (Figure 3D). Thus, at the 1% prevalence level, only 18% of people who test positive will actually be positive, and 82% of people who test positive and think they have antibodies will be wrong!

Figure 3. The expected distribution of 10,000 people through this tree of events when prevalence is 1%, sensitivity is 93.8%, and specificity is 95.6% (based on specifications of the first antibody test to achieve Food and Drug Administration [FDA] emergency use authorization). The probabilities and numbers can be used to calculate (C) the positive predictive value and (D) the false discovery rate.

The tree reinforces the suggestion from Figures 1 and 2 that the background prevalence, Pr[D+], is essential to understanding the performance of a mass testing campaign.
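The tree in Figure 3 amounts to pushing an expected cohort through four branches. The following sketch reproduces the 18% positive predictive value and 82% false discovery rate, using expected (fractional) counts rather than whole people.

```python
# Tree of outcomes for 10,000 people at 1% prevalence, using the Cellex test
# specifications (93.8% sensitivity, 95.6% specificity).

n = 10_000
prevalence, sensitivity, specificity = 0.01, 0.938, 0.956

with_antibodies = n * prevalence             # 100 people
without_antibodies = n * (1 - prevalence)    # 9,900 people

true_pos = with_antibodies * sensitivity             # ~93.8
false_neg = with_antibodies * (1 - sensitivity)      # ~6.2
true_neg = without_antibodies * specificity          # ~9,464.4
false_pos = without_antibodies * (1 - specificity)   # ~435.6

ppv = true_pos / (true_pos + false_pos)   # positive predictive value
fdr = 1 - ppv                             # false discovery rate

print(f"PPV: {ppv:.0%}, FDR: {fdr:.0%}")  # PPV: 18%, FDR: 82%
```

Note that the false positives (about 436) swamp the true positives (about 94) simply because the uninfected branch of the tree contains 99 times as many people.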
Why would this matter? For some diseases, a false positive result simply means a healthy person receives an unnecessary treatment. If the treatment is not too toxic, then this is not particularly consequential. However, in the case of COVID-19, where prevalence among the tested was small early in the pandemic, the association between the different test criteria can be critical. If the prevalence of the disease among those tested is very small, the majority of people who test positive for antibodies will falsely think that they are immune. Under this mistaken assumption, they may leave themselves open to the risk of severe disease and death, as well as help maintain transmission among the general public.
In order to provide insight into how decisions regarding whom to test can influence the performance of a mass testing strategy, we recall that the prevalence represents the proportion of individuals tested who have antibodies. If individuals are selected from the general public in a random sample (such that every person is equally likely to be tested), then the prevalence in the tested population will be a close approximation to the overall prevalence of antibodies in the general population. If, however, we focus testing on individuals who have had close contact with infected individuals, the prevalence in the tested population will be higher than the prevalence in the general population, and our positive predictive value will improve. While we may not know the precise prevalence in the general population, we can exercise some control over the prevalence in the tested population by focusing tests on higher risk individuals.
Next, we examine how sensitivity and specificity affect the relationship between prevalence among those tested and the positive predictive value of a test.

Illustration 3: How Prevalence Changes Positive Predictive Value
To clarify the impact of prevalence on the positive predictive value, we extend the COVID-19 example used in Figure 3 for prevalence ranging from 0% to 100%. In Figure 4, the positive predictive value, Pr[D+ | T+], increases with prevalence but remains almost completely unchanged as we increase sensitivity from 93.8% to 95%, 96%, 97%, 98%, and 99% (overlapping red lines in Figure 4B). The blue lines in Figure 4C, however, show that changes in specificity from 95.6% to 96%, 97%, 98%, and 99% lead to substantially higher positive predictive values at any given prevalence value.
This example demonstrates that we can maintain a better positive predictive value for the tested population if: (1) we focus a testing program on those at higher risk (that is, we aim to have a higher prevalence of antibodies among the tested individuals than we would if we tested at random), and (2) we increase test specificity (that is, if we have a choice between different tests, we choose the one with higher specificity). The plots also show that the positive predictive value is relatively unaffected by changes in test sensitivity.

Figure 4. The relationship between the prevalence of antibodies among those tested and the positive predictive value. In (A), we see increasing positive predictive value with increasing prevalence (prevalence of 1% to 5% indicated by vertical dashed lines). In (B), we see the minimal impact of increasing sensitivity from 95% to 99% (red lines). In (C), we see the improvement in positive predictive value associated with increasing specificity from 96% to 99% (blue lines).
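To make the asymmetry in Figures 4B and 4C concrete, the sketch below evaluates the positive predictive value at a fixed, illustrative 5% prevalence while varying sensitivity and specificity separately; the specific grid of values is ours.

```python
# At fixed prevalence, raising sensitivity barely moves the positive
# predictive value, while raising specificity improves it substantially.

def ppv(prevalence, sensitivity, specificity):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

prev = 0.05
for sens in (0.938, 0.95, 0.97, 0.99):    # vary sensitivity, hold specificity
    print(f"sens={sens:.3f}, spec=0.956 -> PPV {ppv(prev, sens, 0.956):.1%}")
for spec in (0.956, 0.97, 0.98, 0.99):    # vary specificity, hold sensitivity
    print(f"sens=0.938, spec={spec:.3f} -> PPV {ppv(prev, 0.938, spec):.1%}")
```

Running this shows the sensitivity sweep moving the positive predictive value by only about one percentage point, while the specificity sweep moves it by roughly thirty: false positives come from the large uninfected subpopulation, which specificity governs.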
The curves in Figure 4 further stress the critical role of prevalence in assessing the performance of a mass testing campaign. Ongoing, sampling-based prevalence estimates can (and should) provide improved situational awareness of the proportion of the population currently infected (for PCR tests) or infected sometime in the past (for antibody tests). However, even in the absence of such surveillance-based estimates of the current level of COVID-19 prevalence at the national or local level, Figure 4 still provides important information for planning and setting performance expectations for population-level testing, because it illustrates the impact of changes in sensitivity and specificity at any value of prevalence. For example, if the best information suggests a prevalence (current or past) of 5% to 10%, we can examine the relationships in Figure 4 across this range to gain insight into the performance of the testing program.

The Formulas Behind the Figures
As mentioned above, the probability tool defining the relationships between our testing performance measures is Bayes' theorem, best understood as a consequence of the mathematical definition of conditional probability. First notice that the joint probability of a positive test and the presence of antibodies can be factored in two ways:

Pr[T+ and D+] = Pr[T+ | D+] Pr[D+] = Pr[D+ | T+] Pr[T+].

Dividing through by Pr[T+] yields Bayes' theorem for the positive predictive value:

Pr[D+ | T+] = Pr[T+ | D+] Pr[D+] / Pr[T+].

The numerator is the sensitivity multiplied by the prevalence among those tested. The denominator, Pr[T+], is the probability that a randomly selected person receives a positive test, which can be calculated as the probability of a positive test for the people who have antibodies plus the probability of a positive test for the people who do not. In the notation above, this can be written as

Pr[T+] = Pr[T+ | D+] Pr[D+] + Pr[T+ | D-] Pr[D-],

that is, sensitivity x prevalence + (1 - specificity) x (1 - prevalence).
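These formulas translate directly into code. The sketch below implements the positive predictive value exactly as derived above and checks it against the CDC example quoted in the introduction (5% prevalence, 90% sensitivity, 95% specificity, yielding roughly 49%).

```python
# Bayes' theorem for the positive predictive value, Pr[D+ | T+].

def positive_predictive_value(prevalence, sensitivity, specificity):
    # Denominator Pr[T+] via the law of total probability
    p_test_pos = (sensitivity * prevalence
                  + (1 - specificity) * (1 - prevalence))
    # Bayes' theorem: Pr[D+ | T+] = Pr[T+ | D+] Pr[D+] / Pr[T+]
    return sensitivity * prevalence / p_test_pos

# CDC example from the introduction: 5% prevalence, 90% sensitivity,
# 95% specificity.
print(f"{positive_predictive_value(0.05, 0.90, 0.95):.0%}")   # ~49%
```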

Discussion
The descriptions above illustrate concepts, graphs, and formulas summarizing the challenges of different but related measures of diagnostic performance in a large-scale, mass diagnostic testing program. Given these challenges, we must take care when reading, writing, and interpreting descriptions of testing systems to frame issues in terms of the specific questions answered and to consider how sensitivity and specificity relate to the expected proportions of results in different subpopulations.
Claims, reports, and manuscripts that deal with testing systems must be considered in this light. The impact of prevalence on mass testing programs is sometimes referred to as the 'base-rate fallacy' (where the base rate corresponds to the prevalence among those tested) and has a long history in the diagnostic testing literature (Bar-Hillel, 1980).
When designing a mass testing system, we can protect against this problem to some degree by choosing tests with higher specificity or by opting to test a greater proportion of those likely to have the disease (for example, testing those with confirmed contacts with infected individuals), thereby yielding a higher prevalence among tested individuals (Service, 2020; Watson & Whiting, 2020; Woloshin et al., 2020). Other adjustments in practice include multistage testing, wherein all positive tests are followed up by a second round of testing. By focusing the second round of testing on only those with positive results in the first round, we are effectively testing a subgroup with higher prevalence, with the effect that each round can improve in performance; a sketch of this appears below. Such approaches have been implemented in many university settings, where positive antigen tests are followed up with 'more accurate' PCR tests that have higher specificity. While our examples are based on COVID-19, the same issues arise whenever large groups of people are tested. For example, consider the question of whether we should extend recommendations for routine mammogram screening for breast cancer in women or PSA screening for prostate cancer in men to younger ages. As with most cancers, the prevalence of both of these cancers increases with age. Adding younger individuals to the testing pool will lower the overall prevalence in the tested population and, as seen above, will lower the positive predictive value of the tests. These are challenging decisions: each case detected early is important, but many false positives will likely sour the population on participating in screening programs.
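As a rough illustration of the multistage idea, the sketch below applies the same test twice, treating the first-round positive predictive value as the prevalence entering the second round. This assumes the two test results are independent given true antibody status, which is a simplification (correlated errors would weaken the gain), and it reuses the Cellex figures rather than a more specific confirmatory test.

```python
# Two-stage testing: retesting only first-round positives raises the
# prevalence in the second round, so the same test achieves a much higher
# positive predictive value. Assumes independent errors across rounds.

def ppv(prevalence, sensitivity, specificity):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

prev1 = 0.05                           # illustrative prevalence among tested
round1 = ppv(prev1, 0.938, 0.956)      # PPV after one round (~53%)
# Among first-round positives, the prevalence is now round1 itself.
round2 = ppv(round1, 0.938, 0.956)     # PPV after retesting the positives
print(f"Round 1 PPV: {round1:.0%}, Round 2 PPV: {round2:.0%}")   # ~53%, ~96%
```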
The definitions and descriptions here provide tools for exploring what levels of prevalence in the testing population will provide adequate performance for both individual tests and the testing strategy in general. In summary, the evaluation of testing programs requires clear thinking and careful reporting. While a statement such as 'Applying a test with 99% sensitivity and 95% specificity can result in over 50% false positives!' makes for provocative reading, it compares different percentages of different subpopulations (individuals with antibodies, individuals without antibodies, and those receiving positive tests) and it omits any mention of prevalence. When writing, we should be very careful to specify definitions and the groups to which percentages refer. When reading, we should ask 'of what?' after every percentage, and recall the specific questions that each term answers as well as the specific subpopulation to which each term refers.

Disclosure Statement
The authors have nothing to disclose.