Data science is often described as the union of mathematics, statistics, and computer science that aims to conduct research and promote education in the development of scientific methods to extract knowledge and insights from data. Teresa Sullivan’s (2020) article discusses a highly important aspect of that mission: the valuable role of statisticians in ensuring the integrity of data. Over the years, statisticians have developed approaches for ensuring data integrity, using methods of data collection, questionnaire design, and subsequent release that ensure both accuracy and anonymity; with changing technology, these methods need continual development and updating.

Statisticians play important roles in issues related to public policy, such as the integrity of data collected from the decennial census. How involved should statisticians become when the issues not only have wide impact on our residents, but also have political implications? Statisticians should use their talents to provide unbiased recommendations in this age of massive data and machine-learning algorithms, perhaps especially when the issues involve science and policy. The decennial census reminds us that statisticians’ input is critical for the proper design of questions in a survey or census. By ignoring this responsibility, we may unknowingly contribute to poor decisions that affect other decisions and many people. The field of statistics must remain relevant and critical to society, particularly when it comes to the integrity and accuracy of government data.

I am one for whom most government data have become so routine as to be unnoticed. I check weather.gov almost daily. And two of my projects depended critically on census data. In one, we needed to stratify the 3,000-plus U.S. counties by some measure of “county urbanization”: the number of people exposed, accounting for county land area. Traditional measures such as “total population” and “population density” suffered from shortcomings. For example, in 2010 Androscoggin County, Maine, and Napa County, California, had similar populations (107,690 versus 136,794), but different population densities (230 versus 388). Conversely, Laramie County, Wyoming, had a population density of 30 persons per square mile in 2010, very similar to that of Clear Creek County, Colorado (23), but a vastly different population (92,219 versus 9,605). The logarithm of the root sum of the squared populations of the three largest places in a county met our needs very nicely, and the Census Bureau’s City and County Data Book was indispensable (Goodall et al., 1998). In the other project, two colleagues and I used data from the American Community Survey at the level of the ZIP Code Tabulation Area (ZCTA), with information on not just population (and population density) but also quantiles of educational attainment, age, and household income. The availability of recent census data was critical for our project (Lobo, Bonds, & Kafadar, 2019).
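The urbanization measure described above (the logarithm of the root sum of the squared populations of a county's three largest places) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in Goodall et al. (1998): the function name `urbanization_index`, the natural logarithm as the default base, and the handling of counties with fewer than three places are all assumptions, and the place populations in the usage lines are invented for illustration.

```python
import math


def urbanization_index(place_populations, top_n=3, log_base=math.e):
    """Log of the root sum of squares of the largest `top_n` place populations.

    Assumptions (not specified in the text): natural log by default, and a
    county with fewer than `top_n` places simply uses the places it has.
    """
    # Keep only the largest places in the county.
    largest = sorted(place_populations, reverse=True)[:top_n]
    # Root sum of squares: the biggest place dominates the total.
    root_sum_squares = math.sqrt(sum(p * p for p in largest))
    if root_sum_squares <= 0:
        raise ValueError("county needs at least one place with positive population")
    return math.log(root_sum_squares, log_base)


if __name__ == "__main__":
    # Hypothetical counties with the same total population in their top three places.
    concentrated = [90_000, 1_000, 1_000]   # one sizable city
    dispersed = [31_000, 31_000, 30_000]    # three mid-sized towns
    print(urbanization_index(concentrated))
    print(urbanization_index(dispersed))
```

Because the populations are squared before they are summed, a county whose residents are concentrated in one large place scores higher than a county with the same number of residents spread across several smaller places, which is the behavior that total population and population density fail to capture.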

Similarly, we routinely rely on incidence and mortality rates collected by the U.S. government’s Department of Health and Human Services, which includes the National Institutes of Health, the National Center for Health Statistics, and the Centers for Disease Control and Prevention, as well as safety statistics from the Federal Aviation Administration, Bureau of Justice Statistics, Food and Drug Administration, and Bureau of Transportation Statistics. It is hard to imagine how we could operate without access to reliable, timely data collected and released by the federal government.

Reamer (2019) provides a link to a list of 316 federal programs that depend on data from the 2010 decennial census, totaling $1.504 trillion, including financial assistance programs, matching payments, and tax credit and procurement programs. Time is short between now and the census, and we all need to pool our collective talents in mathematics, computer science, and statistics to minimize the spread of disinformation about the census and ensure the integrity and accuracy of census data.

Current technology affects both the size of data sets that can be stored and transmitted and the potential to de-anonymize protected data. Compared with previous decades, the technological advances of 2020 raise two new challenges for statisticians: the potential for invading protected records and the rapid spread of erroneous information about the decennial census. To address the first, statisticians are developing methods of differential privacy to ensure a desired level of accuracy with a given degree of protection. This research has wide practical implications, not only for activities of the Census Bureau (decennial census, American Community Survey), but also for all data outputs from our Federal Statistical System, and even for data available at the state level.
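The trade-off between accuracy and protection can be illustrated with the textbook Laplace mechanism applied to a single count. This is a sketch under stated assumptions, not the Census Bureau's disclosure avoidance system, which is considerably more elaborate: the function names, the example count of 1,000, and the epsilon values are all hypothetical. Smaller values of the privacy-loss parameter epsilon give stronger protection and noisier published counts.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity of a simple count query is 1 (the default here).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials=20_000, seed=0):
    """Average absolute error of the noisy count over repeated releases."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(private_count(true_count, epsilon, rng) - true_count)
    return total / trials


if __name__ == "__main__":
    # Hypothetical block-level count; the epsilon values are for illustration only.
    for eps in (0.1, 1.0, 10.0):
        print(eps, mean_abs_error(1_000, eps))
```

In theory the expected absolute error of this mechanism equals sensitivity divided by epsilon, so a tenfold increase in protection (a tenfold smaller epsilon) costs a tenfold loss in accuracy for a single count.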

To address the second challenge, statisticians and computer scientists have begun collaborating on the research and practical problems associated with disinformation. Some of these collaborations have led to websites offering resources on disinformation and avenues for research, but more is needed.

I have little doubt that statisticians can contribute to both efforts. My concern lies with the controversies that we communicate to the public, which may lead people to distrust the Census Bureau, its instruments, its data, and its efforts to ensure accuracy and privacy. We may disagree on the advantages and disadvantages of various algorithms for differential privacy, but it would be highly unfortunate if the public saw those scientific discussions as ‘even statisticians can’t agree among themselves, so why should we believe any of them?’ We should emphasize the objectives on which we all agree: we need to apply some algorithm, or combination of algorithms, to continue to ensure access to accurate and deidentified federal data on which we have all come to rely.

Should statisticians engage in the debate regarding the need for adequate field testing to assess the effects on accuracy, completeness, and response rates if a citizenship question were asked on the 2020 U.S. decennial census? I believe that as statisticians, we have an obligation to recommend further study when it is warranted, and to urge caution if further study is not feasible, even when we like the proposal. A good example concerns the widespread use of the CA-125 test to screen for ovarian cancer. Gynecologists believed strongly in the value of that test; more than a few women underwent oophorectomies because of a CA-125 test result. But the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial, conducted by the National Cancer Institute, showed no reduction in mortality among screened participants (Pinsky et al., 2016). The trial might have found the opposite. Sometimes appropriate studies are essential.

As noted in the footnote, this article (Sullivan, 2020) arose from the author’s presentation as the American Statistical Association President’s Invited Address at the annual Joint Statistical Meetings (JSM) in 2019. I thank Sullivan, not only for an excellent article of importance to statisticians and data scientists, but also for adding so much to JSM 2019.

Acknowledgments

My thanks to David Hoaglin for his comments, which greatly improved this discussion.

Disclosure Statement

Karen Kafadar has no financial or non-financial disclosures to share for this article.

References

Goodall, C. R., Kafadar, K., & Tukey, J. W. (1998). Computing and using rural versus urban measures in statistical applications. The American Statistician, 52(2), 101–111. https://doi.org/10.2307/2685467

Lobo, B., Bonds, D., & Kafadar, K. (2019). Estimating local prevalence of obesity via survey under cost constraints: Grouping ZCTAs in Virginia’s Thomas Jefferson Health District. Manuscript submitted for publication.

Pinsky, P. F., Yu, K., Kramer, B. S., Black, A., Buys, S. S., Partridge, E., … Prorok, P. C. (2016). Extended mortality results for ovarian cancer screening in the PLCO trial with median 15 years follow-up. Gynecologic Oncology, 143(2), 270–275. https://doi.org/10.1016/j.ygyno.2016.08.334

Reamer, G. (2019). The role of the decennial census in the geographic distribution of Federal funds, Brief 7: Comprehensive accounting of census-guided federal spending (FY2017), Part A: Nationwide analysis, GW Institute of Public Policy. Retrieved from https://censusproject.files.wordpress.com/2019/11/ipp-1920-1-counting-for-dollars-2020-com

Sullivan, T. A. (2020). Coming to our census: How social statistics underpin our democracy (and republic). Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.c871f9e0

©2020 Karen Kafadar. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.

Statisticians’ Role in Ensuring Accuracy and Integrity of Federal Data
