The article by Teresa Sullivan (2020) on the role of the census in the official statistics of the United States of America, and the political issues that it faces, is comprehensive, insightful, and timely. However, reading this article encourages one to think outside the U.S. context, both in terms of the roles that censuses play in other countries, and the future of these censuses given the huge and growing access to digital information about the people and the societies making up these countries. Here we see a rather different picture, alluded to briefly in Section 4.2 of the article, but one with potential implications for the concept of a census as the keystone in a national statistical infrastructure.
As Sullivan so aptly characterizes, the system of official statistics in a country represents a crucial type of infrastructure, one that is defined by the data sources and the data-processing methods that underpin the production and dissemination of statistical information about the country. Maximizing the utility of this national statistical infrastructure requires the capacity to use the data collected to provide snapshots of the social, economic, and geographical characteristics of the country's population at distinct points in time, as well as the capacity to track the dynamics of these characteristics. This in turn implies the capacity to construct ‘census-like’ or ‘full population’ data registers at regular time intervals, as well as the capacity to link data for individuals, households, and other social structures across these time intervals on these dynamically changing registers. This also provides the infrastructure for linking in data from other, more specialized registers, as well as data from continuing and one-off surveys of the population. In effect, the ‘census register’ is the key for the construction of a spatiotemporal ‘data spine’ that, through linkage to more specialized data sources, can serve as the foundation for provision of population-representative cross-sectional and longitudinal data at regular intervals, covering health, genetic, demographic, education, social welfare, and socioeconomic dynamics. It also provides the capacity to identify and to follow up key population cohorts, as well as to allow rapid increases in sample size as necessary to produce representative and highly granular ‘small area’ cross-sectional and longitudinal analyses.
Creation of such a data spine in order to underpin U.K. social science research was the leading recommendation in the 2017 Longitudinal Studies Strategic Review commissioned by the U.K. Economic and Social Research Council (ESRC). To quote from this report: "Policy relevant research requires longitudinal data that is representative of a rapidly changing UK population. This in turn requires the future design of ESRC longitudinal data resources to be capable of dynamically representing this population. We therefore recommend that the ESRC develop and maintain a longitudinal administrative data spine with maximum population coverage that can be used as the basis for data linkage and as a sampling frame for its longitudinal surveys." For more detail see Davis-Kean et al. (2018).
The concept of a data spine underpinning the social and economic statistics of a country is not new, with a large number of national statistics organizations seriously considering (and in many cases implementing) variants of register-based censuses, including combining registers with ‘light touch’ censuses and sample surveys. This has been driven in large part by decreasing organizational budgets, rapidly increasing costs associated with standard enumeration-based census-taking exercises, and increasing levels of noncontacts and nonresponse. For example, Statistics New Zealand (Stats NZ, 2019) has developed an integrated data infrastructure by probabilistically linking a number of administrative registers, including those for births, deaths, taxation, and immigration, to define a data spine corresponding to an "ever-resident" population. In turn, this spine has been used to synthesize a residence-based census of the New Zealand population by applying appropriate inclusion and exclusion rules to the individual records on the spine. As noted by Choi (2019), this purely register-based synthetic census has coverage issues. In theory these can be corrected by linking N.Z. census records to the spine, but in practice linkage errors make this problematic. Problems with the much larger than expected differential undercoverage experienced in the 2018 N.Z. census (Munro, 2019), a consequence of trying to make online census returns the default option, muddy these waters somewhat and serve to emphasize the need for an accurate population register that can play the role of the spine. This seems particularly important given that Stats NZ is investigating future censuses based on integration of data sources rather than population enumeration.
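To make the idea of probabilistic linkage concrete, the following is a minimal sketch of a Fellegi-Sunter-style match score for a pair of register records. It is an illustration under stated assumptions, not a description of the method used by Stats NZ: the field names, the m- and u-probabilities, and the decision thresholds are all hypothetical.

```python
import math

# Hypothetical parameters (illustrative values only):
# m = probability a field agrees when two records belong to the same person,
# u = probability a field agrees when they belong to different people.
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "birth_date": (0.98, 0.003),
    "sex": (0.99, 0.5),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum log2 likelihood ratios over the comparison fields."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify(weight, upper=8.0, lower=0.0):
    """Three-way decision using assumed thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "clerical review"


tax = {"surname": "NGATA", "birth_date": "1980-05-17", "sex": "F"}
birth = {"surname": "NGATA", "birth_date": "1980-05-17", "sex": "F"}
print(classify(match_weight(tax, birth)))
```

Pairs that fall between the two thresholds are where linkage error arises: whichever way they are resolved, some true matches are missed and some false matches are accepted, and this carries through to the coverage of any census synthesized from the spine.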
A similar move is underway at the Italian National Institute of Statistics, which intends to replace a traditional enumeration census carried out every 10 years, plus more frequent population surveys, with probabilistic linkage of different administrative data sources to create a national data spine or population register, with official population surveys then based on the spine, including production of all census outputs.
In the Nordic countries (Norway, Sweden, Finland, Denmark, and Iceland), longstanding availability of continuously updated and comprehensive population registers, combined with unique population identifiers, makes census-taking unnecessary because these registers are easily linked to define a data spine and hence provide all the information provided in a traditional census. A variant on this situation is the Dutch virtual census, which has been implemented every 10 years since 1981. This also works by linking together various administrative data sources (again using a unique population identifier) to create a population register. However, the information on this register is insufficient to meet census requirements, and so this ‘missing’ information is sourced from contemporaneous population surveys by reweighting their data in order to line up the resulting survey estimates with the coverage of the population register.
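The reweighting step can be illustrated with simple post-stratification, in which survey weights are rescaled so that weighted survey counts reproduce the register counts within each stratum. This is a minimal sketch under stated assumptions, not the actual procedure used for the Dutch virtual census; the strata, the counts, and the `employed` variable are hypothetical.

```python
from collections import defaultdict


def poststratify(sample, register_counts):
    """Rescale weights so weighted sample counts match register counts per stratum."""
    weighted = defaultdict(float)
    for unit in sample:
        weighted[unit["stratum"]] += unit["weight"]
    adjusted = []
    for unit in sample:
        factor = register_counts[unit["stratum"]] / weighted[unit["stratum"]]
        adjusted.append({**unit, "weight": unit["weight"] * factor})
    return adjusted


def weighted_total(sample, variable):
    """Estimate a population total for a variable observed only in the survey."""
    return sum(unit["weight"] * unit[variable] for unit in sample)


# Hypothetical survey records carrying an item the register does not hold.
survey = [
    {"stratum": "under_40", "weight": 100.0, "employed": 1},
    {"stratum": "under_40", "weight": 100.0, "employed": 0},
    {"stratum": "40_plus", "weight": 100.0, "employed": 1},
]
register = {"under_40": 600, "40_plus": 400}  # assumed register counts

calibrated = poststratify(survey, register)
print(weighted_total(calibrated, "employed"))
```

After adjustment the weights sum to the register count in every stratum, so estimates for the survey-only item are consistent with the population that the register defines.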
Similarly, although it will still conduct a traditional enumeration-based census in 2021, the Office for National Statistics in the United Kingdom has signaled that it intends to move to some form of register-based census in 2031. One possibility would be to base this on the population information contained in the various registers underpinning the U.K. National Health Service (NHS) (which could then define a U.K. data spine). However, this spine will not contain the breadth of information required of a census and so further data sources would need to be linked to it.
Intrinsic to any linked register strategy for a census is the realization that a census is essentially a set of statistical outputs. That is, it is the result of collection and manipulation of data from various sources at a point in time to produce highly granular estimates covering a specified population and for a range of census items that can be related to similar estimates produced in previous censuses of that population. In effect the coverage, granularity, and longitudinal nature of census outputs are important, and not the methods used to obtain the contributing data. All that matters is that these data are of the required quality. From this point of view censuses are part of the statistical infrastructure of a nation, but are not necessarily distinct from other components (registers, surveys) of this infrastructure. What is important is the capability to use the infrastructure to create census outputs with the properties described above. This stands in sharp contrast to the current situation in the United States, where an enumeration process is constitutionally required, and so there can be no integration-based approach to producing census outputs. But, as Sullivan (2020) points out, this state of affairs may not be set in stone. The current U.S. government has instructed that an alternative population register be created, by linking existing official registers, in order to provide the administration with fine-level information on the geographic distribution of U.S. citizens. Putting to one side the actual feasibility of this project, this does raise issues about how the information contained in such a register could be used to influence how census enumeration data (which do not take account of citizenship) are used to define political boundaries and allocation of federal funding.
In her article, Sullivan hints that in the long term a data spine for the United States could become its statistical infrastructure keystone, based possibly on a Medicare data register covering the entire U.S. population (as with the NHS in the United Kingdom). But this will require a huge reorientation of the way that the statistical infrastructure in the United States is currently organized. In such a scenario it is hard to see how the population enumeration approach of the current U.S. census will continue to be justifiable, except as a legal requirement dictated by the U.S. Constitution.
There can be considerable financial advantages to replacing an enumeration-based census infrastructure with a linked register-based infrastructure, provided the latter is feasible. However, there are also risks. Many of these are shared with those discussed in Section 4 of Sullivan's (2020) article, including popular mistrust of the integrity and confidentiality of the individual data stored on a data spine, as well as the possible misuse of these data by a government. To these can be added the very real risk that a data-linking approach to development of a spine requires a considerable degree of cooperation and collaboration between the different government agencies that ‘own’ the component registers used to build the spine. In many cases (and I suspect the United States is one) there is little history of such cooperation, and in fact there can be well-developed ‘silo’ mentalities within agencies, usually justified by the fact that the registers are required for agency operations that have nothing to do with national statistical infrastructure. In effect, the stability, security, and veracity of these registers become an issue. An additional issue is that information stored on these component registers needs to be comprehensive enough so that standard statistical concepts can be used to extract census values from the linked register. This is particularly true for linking variables, which need to have harmonized definitions on the component registers. This is not as important where a unique population identifier exists, but for countries like the United States with no tradition of population identifiers, lack of harmonization in linkage variables can lead to unacceptable linkage errors.
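As a small illustration of what harmonizing linkage variables involves, the sketch below maps names and dates of birth recorded in different conventions onto a common form before records are compared. It is a minimal example under stated assumptions: the record layout, the normalization rules, and the list of accepted date formats (including day-before-month order for slash-separated dates) are hypothetical and would differ from register to register.

```python
import unicodedata
from datetime import datetime

# Assumed date conventions across the component registers (hypothetical).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def normalize_name(name):
    """Strip accents, punctuation, case, and surplus whitespace from a name."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(kept.upper().split())


def normalize_date(text):
    """Return an ISO 8601 date string, or None if no assumed format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def linkage_key(record):
    """Harmonized (name, date of birth) pair used to compare records."""
    return (normalize_name(record["name"]), normalize_date(record["dob"]))


health = {"name": "José  Núñez", "dob": "12/07/1985"}
tax = {"name": "JOSE NUNEZ", "dob": "1985-07-12"}
print(linkage_key(health) == linkage_key(tax))
```

Without this step the two records would disagree on both fields and could be treated as different people. Rules of this kind reduce, but do not eliminate, the linkage errors that arise when no unique identifier is available.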
Census-like activities globally are in a state of flux. Countries with existing strong statistical registers and unique population identifiers no longer conduct enumeration-based censuses, while those with an increasing level of register activity are moving to strengthen their register systems and to integrate them, with the aim of moving away from expensive enumeration-based censuses. In effect, they plan to build new statistical infrastructure that takes advantage of the explosive growth in individual data capture, storage, and access over the last decade. Sullivan's (2020) article implies that the United States now faces the dilemma of continuing with its extremely high-quality but also extremely expensive enumeration strategy for the U.S. census, or trying to move to something less expensive, involving a degree of register integration, but that is still able to create the detailed census outputs required. This is complicated by a constitutional requirement that there be a "count of persons" every 10 years. Whether this means that the U.S. census will continue to be a purely enumeration-based exercise, or whether some compromise can be reached on this legal requirement, remains to be seen. In any case, Sullivan is absolutely correct in her assessment that the integrity of the statistical outputs produced under either scenario will depend on the integrity and professionalism of the statisticians involved, and the transparency of the U.S. statistical agencies that will be responsible for these outputs.
Davis-Kean, P., Chambers, R., Davidson, L., Kleinert, C., Ren, Q., & Tang, S. (2018). Longitudinal studies strategic review: 2017 report to the Economic and Social Research Council. Retrieved from https://esrc.ukri.org/news-events-and-publications/publications/corporate-publications/longitudinal-studies-review-2017/
Choi, H. (2019). Adjusting for linkage errors to analyse coverage of the administrative population. Statistical Journal of the IAOS, 35, 253–259. https://doi.org/10.3233/SJI-180483
Munro, B. (2019, March 4). And then there were nine. Otago Daily Times. Retrieved from https://www.odt.co.nz/lifestyle/magazine/and-then-there-were-nine
Sullivan, T. A. (2020). Coming to our census: How social statistics underpin our democracy (and republic). Harvard Data Science Review, 2(1).
Stats NZ. (2019). Integrated data infrastructure. Retrieved from https://www.stats.govt.nz/integrated-data/integrated-data-infrastructure/
This article is © 2020 by Ray Chambers. The article is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the authors identified above.