Objective statistical information is vital to an open and democratic society. It provides a solid foundation so that informed decisions can be made by our elected representatives, businesses, unions, and non-profit organizations, as well as individual citizens. There is a marked shift toward a more virtual and digital economy and society, and traditional official statistical systems, centered on surveys, must adapt to this new digital reality. National statistical offices have increasingly embraced non-survey data sources, along with data science methods, to better serve society.
This paper provides a blueprint for the application of data science in a government organization. It describes how data science enables innovation and the delivery of new high-value, high-quality, relevant, and trusted products that reflect the ever-evolving needs of our society and economy. We discuss practical operational considerations and impactful data science applications that supported the work of Statistics Canada’s analysts and front-line health agencies during the pandemic. We also discuss the innovative use of scanner data in lieu of survey data for large business respondents in the retail industry. We describe computer vision methodologies, including machine learning models used to detect the start of building construction from satellite imagery, to measure greenhouse area and production, and to detect crop types. Data science and machine learning methods have tremendous potential, and their ethical use is of primary importance. We conclude the paper with a forward-facing view of responsible data science use in statistical production.
Keywords: National Statistical Office, data science applications, machine learning, statistical information, alternative non-survey data, government organizations, hub and spoke model, responsible data science use
Data science methods have tremendous potential. This article provides a blueprint for the application of data science in a government organization, and can inspire other data scientists or public sector organizations to use innovative data science methods for a positive impact.
Statistics Canada is Canada’s national statistical office. The agency is data-driven, and data are at the core of the organization’s business. The agency produces statistics that help Canadians better understand their country’s population, resources, economy, society, and culture. Objective statistical information is essential to an open and democratic society. It provides a solid foundation so informed decisions can be made by our elected representatives, businesses, unions, and nonprofit organizations, as well as individual Canadians.
Data are a necessary part of statistics, enabling us to make sense of various phenomena and to compare trends over time. With the shift toward a more virtual and digital economy and society, the demands for more timely, detailed, and integrated statistics have grown significantly, a trend intensified by the COVID-19 pandemic. We must now adapt to use these data responsibly and to better serve society. Data science methods facilitate the shift toward the use of new data types.
We describe how data science enables innovation and the delivery of new high-value, high-quality, relevant, and trusted products that reflect the ever-evolving needs of our society and economy. We will also discuss practical operational considerations, and how we use a hub and spoke organizational model to conduct data science activities. This article will also provide the key elements and the methodology used for the development and delivery of data science products. These are intended as thought-provoking examples providing some of the rationale that led the organization to adopt these approaches.
Most importantly, we will describe in detail impactful data science applications that enabled us to support front-line health agencies and Statistics Canada’s analysts in the fight against COVID-19. We will also discuss innovative machine learning applications that allowed us to use scanner data in lieu of survey data for large business respondents in the retail industry. The use of data science is fundamentally transforming the way statistics are produced. For instance, state-of-the-art computer vision methodologies, along with machine learning models, are used to detect the start of building construction from satellite imagery, the size, coverage, and production of greenhouses, as well as crop types. We will also mention other data science applications used to extract information from unstructured documents to automate time-consuming manual work.
The article concludes with a forward-facing view of data science and its responsible use in statistical production.
Statistics Canada is the national statistical office of Canada. The agency is data driven, and data are at the core of the organization’s business. The agency’s main objective is to produce statistics that help Canadians better understand their country’s population, resources, economy, society, and culture. In addition to the census, which happens every 5 years, Statistics Canada collects data from about 450 active surveys on almost all aspects of Canadian life. Objective statistical information is essential to an open and democratic society. It provides a solid foundation so that informed decisions can be made by our elected representatives, businesses, unions, and nonprofit organizations, as well as individual Canadians.
Society is becoming more complex and the fourth industrial revolution is driving change in all aspects of life. Technological advancements are generating volumes of data we have never seen before. Satellites and mobile phones, sensors and Internet of Things devices, robotics and automation, Global Positioning Systems and 5G telecommunication networks, higher compute power on smaller and smaller chips along with e-commerce and social media—all continue to create high-volume, high-variety, and high-velocity data. The speed at which these changes are happening is unprecedented, and staying relevant is important—especially in the context of the COVID-19 pandemic. Organizations such as Statistics Canada need to adapt now to address the growing societal needs for timely analytical outputs. Advancements in the field of data science, and more specifically in machine learning, will provide solutions to some of these new challenges.
Data science is an interdisciplinary field that uses scientific methods and algorithms to extract information and insights from diverse data types. It combines domain expertise, programming skills, and knowledge of mathematics and statistics to solve analytically complex problems.
In the space of data science, we use a variety of methods and tools for data analytics and data pipeline creation to obtain insights from data. One of the most promising tools within data science is machine learning. These are algorithms that allow “computers to automatically learn from experience instead of relying on explicitly programmed rules, and generalize the acquired knowledge to new settings” (Yung et al., 2018). In essence, machine learning automates analytical model building through optimization algorithms and parameters that can be modified and fine-tuned.
Machine learning algorithms enable the processing of nonsurvey data, such as large and unstructured data, also referred to as big data. For example, by using machine learning algorithms, millions of product descriptions obtained from retail stores can now be classified into standard classifications in a fast and accurate way. Similarly, satellite images can now be used to detect greenhouses and construction sites. These examples demonstrate the need to quickly adopt new data science methods to provide timely, trusted, and high-quality data that inform citizens, government programs, policies, and services.
In addition, data engineering is becoming a distinct area of expertise, built at the intersection of data science and software engineering. As larger amounts of data are integrated into data processing pipelines, the importance of how data ingestion is handled is increasing, including data wrangling and data preparation for diverse data types and formats. At Statistics Canada, the need for data engineering expertise has grown markedly in recent years.
The shift toward a more virtual and digital economy and society means we are spending more time online. Our phones are used to order dinner, request a ride, deposit a cheque, track steps, or order goods and services. We are also spending more time on the internet and social media. The traditional statistical systems centered on surveys must be adapted to this digital reality. Using alternative data sources is no longer just a desirable option; it has become a necessity. This is why Statistics Canada is embracing new digital data sources to better serve society. Data science methods facilitate the shift toward the use of new data types.
Data science and machine learning are not fundamentally new (Zinsmeister et al., 2019). Statistical agencies have been using modeling techniques and data analytics for a long time. These techniques include modeling for stratification, imputation, and estimation purposes such as small domain estimation, model-based editing, and imputation processes.
However, modeling methods have evolved over time and today their focus is on big and unstructured data processing needs. Improvements to modeling techniques were made possible by the increased availability of computational capacity, such as GPUs (graphics processing units), as well as by more efficient data ingestion tools that effectively manage RAM (random access memory) and CPU (central processing unit) resources. It is worth noting that improvements to modeling techniques were also driven by increased access to large volumes of structured and unstructured data from text, images, videos, sensors, and other sources.
Working at the leading edge to use modern and automated methods is not new at Statistics Canada. In fact, this has been a tradition that dates back to its early days when it was using electrical and compressed-air tabulating machines before the advent of computers. Thanks to the leadership of Chief Statisticians such as Martin B. Wilk, who pushed for more pragmatic approaches still resting on solid science, and Ivan P. Fellegi, who strongly encouraged and led research and development in many areas of survey methodology in a way that balanced relevance and rigor, the table was set to go beyond the pure design-based initial context of survey theory and practice that had been in place since the 1940s. For example, the pioneering work of Fellegi et al. (1976) on edit and imputation paved the way for extensive developments in estimation under nonresponse (Lee et al., 2002), a topic that required researchers and practitioners to think outside traditional approaches. In terms of statistical methods, Statistics Canada, led by methodologists, increasingly explored new uses of administrative data, and this also meant increasing reliance on modeling. Significant research efforts led to the development of frameworks in the context of complex designs and complex parameters (Binder, 1983), and the introduction of the use of models. This continued with the support of experts on model-assisted survey sampling (Särndal et al., 1992) and then on small area estimation (Rao & Molina, 2015). With this rich history of use of nonsurvey data (administrative data used to support edit and imputation, calibration, or as auxiliary information for small area estimation) and modeling techniques, the natural next step was to consider big data and data science.
Another important factor that has contributed to the dramatic expansion of data science capabilities is the creation of the open source ecosystem. Today, data scientists from around the world collaborate on new algorithms and share open source code that may have taken years to come up with. Implementing these open source tools accelerates development, reduces project costs, and results in faster turnaround times, allowing projects to move from the development stage into production quickly.
The agency’s use of data science in its modern forms began in 2017 and was motivated by the agency’s modernization efforts (Arora, 2018), as well as the emerging need for big data processing. Retail scanner data had been acquired, and tens of millions of records needed to be processed on a weekly basis, classifying products in order to produce monthly estimates. The fast evolution of data science methods in academia and the private sector, along with the business needs for processing newly acquired big data, had shown the need to build capacity in data science. The data science experimental work initiated by a small group of data scientists in 2017 demonstrated the strong potential of these algorithms to automate at scale and solve problems that, until then, had no viable solution.
Statistics Canada is integrating new data science methods and technologies, with long-standing analytical expertise, to provide better social and economic insights to Canadians and policymakers. Data science methods constitute a wealth of tools and bring a solid addition to traditional methods. To benefit from the opportunities that the open ecosystem of data science offers, the agency is working toward implementing data science at scale, and transforming the organization to fully integrate data science in its day-to-day operation.
Data science helps address evolving expectations in a constantly adapting manner. It enables the integration of nonsurvey data to create new high-value, high-quality, granular, relevant, accessible, and trusted products that reflect the ever-evolving needs of society, and are used by decision makers to make evidence-based decisions. Today, these new approaches integrate data that were previously not considered (e.g., text, image, and sensor data) and can now be exploited across the agency’s statistical programs.
These leading-edge data science tools and platforms enable a new scale of work automation and more efficient service delivery. Statistical programs thus become faster and more efficient, freeing up resources for high-value work and expanding the agency’s capacity to meet the growing demand for analytical products.
More specifically, the agency is combining traditional statistics and data science to deliver faster and timelier products to Canadians, reduce response burden on households and businesses, produce more granular and complete statistics, enhance privacy and confidentiality, implement a new scale of knowledge work automation and cost-efficient operations, serve as a new data collection tool, and produce new insights, to thus meet evolving data-driven needs. As these are being implemented, the agency remains very aware of the limits and potential biases of nontraditional data and continues to be active in research (Beaumont, 2020) to support the way forward. National Statistical Organizations aim to provide a faithful picture of society. This inferential process has been implemented by means of statistical theory and surveys. With data science, new possibilities are now opening up, but they may not always support valid inference. As demands for data rise, data science methods now provide the capacity to answer some of these demands where inference is not necessarily needed. There is room for experimental statistics and cases of descriptive statistics. For example, at the beginning of the pandemic, there was an urgent need to gather information about the reality of people in terms of COVID-19 cases and deaths, and this was rapidly organized through nonprobability internet collection. Similarly, machine learning techniques might be used with some nonrandom data sets. However, as soon as a need for inference exists, official statistics are produced using a statistical framework that enables measurement of error and valid conclusions.
It is also important to note that often the use of machine learning within a statistical agency is not to replace statistical methods. It is to complement the statistical methods and tools used, where automation is useful, for instance, as a result of receiving larger volumes of digital data. This is usually applied within one or more stages of the statistical production process. Such an approach keeps the quality of the program outputs consistent with previous practices and targets. Another consideration is that machine learning methods usually focus on the accuracy of predictions as part of manual work automation, while statistical production outputs focus on ensuring the quality of estimation and on reducing or measuring uncertainty. As an example, natural language processing is applied to retail scanner data from one major retailer in Canada to automate the coding of commodities to a standard classification, improving consistency and timeliness and resulting in an overall improvement in the quality of the coding. The retail scanner data have a greater degree of granularity, which also permits the production of more detailed statistics at the commodity level for this retailer. The retail program’s quality indicators were measured and assessed before and after the introduction of machine learning to ensure that the same or better quality of statistical measurement is produced while maintaining a predefined level of accuracy of the machine learning models. During the production cycle, when machine learning models are used for classification, we apply quality assurance methods to ensure that a judicious sample of what is coded by the machine is validated by humans, yielding both a quality measure of the coding and a way to detect whether the models need to be refreshed.
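This quality assurance loop can be sketched in a few lines. The snippet below is an illustrative simplification, not the agency's production system: the simple random design, the 5% sampling rate, and the 0.95 accuracy threshold are all hypothetical placeholders.

```python
import random

def qa_sample(records, sample_rate=0.05, seed=42):
    """Draw a simple random sample of machine-coded records for human review.

    A production design may be more judicious (e.g., stratified by predicted
    class or model confidence); simple random sampling is shown purely for
    illustration.
    """
    rng = random.Random(seed)
    n = max(1, round(len(records) * sample_rate))
    return rng.sample(records, n)

def coding_accuracy(reviewed):
    """Estimate coding accuracy from human-validated (machine_code, human_code) pairs."""
    agreed = sum(1 for machine_code, human_code in reviewed
                 if machine_code == human_code)
    return agreed / len(reviewed)

def needs_refresh(accuracy, threshold=0.95):
    """Flag the model for retraining when measured accuracy falls below
    a predefined level (the 0.95 threshold here is a placeholder)."""
    return accuracy < threshold
```

For instance, a review batch where the machine and the human coder agree on 3 of 4 sampled records yields an estimated accuracy of 0.75, below the placeholder threshold, so the model would be flagged for a refresh.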
Quality considerations need to be applied at all stages of data processing. For instance, the quality of input data is of great importance, and it must be assessed early on whether the available data are suitable to meet the predefined goals of the program. In the context of surveys, the literature on the quality of statistics is extensive, covering not only the estimates themselves but also nonsampling sources of errors. Concerning potential errors, Biemer et al. (1991) provide a comprehensive list of sources of errors. As new data sources are considered, their quality is not under the control of statistical agencies and requires further consideration. For example, Citro (2014) provides a good account of the potential impacts of using data sources other than traditional surveys. Looking at the various data sources as frames, Lohr (2021) presents an approach to designing a statistical program using a combination of alternative data sources with a survey that would have some overlap, such that adequate calibration can be achieved to produce statistics. With new data sources will come all types of variations in the use of surveys and other data. Statistical programs may be survey based or administrative data based, using both survey and administrative data or either (administrative here meaning any data not from surveys).
Through its modernization, Statistics Canada started to pursue an administrative-first paradigm. That is to say, the wealth of information already available will be considered first, and surveys will then be used for what is missing and to provide a probabilistic anchoring to the administrative data used. This raises the issue of how to measure the total quality of the statistical program along the lines of Biemer (2016). Further to this, Rancourt (2018) provides initial thoughts on what a quality framework could be under an administrative-first paradigm. Still, much development remains to be accomplished, and this is why Statistics Canada is moving carefully in this direction.
When inference is not needed, the issues related to the quality of input data (as well as outputs) remain. Statistics Canada (2019) has developed quality guidelines that include elements of quality considerations and measures related to nonsurvey data sources. Moreover, all aspects of data editing and imputation that have traditionally been applied to survey data also apply to nonsurvey data. Specifically, for the context of machine learning, Statistics Canada (2021) also developed a Framework for Responsible Machine Learning Processes, which has a component about the quality of input data used to train algorithms. Machine learning algorithms are (most often) based on forms of nonparametric models, and how to assess the fit of such models has yet to be fully established. In the meantime, Statistics Canada is developing modeling guidelines with a view to including as much information as possible on assumptions and parameters used with machine learning. Such information is to be made available to users.
As Statistics Canada has started to use alternative approaches and data sources, it has remained transparent. For instance, when crowdsourcing was used during the pandemic to obtain some insights on the population, clear descriptions and limitations of the approach were provided in notes to users. As more modeling approaches are used, whether they be automated like machine learning or explicit models, the Policy on Informing Users of Data Quality and Methodology (Statistics Canada, 2000) is updated and will continue to be followed. Also, users can always send questions that will be answered by the experts who produce the statistical product. Statistics Canada established a Trust Centre where people can find information on all the administrative and alternative data sources that are used by statistical programs. In addition, when a product has limitations for any reason, for example, if it is based on experimental methods or relies on strong assumptions, such limitations are spelled out and provided at the time of dissemination of results.
Methodologists and survey statisticians are playing a key role while implementing machine learning by identifying the standards of rigor, ensuring statistically sound methods are used, promoting quality and valid inference when it is needed, and abiding by ethical science practices when deriving insights from data. In the 1990s, Statistics Canada developed a number of generalized systems, incorporating the different advanced methodologies needed to conduct surveys and statistical processes, such as sampling, imputation, estimation, confidentiality, record linkage, time series, or coding. Including machine learning techniques in these generalized systems is natural. For instance, the newest release of G-code includes classification tools using machine learning algorithms such as XGBoost and fastText. In recent years, machine learning methods have been used by methodologists for automating standards and comments classification, evaluating new outlier detection methods, clustering data for imputation, providing nonresponse reweighting or calibration, performing record linkage, generating synthetic data, and automating decisions on the selection of parameters.
In 2019, the agency chose the operational hub and spoke model to implement data science across the organization. The purpose of the central hub is to provide modern methods and digital technology expertise and guidance, while the many smaller spokes spread across the organization leverage subject matter expertise. This hub and spoke model is a hybrid that many organizations have opted for because it uses both centralized and decentralized operating model principles. In the early days of data science, it was not clear whether data scientists would all reside within a central team (an approach often taken to address organizational needs for a specialized skillset) or would be part of teams across the organization (which has many advantages, such as avoiding capacity bottlenecks). More recently, this model is being applied beyond organizational business models, and into new office design models (Robinson, 2021). In the context of our data science organizational model, the hub plays the role of a nucleus for research and development. The spokes in this model complement the hub and provide in-house domain-relevant expertise, quality assurance, and maintenance of models in operation. The central data science hub is hosted with other modern statistical methods divisions, and delivers expert methodology services in support of statistical operations. Its mission is to build data science capacity within the agency, build trust in these new methods, and create the scientific backbone for data science work. Collectively, data science work within the agency is focused on solving concrete problems and delivering practical results that enable business units to move forward confidently with big, unstructured data.
The data science research and development projects in the agency typically adopt minimum viable product management principles (based on the lean startup philosophy, Ries, 2011). They typically begin with a proof of concept of 4 months, and have a primary goal of maximizing the business value delivered to the end user. Prototypes are frequently delivered to clients and the scope is adjusted as required to maximize the value delivered to the client. To accomplish this, data scientists and data engineers welcome change, continuously iterate and collaborate closely with data analysts across the agency. Depending on the available data, the type of problem to be solved, and, subsequently, the quality of the machine learning models built with this data, the data science projects can take many different paths.
Throughout this process, analysts and machine learning specialists work closely and make decisions about data, outputs, methods, and tools. Due to the degree of uncertainty they face and the high pace of change (often driven by new technology, methods, or tools), many of these projects require a multidisciplinary team to deal with the complexities and mitigate risks.
Expertise has been developed in the latest open source, high-performance computing hardware and cloud service tools. In the space of data science methods, an advanced modeling expertise has been built in image processing, natural language processing, deep learning, privacy preserving techniques, traceability methods, new record-linkage strategies, information retrieval, and automation. The areas of specialization include supervised learning, unsupervised learning, deep learning, active learning, and reinforcement learning. Leading-edge practices promoting quality, rigor, and ethical science have been identified.
Access to relevant data sources is essential for the approaches described in this article. In Canada, the Statistics Act (1985) enables Statistics Canada to have access to all administrative data. Nonetheless, access is not automatic. In many cases, access to data is governed through memoranda of understanding and other partnership agreements with the government, administrative institutions, or private industry. In addition, the work on the Government of Canada data strategy (2019) calls for further efforts to enable data access across government entities based on solid data stewardship principles. In parallel to the legal and partnership mechanisms, there needs to be social acceptability for the gathering of data. To this end, Statistics Canada has developed a Necessity and Proportionality Framework that aims at ensuring that the acquisition of any type of data responds to clearly stated public needs. Also, the manner in which the data is obtained and the level of detail sought must be proportional to such needs. Finally, ethical aspects, including privacy, are always accounted for during the process.
This framework enables a context where new data sources are explored to meet new needs and provide relevant statistical products. With the use of any new data source, there are always processes to ensure that it is of good quality and has sufficient coverage to warrant a potential statistical use. This starts with the establishment of a strong partnership with the data provider, so that changes to the data can be foreseen and discussed. Also, before their use, the quality of the data is thoroughly assessed, and when data are considered for a specific program or project, they undergo a series of editing and imputation steps where necessary.
The rapid spread of COVID-19 and the impact of the pandemic boosted the need for timely data at an unprecedented pace to inform front-line health organizations and support decision-making. This made the timeliness of data more important than ever before, and methods were adjusted to respond to needs triggered by the pandemic. In accordance with its mandate to collect information, promote the avoidance of duplication in information collected, and develop integrated statistics (Government of Canada, 2017), the agency had an opportunity to quickly adopt new data sources and data science methods.
Thanks to data science tools, the agency was able to quickly respond to the rapidly changing economy and social situation brought on by the pandemic. This resulted in the creation of applications in economic statistics, social and health statistics, corporate management, census, and more.
Statistics Canada collaborated with Health Canada, the federal department responsible for national health policy, to address the need for rapid data integration to inform health procurement and policy decisions by visualizing supply and demand information for personal protective equipment (PPE). This ensured that there was sufficient PPE to safely meet the needs of Canadians. Before the data visualization could begin, the relevant data needed to be extracted and ingested.
Daily data sources came from provincial or territorial governments, federal departments, and private sector companies hired to source PPE. This data also came in different formats such as Word documents, Excel files, and PDFs, and would therefore require a significant amount of manual work to create standardized reports. To improve this process, data scientists at Statistics Canada created an algorithm that parsed the data into different data entries. The structured data were then presented in a Power BI dashboard that was shared with other government departments to meet their information needs, and to better understand the supply and demand for PPE in Canada. Since the spring of 2020, the project has evolved considerably, and the PPE supply-and-demand information we receive from external organizations is now processed on a monthly basis.
Bayesian modeling methods were used to estimate the effects of COVID-19 nonpharmaceutical intervention (NPI) measures on the daily transmissibility of the virus. The estimates were based on daily COVID-19 hospitalization and death counts. A hierarchical Bayesian model (Flaxman et al., 2020) encoding the presumed stochastic dynamics from infection to hospitalization and death was used to perform the inference. In the fall of 2021, this approach was applied to forecast the impact of reopening on the health care system and to estimate the local COVID-19 hospital burden (Bosa & Chu, 2020).
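At the core of such models is the discrete renewal equation, which links daily infections to past infections through the reproduction number R_t and the generation-interval distribution. The sketch below only forward-simulates this process with hypothetical inputs; the hierarchical Bayesian inference of R_t from hospitalization and death counts (as in Flaxman et al., 2020) requires a probabilistic programming framework and is not shown.

```python
def simulate_infections(r_t, gen_interval, seed_infections):
    """Forward-simulate daily infections with the discrete renewal equation

        i_t = R_t * sum_{s >= 1} g_s * i_{t-s},

    where g (gen_interval) is the generation-interval distribution and
    R_t (r_t) is the time-varying reproduction number.
    """
    infections = list(seed_infections)
    for t in range(len(seed_infections), len(r_t)):
        # Infection pressure: past infections weighted by the generation interval.
        pressure = sum(gen_interval[s] * infections[t - s]
                       for s in range(1, min(t, len(gen_interval) - 1) + 1))
        infections.append(r_t[t] * pressure)
    return infections
```

With all generation-interval mass at a one-day lag and R_t fixed at 2, each day simply doubles the previous day's infections, which makes the recursion easy to check by hand.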
A data science-powered approach was used to support the Public Health Agency of Canada in the exploration of NPI policies. This was done by applying reinforcement learning to identify the actions that software agents would take in a COVID-19 environment as a result of epidemiological modeling and optimization, with the aim of reducing viral spread. Various open data were combined to build a population of software agents. Agents learned behaviors that minimized the spread of COVID-19 infection via reinforcement learning (Denis et al., 2022). These learned behaviors were then mined to provide insight into which behaviors flattened the curve and reduced the spread of infection.
In a second phase, the work was used to simulate return-to-work scenarios for office buildings of various sizes. Several office-related variables were considered, including the use of masks, floor capacity, screening mechanisms, and maximum meeting size. Using the modeling results, scenarios were ranked by the number of at-work infection events. A susceptible-exposed-infectious-recovered (SEIR) compartmental model was used with epidemiological parameters that included the age-dependent probabilities of being symptomatic versus asymptomatic upon infection, infection rates, and sampling distributions of durations for the different stages of infection (latent, asymptomatic, and symptomatic periods). Output from this work suggested optimal return-to-work scenarios that could lead to fewer infections and can be considered as options by employers. A Kibana analytics visualization dashboard was also built to allow end users to perform further analyses.
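A discrete-time version of such a compartmental model can be sketched in a few lines. The parameter values below (a 3-day latent period, a 7-day infectious period, and the infection rate) are illustrative placeholders, not the calibrated, age-dependent parameters of the agent-based simulation described above.

```python
import numpy as np

def seir_step(s, e, i, r, beta, sigma, gamma):
    """One day of a discrete-time SEIR compartmental model.
    beta: infection rate, sigma: 1/latent period, gamma: 1/infectious period."""
    n = s + e + i + r
    new_exposed = beta * s * i / n      # susceptible -> exposed
    new_infectious = sigma * e          # exposed -> infectious
    new_recovered = gamma * i           # infectious -> recovered
    return (s - new_exposed,
            e + new_exposed - new_infectious,
            i + new_infectious - new_recovered,
            r + new_recovered)

def simulate(days, n=1000.0, i0=1.0, beta=0.4, sigma=1 / 3, gamma=1 / 7):
    """Simulate an outbreak seeded with i0 infectious people out of n."""
    state = (n - i0, 0.0, i0, 0.0)
    history = [state]
    for _ in range(days):
        state = seir_step(*state, beta, sigma, gamma)
        history.append(state)
    return np.array(history)  # rows: (S, E, I, R) per day

traj = simulate(120)
peak_day = int(traj[:, 2].argmax())  # day of peak infectious count
```

Ranking return-to-work scenarios then amounts to running such a simulation under different parameter settings (masking, capacity limits, screening) and comparing the resulting infection counts.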
The use of machine learning is also fundamentally transforming the way statistics are produced. Scanner data are being used in lieu of survey data for larger respondents in the retail industry. Point-of-sale scanner data are received from several major grocery retailers and are used in the production of the Consumer Price Index, the Retail Commodity Survey, and the Monthly Retail Trade Survey. Since 2018, a machine learning text classification model has been used in production to classify the product descriptions of one such retailer to the North American Product Classification System (NAPCS) and then obtain aggregate sales for each NAPCS category by area. This work supports the organizational shift to nonsurvey data, with 350 GB processed per year, and reduces response burden, given that one major retailer will no longer be sent the Retail Commodity Survey or the Monthly Retail Trade Survey.
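While the article does not detail the production model, the general shape of such a product-description classifier can be illustrated with a minimal multinomial naive Bayes over bag-of-words features. The product descriptions and category labels below are invented stand-ins for real retailer data and NAPCS codes.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    """Minimal multinomial naive Bayes with Laplace smoothing over
    bag-of-words features; an illustrative stand-in for a production
    text classification model, not the actual system."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # per-class token counts
        self.class_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        best_label, best_score = None, -math.inf
        total = sum(self.class_counts.values())
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                # Laplace-smoothed log likelihood of each token
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical product descriptions mapped to invented category labels
clf = NaiveBayesClassifier().fit(
    ["whole wheat bread loaf", "multigrain bread sliced",
     "2% milk 4l jug", "skim milk carton"],
    ["bakery", "bakery", "dairy", "dairy"])
```

A production system would typically add richer feature engineering, confidence thresholds, and routing of low-confidence items to manual coding, but the classify-then-aggregate pattern is the same.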
State-of-the-art computer vision methodologies, including machine learning models, are used to detect the start of building construction from 100 GB of satellite imagery. In this project, machine learning models based on the U-Net semantic segmentation architecture (Ronneberger et al., 2015) are being trained to determine whether the construction is residential or nonresidential. Construction starts are currently monitored through a survey, and the long-term goal is to replace, in part, the existing survey.
Currently, greenhouse areas and greenhouse production are assessed by traditional census and survey methods, which are costly and impose a significant response burden. Work is underway to automate the identification of greenhouse areas using machine learning techniques applied to satellite or aerial imagery, with the intent to replace or significantly reduce the need to rely on direct contact with respondents.
Monitoring the production of farms in Canada is an important but costly undertaking, as surveys and in-person inspections require a large amount of resources. For these reasons, the agency is modernizing crop type detection using machine learning for image classification. Crop types are predicted from satellite imagery using Bayesian ResNet convolutional neural networks. The results show that this approach is much faster and reduces the survey response burden for farm owners, especially during the busy times of the year. A highly efficient data pipeline has been built to automate ingestion, preprocess the satellite images, and run the models. The models are being scaled up to cover provinces across Canada, with the goal of transitioning this work to production and reducing response burden by eliminating parts of Statistics Canada’s Field Crop Survey.
Extracting information from PDFs and other documents can be a time-consuming process. Statistics Canada has been applying data science to overcome this challenge. As an example, the agency performed experimental work with the historical data set of the System for Electronic Document Analysis and Retrieval (SEDAR), which is used by publicly traded Canadian companies to file securities documents with various Canadian securities commissions. The SEDAR database is used by agency employees for research, data confrontation, validation, frame maintenance, and more. The extraction of information from public securities documents such as financial statements, annual reports, and annual information forms is currently done manually. The documents come in different formats, are often around 100 pages long, and the variables of interest are usually contained in tables. To increase efficiency, data scientists developed a state-of-the-art graph neural network algorithm that correctly identifies and extracts key financial variables (e.g., total assets) from the correct table (e.g., the balance sheet) in a company’s annual financial statement (a PDF document). They also transformed a large amount of unstructured public documents from SEDAR into structured data sets, allowing for the automated extraction of information related to Canadian companies. An interactive web application has been built for analysts to visualize and automatically extract variables for many purposes.
A data scientist at Statistics Canada has worked with researchers to dynamically detect emerging causes of death. To this end, applications were developed to automatically extract information from narratives in the Canadian Coroner and Medical Examiner Database (CCMED). This database contains a narrative variable that gives information on the cause of death; the narratives vary in detail and length and may contain short forms, abbreviations, and spelling mistakes. This work uses latent Dirichlet allocation (LDA), a natural language processing method, to detect topics and changes in patterns of death over time. For example, this method could have detected opioids as an emergent topic starting in 2016.
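LDA itself can be sketched with a compact collapsed Gibbs sampler. The toy corpus of token ids below stands in for tokenized coroner narratives, and the hyperparameters are illustrative defaults rather than tuned values.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, n_iter=100,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for latent Dirichlet allocation.
    docs: list of token-id lists. Returns the topic-word count matrix;
    normalizing its rows gives the per-topic word distributions."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(n_topics, size=len(d)) for d in docs]  # topic per token
    ndk = np.zeros((len(docs), n_topics))   # document-topic counts
    nkw = np.zeros((n_topics, vocab_size))  # topic-word counts
    nk = np.zeros(n_topics)                 # tokens assigned to each topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic from the
                # conditional distribution given all other assignments.
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return nkw

# Toy corpus: documents 1-2 use tokens {0, 1}, documents 3-4 use {2, 3}
docs = [[0, 1, 0, 1], [1, 0, 1, 0], [2, 3, 2, 3], [3, 2, 3, 2]]
topic_word = lda_gibbs(docs, n_topics=2, vocab_size=4)
```

Tracking how the inferred topic proportions shift across time windows of narratives is what allows an emergent topic, such as opioid-related deaths, to surface.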
The United Nations Economic Commission for Europe (UNECE) has organized a number of working groups focused on data science and machine learning. For instance, the UNECE (2021) paper summarizes the results of two international initiatives: the UNECE High-Level Group on Modernisation of Official Statistics (HLG-MOS) Machine Learning Project (2019–2020) and the United Kingdom’s Office for National Statistics (ONS)–UNECE Machine Learning Group (2021). More information about these groups can be found at https://statswiki.unece.org/display/ML.
Currently, the agency’s objective is to increase the use of data science within statistical production. Building a strong research and development capacity in data science is critical to leveraging the agency’s vast data resources and expertise in mathematical methods for the production of official statistics.
Ethical use of machine learning is of primary importance. This starts with the input data and continues throughout the data life cycle. During all steps, privacy and confidentiality principles are central. To this end, a Necessity and Proportionality Framework was developed in concert with the Office of the Privacy Commissioner. The framework is used by statistical programs when they acquire new data to ensure that decisions balance the societal need for official statistics (necessity principle) with the need to reduce the response burden on Canadians, all while protecting the privacy of their information (proportionality principle). Further, data management governance was strengthened, and an external ethics advisory committee was set up to complement the guiding role played by the Canadian Statistical Advisory Council.

The outputs of a machine learning process should not cause unfair bias or discrimination; they should prevent harm to vulnerable populations, minimize privacy risks, and create value for Canadians. For this purpose, a Framework for Responsible Machine Learning Processes was developed in 2019. This framework ensures that before projects transition into production, they are assessed along four principal axes: respect for people, respect for data, sound application, and sound methods. The objective is to develop applications that follow the standards created by the Treasury Board Secretariat’s Directive on Automated Decision-Making. Furthermore, when a project moves into production, a quality assurance process is designed with methodological support. This process continuously assesses the relevance of a machine learning model and identifies when the model needs to be refreshed.
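One simple way to operationalize such continuous assessment, sketched here as an assumption rather than the agency's stated method, is to monitor distributional drift between a model's training-time scores and its production scores, for example with the population stability index (PSI).

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Population stability index between a model's score distribution
    at training time ('expected') and in production ('actual'). Values
    above roughly 0.2 are a common rule-of-thumb drift alarm.
    (Illustrative monitoring choice, not the agency's stated metric.)"""
    # Decile edges computed on the baseline distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]

    def fractions(values):
        # Bin membership via the baseline edges; clip away zero bins
        counts = np.bincount(np.searchsorted(edges, values), minlength=bins)
        return np.clip(counts / len(values), 1e-6, None)

    e, a = fractions(expected), fractions(actual)
    return float(np.sum((a - e) * np.log(a / e)))
```

When the index crosses an agreed threshold, the model would be flagged for retraining or methodological review, which matches the refresh process described above.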
Investments in digital infrastructure, statistical infrastructure, and data science are all key components that allow organizations like Statistics Canada to further strengthen their data stewardship role. Access to high-quality data, metadata, and learning data is a determining factor for the success of data science initiatives. In addition, data management considerations must be accompanied by considerations for code and model management. Best practices from DevOps, DataOps, and MLOps are being introduced and used by data scientists. A significant portion of data science projects are conducted in a scalable cloud environment, which is part of Statistics Canada’s Data Analytics as a Service (DAaaS) platform. The DAaaS workbench integrates tools for continuous integration and continuous delivery (CI/CD), which allow for scalable and reproducible data pipelines, as well as advanced data and model management capabilities. Knowledge and expertise are shared via communities that exchange best practices and information on projects, both internally and with partners across the government.
In September 2020, Statistics Canada launched the Data Science Network for the Federal Public Service to provide a platform that propels forward the exploration and the application of data science to real business problems within the Federal Public Service, in partnership with other federal departments. This platform allows data scientists to share best practices and leading-edge methods, and aims to develop large-scale partnerships between departments and with academia.
National statistical offices need to adapt to the new digital reality and embrace nonsurvey data sources along with data science methods to better serve society. Statistics Canada is making this shift, using methods and tools for data analytics and building data pipelines to extract insights from data. As the many examples in this article show, this can be achieved by quickly classifying large amounts of scanner data, identifying the start of building construction or greenhouses, detecting crop types using satellite imagery, or developing models to help inform COVID-19 decisions. Data science can also be used to automate the extraction of economic information from thousands of reports, allowing analysts to concentrate on their main task of understanding the economy. Finally, using data science to provide researchers with new tools to conduct their studies is opening the door to new findings.
The agency has put in place a number of building blocks to make this shift happen effectively: implementing a hub-and-spoke model across the organization, developing the vision, the relevant research areas, and the required expertise, and ensuring that production can be sustained. The Framework for Responsible Machine Learning Processes, together with the development of best practices for its data scientists, will ensure that data science is done in a fair, reliable, and ethical manner. Hiring, retention, and larger scale training in data science are all challenges that are now being addressed and will help further enable this shift to data science.
This article shows that, as a trusted leader in data analytics, the agency is adopting new methods and tools such as data science to benefit Canadians and support better decisions through data, along with algorithmic accountability and ethical and responsible use of methods, while continuing to pursue valid inference, fairness, and sound standards and practices.
This paper and the related data science strategy would not have been possible without the strategic direction and modernization agenda set out by Anil Arora, Chief Statistician of Canada.
We also thank the screening editor and reviewers for their insightful comments that helped increase the paper’s relevance for the readership of Harvard Data Science Review.
Sevgui Erman, Eric Rancourt, Yanick Beaucage, and Andre Loranger have no financial or non-financial disclosures to share for this article.
Arora, A. (2018). Modernizing the national statistical system–Stakeholder consultations. Statistics Canada.
Beaumont, J.-F. (2020). Are probability surveys bound to disappear for the production of official statistics? Survey Methodology, 46(1), 1–28. https://www150.statcan.gc.ca/n1/pub/12-001-x/2020001/article/00001-eng.pdf
Biemer, P. P. (2016). Total survey error paradigm: Theory and practice. In C. Wolf, D. Joye, T. W. Smith, & F. Yang-Chih (Eds.), The SAGE Handbook of Survey Methodology. SAGE.
Biemer, P. P., Groves, R. M., Lyberg, L. E., Mathiowetz, N. L., & Sudman, S. (Eds.). (1991). Measurement errors in surveys. Wiley.
Binder, D. A. (1983). On the variance of asymptotically normal estimators from complex surveys. International Statistical Review, 51(3), 279–292. https://doi.org/10.2307/1402588
Bosa, K., & Chu, K. (2020). Ottawa COVID-19 hospital occupancy forecasts–Final report. Report shared with Health Canada collaborators.
Citro, C. F. (2014). From multiple modes for surveys to multiple data sources for estimates. Survey Methodology, 40(2), 137–161. https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2014002/article/14128-eng.pdf?st=gbWriPFq
Denis, N., El-Hajj, A., Drummond, B., Abiza, Y., & Gopaluni, K. C. (2022). Learning COVID-19 mitigation strategies using reinforcement learning. In V. K. Murty & J. Wu (Eds.), Mathematics of public health (Vol. 85, pp. 251–271). Springer. https://doi.org/10.1007/978-3-030-85053-1_12
Fellegi, I. P., & Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71(353), 17–35. https://doi.org/10.2307/2285726
Flaxman, S., Mishra, S., Gandy, A., Unwin, H. J. T., Mellan, T. A., Coupland, H., Whittaker, C., Zhu, H., Berah, T., Eaton, J. W., Monod, M., Imperial College COVID-19 Response Team, Ghani, A. C., Donnelly, C. A., Riley, S., Vollmer, M. A. C., Ferguson, N. M., Okell, L. C., & Bhatt, S. (2020). Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe. Nature, 584, 257–261. https://doi.org/10.1038/s41586-020-2405-7
Government of Canada. (2019). Report to the Clerk of the Privy Council: A data strategy roadmap for the federal public service. https://www.canada.ca/en/privy-council/corporate/clerk/publications/data-strategy.html
Lee, H., Rancourt, E., & Särndal, C.-E. (2002). Variance estimation from survey data under single value imputation. In R. M. Groves, D. A. Dillman, J. L. Eltinge, & R. J. A. Little (Eds.), Survey nonresponse (pp. 315–328). Wiley.
Lohr, S. L. (2021). Multiple-frame surveys for a multiple-data-source world. Survey Methodology, 47(2), 229–263. https://www150.statcan.gc.ca/pub/12-001-x/2021002/article/00008-eng.pdf
Rancourt, E. (2018). Admin-first as a statistical paradigm for Canadian official statistics: Meaning, challenges and opportunities. In Proceedings of Statistics Canada Symposium 2018 (Session 3A, November 7, 2018, Ottawa, Canada). https://www.statcan.gc.ca/eng/conferences/symposium2018/program/03a2_rancourt-eng.pdf
Rao, J. N. K., & Molina, I. (2015). Small area estimation. Wiley Series in Survey Methodology. Wiley.
Ries, E. (2011). The lean startup. Crown Business.
Robinson, B. (2021, June 9). ‘Hub-and-spoke’: The new office model of the future, expert says. Forbes. https://www.forbes.com/sites/bryanrobinson/2021/06/09/hub-and-spoke-the-new-office-model-of-the-future-expert-says/?sh=75f90aad2732
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, & A. F. Frangi (Eds.), Lecture notes in computer science: Vol. 9351. Medical Image Computing and Computer-Assisted Intervention (pp. 234–241). Springer. https://doi.org/10.1007/978-3-319-24574-4_28
Statistics Act, R.S.C., 1985, c. S-19. Government of Canada, Department of Justice (justice.gc.ca).
Statistics Canada. (2000). Policy on informing users of data quality and methodology. Statistics Canada. https://statcan.gc.ca/en/about/policy/info-user
Statistics Canada. (2019). Statistics Canada quality guidelines–Guidelines for ensuring data quality. Statistics Canada.
Statistics Canada. (2021). Framework for Responsible Machine Learning Processes at Statistics Canada. Statistics Canada. https://www150.statcan.gc.ca/n1/pub/89-20-0006/892000062021001-eng.htm
United Nations Economic Commission for Europe. (2021). Machine learning for official statistics. https://unece.org/sites/default/files/2022-01/ECECESSTAT20216.pdf
Yung, W., Karkimaa, J., Scannapieco, M., Barcarolli, G., Zardetto, D., Alejandro Ruiz Sanchez, J., Braaksma, B., Buelens, B., & Burger, J. (2018). The use of machine learning in official statistics. United Nations Economic Commission for Europe, Machine Learning Team.
Zinsmeister, S., Yeung, A., & Garrett, R. (2019). AI-driven analytics–How artificial intelligence is creating a new era of analytics for everyone. O'Reilly Media. https://www.oreilly.com/library/view/ai-driven-analytics/9781492055785/
©2022 Sevgui Erman, Eric Rancourt, Yanick Beaucage, and Andre Loranger. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.