Stats NZ established the Integrated Data Infrastructure (IDI) to enable longitudinal research into the causes and correlates of social outcomes for New Zealanders. The IDI was established in 2011, building on earlier experience linking employer and employee data (which itself subsequently formed the basis of another linked database known as the Longitudinal Business Database [LBD]). Data in the IDI form an ‘ever-resident’ Aotearoa New Zealand population of around nine million people. Access to the IDI is available to bona fide researchers so long as they meet established criteria to show that they will keep the data safe and work with it in a culturally appropriate way. This article gives a brief history of the development of the IDI (and the LBD) and shares some lessons learned along the way for the benefit of others starting down this integrated data journey.
Keywords: integrated data, data science, longitudinal research
Globally, interest in how to better use data to inform policy and reform and foster innovations in public services has grown significantly over recent years. Globalization, digitalization, demographic and climate changes are transforming the way economies work, providing new opportunities for growth, but at the same time creating the risk of deeper inequalities if the gains from growth are not evenly shared. Public policy needs to keep pace with these rapidly changing social and economic opportunities and risks. There is an increasing awareness of the significant potential for data and technology to help inform more adaptive public policy, and Aotearoa New Zealand has begun harnessing this potential using integrated data.
With the implementation of its Social Investment (first introduced with the Investment Approach to Welfare in 2010/11) and Wellbeing approaches,1 Aotearoa New Zealand has been well ahead of this global curve. These data-reliant approaches to public policy design have helped to demonstrate the power of linked data, evidence, and science to create richer insights and support better decision-making. Even before the introduction of Social Investment, government agencies were experimenting with using linked data to understand the effectiveness of government interventions across agency boundaries. For example, the Ministry for Social Development had been using linked data to observe a broad set of social outcomes for people leaving social welfare benefits for many years.
Stats NZ has been at the forefront of some of these national efforts, particularly the use of integrated administrative data for research purposes. Administrative data is the data collected by government and private agencies while conducting their business or providing a service, such as information from academic enrollment forms, marriage and death registries, tax forms, and hospital admissions. As Aotearoa New Zealand’s national statistical office, Stats NZ collects and holds a tremendous amount of data relating to Aotearoa New Zealand’s people, its economy, environment, and society. Some of these data are used to produce official statistics that help government, the academic community, businesses, the indigenous Māori population, and the broader population to make good decisions. For the last quarter of a century Stats NZ has also been on a journey to leverage wider benefit from the data underpinning this official statistics system, with a focus on data integration.
As the brief history that follows will show, Stats NZ first began to integrate data in earnest in 1997. Several advances followed, but integration capability reached a new level with the development of a prototype Integrated Data Infrastructure (IDI) in 2011. Through a series of iterations, what started as ‘a working prototype’ has now become ‘a prototype working’ and has grown to become a critical part of government operations in Aotearoa New Zealand.
In this article, we outline some of the origins of Aotearoa New Zealand’s integrated data journey, overview the infrastructure now in place, describe some of the innovative ways this infrastructure is being used to shape policy and inform government operations, and share some lessons we have learned along the way. We hope that these experiences will be useful for other jurisdictions, but note that Aotearoa New Zealand has unique characteristics that may not be transferable to other contexts. It is a country of around five million people, with a unicameral parliamentary system and two levels of government (national and local). Most services are either administered at the national level, or systems have been put in place to capture national-level data. There will be complexities in other jurisdictions, particularly those with federated government structures where administrative data is governed at the subnational level. The applicability of our approach to other jurisdictions is discussed further in the Discussion section.
Aotearoa New Zealand had many advantages that helped to accelerate our progress with integrated data. From the mid-2000s there was a strong direction from government for more data-informed policy advice. The then Minister of Finance and future Prime Minister, Hon. Bill English, had a particularly strong focus on using information to better understand the needs of vulnerable populations, adjusting services to address those needs, and using quantitative insights to measure the effectiveness of those services. These insights were then used to inform the next set of investment decisions. This policy approach became formally known as Social Investment during the term of that government. This approach and the report on evidence-based policy to deliver Better Public Services2 required improved capability across government to share and use existing data sets.
Modeling the Performance of the Benefit System
From 2014 onward, as part of the Social Investment approach, the Ministry of Social Development (MSD) was routinely publishing reports on the performance of the benefit system. These reports took an actuarial approach to measure the forward liability associated with the welfare system.
The government of the day used this liability measurement to focus appropriations on services that would help people become independent of the welfare system. Early analyses showed that policy and operational changes had a positive impact on reducing the number of people who were dependent on welfare. For example, the 2013 benefit performance report identified that changes requiring reapplication for unemployment benefit every 52 weeks and placing work obligations on particular benefit categories led to a shorter average time spent on benefit (by around 1.5 weeks; Raubal & Judd, 2014).
Early valuation reports identified that linking to data from other agencies (e.g., education, care and protection, and youth justice) would help to better understand early benefit entrants’ vulnerability to long-term welfare dependency, on which liability calculations are based, and to segment the client base so that appropriate services could be better targeted.
Later valuation reports used the Integrated Data Infrastructure to understand longer-term nonbenefit outcomes to feed into monitoring efforts. For example, until the IDI was used for this purpose, little was known about the reasons for long-term declines in single-parent payments because people are not obligated to keep MSD informed of their status once they have left the benefit system. Research identified that a much higher proportion of people exiting these payments do so because they have partnered up, compared to other benefit categories (19% compared to about 1% for job seekers). The proportion of those on single-parent payments who exited due to finding employment also increased in 2013/14 relative to 2010/11 (43% cf. 37%). This was likely related to the strengthening of work obligations as part of welfare reform around that time (Raubal & Judd, 2018).
This example illustrates the importance of integrated data as a tool for evaluating government policy.
These early analytics-driven approaches to policy development led to the establishment of the Social Investment Agency (SIA, now the Social Wellbeing Agency, or SWA) to take a functional lead in this work for government. A growing community of data scientists and other quantitative analysts supported the provision of that advice and, most critically, this approach was assisted by an underpinning integrated social data infrastructure, the IDI. Analysts were able to relatively quickly leverage the IDI to demonstrate the value of linked administrative and, more recently, survey data to create evidence-based advice for senior decision-makers. The concentrated effort to make better use of data created a more visible group of data scientists across government. These data scientists and analysts operated both within and outside of government. The academic community increasingly worked with and for government departments to get value from the integrated data.
Social Investment introduced a different approach to decision-making that put greater use of integrated data at the heart of policy and service design and measurement. Integrated data at the individual level provides markers of changes in people’s lives. Understanding the impact of government decisions on individuals and families creates feedback loops between government investment, service delivery, and changes in people’s lives in a way that was not possible before integrated data and the IDI.
Many of the wellbeing outcomes that social interventions target are difficult to measure and have traditionally been considered only through qualitative evidence or limited proxy measures. Improving the evidence base on the broader wellbeing impacts of social interventions involves using new analytical techniques and making the best possible use of existing data. Early on, the SIA wanted to test the feasibility of using both existing techniques and creating new (reusable) ones in the IDI for social investment and wellbeing analysis.
As a proof of concept, one of the objectives of the test case was to establish what can be done in the IDI using social survey data. The first proof-of-concept chosen was an analysis of New Zealanders’ experiences of government-subsidized housing. Many lessons were learned. Survey and administrative data are complements, not substitutes, in the IDI. Survey data can capture concepts that are unlikely to ever be reflected in administrative records (e.g., the temperature of a house) and can provide measures of outcomes (e.g., life satisfaction) rather than merely reflecting the services delivered (e.g., whether a person was in government-subsidized housing). Both features make survey data especially valuable in evaluating the effectiveness of social policy interventions—where it is difficult to obtain good outcome measures from service usage.
The New Zealand General Social Survey (NZGSS) is particularly valuable because most of the data it obtains has no direct equivalent in the IDI. The NZGSS focuses heavily on social outcomes that are not reflected in administrative data sets. Information on people’s subjective mental states, social connections, levels of trust, or sense of safety are currently not possible to obtain from administrative data, and in many cases never will be. However, measures of this type are of immense analytical value in evaluating social policy interventions.
Using the Integrated Data Infrastructure (IDI) to Understand the Impact of Providing Government-Subsidized Housing
In 2017, the Social Investment Agency (SIA) crunched the numbers on government-subsidized housing in the IDI to both demonstrate the Social Investment approach and help understand the benefits of government-subsidized housing. While the test case focused specifically on fiscal outcomes (i.e., was it possible to calculate a fiscal return on investment for a given social investment), it was the first step in a broader program of work aiming to understand where most of the costs and benefits of living in a social house accrued across government.
The test case was successful in identifying the change in government spending associated with placement in government-subsidized housing for different government agencies; however, the results were not meaningful in terms of evaluating the effectiveness of social interventions. For example, being placed in a social house was associated, on average, with higher future education spending on the part of government. However, whether this represents a good outcome depends on what is driving the expenditure. If the higher expenditure reflected better school attendance from children placed in government-subsidized housing, this would clearly be a good outcome; but if it reflected increased need from children whose social networks had been disrupted by placement in government-subsidized housing, then the higher spending could reflect poor outcomes. Analyses conducted in the aggregate can obscure individual effects and the effectiveness of public services.
The test case recommended further work to go beyond the fiscal impacts to understand the impact on people’s wellbeing, acknowledging that this is fundamental to assessing the effectiveness of government-subsidized housing interventions. The second test case applied the SIA’s wellbeing measurement approach3 to cross-sectional data from the New Zealand General Social Survey (NZGSS) linked with longitudinal Housing NZ administrative data in the IDI. The aim was to test whether it was possible to identify the impact of being placed in a social house on a recipient’s housing outcomes, other important wellbeing outcomes (e.g., health and education), as well as overall wellbeing.
The primary purpose of the test case was to apply the SIA’s wellbeing measurement approach to a real policy issue, and while there were a number of limitations to the analysis, the findings included some interesting insights. Housing conditions generally improved for people placed into government-subsidized housing, and although people’s overall wellbeing improved when they moved into government-subsidized housing, perceived safety deteriorated. This effect persisted: people living in government-subsidized housing had lower perceived safety than people in any other housing circumstances. There was no statistically significant change in the other wellbeing outcomes.
To date, our integrated data program has focused mainly on its use as a research tool to understand complex social and economic issues in more depth, to inform policy, to help with the targeting of resources, and for impact evaluation. There are also emerging use cases where the statistical outputs from the IDI have been used to directly allocate resources and to respond to emergency events. As such, it has become an integral component in the decision-making processes of central government.
There are countless examples of innovative uses of integrated data, but three illustrations, outlined in the following breakout boxes, are representative of recent trends in use.
COVID-19 Complex Contagion Model (e.g., Hendy et al., 2021)
At the beginning of the COVID-19 pandemic, researchers from the University of Auckland undertook a research project using the Integrated Data Infrastructure (IDI) where they built an individual-based network that is representative of the NZ population, including schools, workplaces, family structure, community groups, movements between and within cities, and so on. This network model was used as the basis to run detailed complex contagion models including spatial and occupational effects for the COVID-19 pandemic under different scenarios.
This project regularly submitted outputs from the Data Lab directly to the National Crisis Management Centre. The models garnered a high level of media interest, with one of the project leads, Prof. Shaun Hendy, regularly speaking to media about the model’s findings. The detailed population data in the IDI was key to this project, as it enabled the researchers to parameterize the synthetic networks in combination with what they knew about the topological properties of real-world networks.
Modeling Social Outcomes (Morris & Sullivan, 2015)
The Ministry of Social Development (MSD) uses the IDI to build and develop the Social Outcomes Model. This is a detailed model that gives a broad view of what happens to people today and what might happen to people in the future, which helps MSD understand people’s current needs and what their needs may be in the future so they can support them with better services. This model looks at income, housing, health, and wellbeing.
Outputs from this project have been feeding into policy development for MSD that ensures the right support is reaching the right people and that funding is distributed effectively across support services. The IDI is the only tool that makes this project possible, as it requires linked data for a broad range of factors that covers the whole population, and the IDI is the only place where this is available.
Using Integrated Data to Better Target School-Based Equity Funding (Ministry of Education, 2019)
The negative effects of socioeconomic disadvantage on educational achievement are well-documented. Most Western countries, including Aotearoa New Zealand, fund schools with higher concentrations of disadvantaged families at a higher rate to mitigate the impact of this disadvantage.
In Aotearoa New Zealand, this has been done by looking at the socioeconomic characteristics of the neighborhoods where children live, aggregating this to the school level, and then ranking schools according to their decile of disadvantage. Different funding rates are applied depending on the schools’ decile ranking. This method works reasonably well but there are some distortions within the system. For example, because there is a high degree of school choice in Aotearoa New Zealand, schools that attract the wealthiest children from the poorest neighborhoods have artificially inflated disadvantage scores.
To improve this system, the Ministry of Education developed a student-focused statistical model based on IDI data to identify relationships between individual measures of disadvantage and a measure of educational success. The model found that previous decile-based calculations were a blunt instrument for identifying the level of disadvantage in a school. There are very large variations in the proportion of disadvantaged students within a given school decile ranking. The student-based measure was found to be a much more accurate way of targeting funding based on the actual level of disadvantage within a given school. The Aotearoa New Zealand government invested an initial tranche of funding to transition to this new funding model in the 2021 national budget.
These are but some of the many current uses of the IDI, which continue to grow year-on-year. There are currently 350 active projects in the IDI, with 815 researchers attached to these projects. Since current record-keeping began in 2016, 1,770 researchers have gone through the confidentiality training required before access to the Data Lab—a secure virtual environment providing access to IDI data—is approved. The ‘working prototype’ really has become ‘the prototype working.’
Stats NZ has two integrated databases containing de-identified longitudinal microdata. The IDI contains data about people, while the Longitudinal Business Database (LBD) contains data about businesses.
The IDI is built around a central linking ‘spine’ that all other data sets are linked to. The concept underpinning the spine is to get the largest population coverage with the fewest data sources. This is important because the spine is essentially rebuilt whenever the IDI is refreshed (currently three times per year). Introducing more complexity into this initial linking process tends to increase the variability in linking outcomes from one refresh to the next.
Each ‘refreshed’ IDI is a static database that never gets changed except in very rare and special circumstances (e.g., the team later decides to remove a variable for privacy reasons, or someone notices a significant error that is crucial to fix). Each refreshed IDI database is saved and archived, so if a researcher wants to access a specific version (say, to test reproducibility of results) they will be able to access the same database that was originally used for the study and analysis they are interested in. In this way we can guarantee reproducibility of research using the IDI.
The IDI spine is created by linking births, tax records, and migration data to create an ‘ever-resident’ population.4 Once the spine has been created, the remaining data sources are linked to the spine on a one-to-one basis. The remaining data sources are not usually linked with one another to avoid an overly complicated tangle of links that are not required for most analytical purposes. The data sources that are linked to the spine include information about:
Health (e.g., universal health screening prior to school entry, hospital discharge, immunization, public health enrollments)
Social welfare and social services (e.g., receipt of social welfare payments, injury claims, experiences within the child protection system)
Education and training (e.g., early learning participation, school enrollment and achievement, tertiary education outcomes)
Receipt of government-subsidized housing
Income and employment
Justice (e.g., victimization, legal prosecutions, sentencing)
People and communities (e.g., driver licensing, labor force, and social surveys conducted by Stats NZ) and,
Population data (e.g., previous Censuses of Population and Dwellings, visa applications, births, deaths, and marriages).
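The hub-and-spoke design behind this list can be illustrated with a minimal sketch. All field names and records below are hypothetical, and real linkage to the spine uses probabilistic matching or unique identifiers rather than the exact matching shown; the point is simply that every data set carries the spine’s person identifier, and data sets are joined to each other only through that identifier, never directly.

```python
# Hypothetical spine: one row per ever-resident person, each with a uid.
spine = [
    {"uid": 1, "name": "ana", "dob": "1990-01-01"},
    {"uid": 2, "name": "ben", "dob": "1985-06-15"},
]

# Two illustrative source data sets (fields invented for this sketch).
health = [{"name": "ana", "dob": "1990-01-01", "immunised": True}]
education = [{"name": "ben", "dob": "1985-06-15", "school": "X"}]

def link_to_spine(records, spine):
    """Attach the spine uid to each record, one-to-one, by exact match
    on shared identifying fields (a simplification of real linkage)."""
    index = {(p["name"], p["dob"]): p["uid"] for p in spine}
    return [
        {**r, "uid": index[(r["name"], r["dob"])]}
        for r in records
        if (r["name"], r["dob"]) in index
    ]

linked_health = link_to_spine(health, spine)
linked_education = link_to_spine(education, spine)
# Analysts then join any two data sets only via the shared uid column,
# avoiding a tangle of pairwise links between source data sets.
```

Because each data set is linked to the spine independently, a refresh can rebuild the whole structure from the spine outward without re-deriving pairwise links between every combination of sources.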
While the analysts who create the spine require access to identifying information to create the linkages, all data sets are stripped of information that could be used to identify individual people before being accessed by researchers. Stats NZ has strict protocols in place to check data before it is released from the IDI, to ensure that there is no spontaneous recognition of people in statistical outputs. These safeguards are outlined in more detail below.
The Longitudinal Business Frame is a longitudinal representation of the Business Register maintained by Stats NZ. The Business Register is the primary frame used to identify businesses sampled for our business activity surveys each year. The Longitudinal Business Frame forms the backbone of the LBD and captures longitudinal information on all economically significant firms operating in Aotearoa New Zealand since 1999.
The information held within the LBD includes location, industry, business type, institutional sector, and parent–subsidiary relationships. Linked to this backbone is a range of additional data sets based around agriculture (NZ’s largest economic sector), international trade, business practices, financials, and information on business innovation (e.g., research and development surveys and government assistance provided to stimulate innovation). The LBD is currently refreshed annually.
The IDI and the LBD are themselves linked via employee tax information. This creates extremely rich potential for understanding the longitudinal relationship between an individual’s socioeconomic circumstances, their journey through employment and work, and the resulting economic activity at the business level. For example, linkage at this level has enabled researchers to examine drivers of the gender wage gap (Sin et al., 2017), labor market impacts of technology change (Allan & Sanderson, 2021), whether business success is passed on to employees (Allan & Mare, 2021), and the impact of employment ‘trial periods’ on hiring patterns and stability of employment relationships (Chappell & Sin, 2016).
The integrated data program is maintained by a central unit of approximately 40 staff within Stats NZ. The unit has the following functions:
Data brokering to support integrated data products.
Production, responsible for the addition of new and updated data to the IDI and LBD, rapid data loads (out-of-cycle data inclusions), regular database updates, and maintenance of metadata.
Service development, responsible for advisory and operational policy for integrated data services. This function liaises with the executive layer of Stats NZ and with external parties working on service performance, supporting strategic change in the organization, new and enhanced customer offerings, engagement with the user community, and secretariat functions for these engagements.
Data labs, responsible for operating services that provide specialist access to data sources, including the IDI and LBD, facilitation of research inquiries, as well as the checks on data leaving the service (i.e., output-checking).
Stats NZ uses two types of linking to create the IDI and the LBD. The IDI spine is created using probabilistic linkage, by linking tax data to births data, births to visa data, and visa to tax data based on identifying variables common to these data sets (date of birth, first and last name, sex, and address). Where there are unique identifiers available to link additional data sets into the IDI, these unique identifiers are linked directly.
Probabilistic linking means that we match two records based on a probability that they are indeed the same person. Our integrated data are primarily used for research purposes and are not used to make decisions about individual people, which affords us a higher tolerance for imperfect matching. In the March 2021 IDI refresh, linkage rates with central government agencies ranged from around 80% up to a high of 98.3%. Linkage rates tend to be much higher where data quality is essential for the operation of the government services upstream from the IDI (e.g., the 98.3% linkage rate was for driver licensing and motor vehicle registration data). We use a reasonably conservative matching process and false positive rates (where two people are incorrectly identified as being the same person) are lower than 2% for almost every data set linked into the IDI.
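A stripped-down sketch in the Fellegi–Sunter style may make this concrete. The field weights and threshold below are invented for illustration (in practice such weights are derived from estimated match and non-match probabilities per field); setting the threshold conservatively is what keeps false-positive rates low, at the cost of leaving some true matches unlinked.

```python
# Illustrative probabilistic record linkage (not Stats NZ's actual
# implementation): score each candidate pair by agreement on identifying
# fields, and link only pairs above a conservative threshold.

FIELDS = ["first_name", "last_name", "dob", "sex"]

# Hypothetical agreement rewards and disagreement penalties per field;
# rarer, more discriminating fields (e.g., dob) carry more weight.
WEIGHTS = {"first_name": 4.0, "last_name": 5.0, "dob": 7.0, "sex": 1.0}
PENALTY = {"first_name": -2.0, "last_name": -3.0, "dob": -6.0, "sex": -1.0}
THRESHOLD = 10.0  # set high to keep false positives low

def link_score(a, b):
    """Sum of agreement weights and disagreement penalties over fields."""
    return sum(WEIGHTS[f] if a[f] == b[f] else PENALTY[f] for f in FIELDS)

def is_match(a, b):
    """Declare a link only when the score clears the threshold."""
    return link_score(a, b) >= THRESHOLD

rec_tax = {"first_name": "ana", "last_name": "smith",
           "dob": "1990-01-01", "sex": "F"}
rec_birth = {"first_name": "ana", "last_name": "smith",
             "dob": "1990-01-01", "sex": "F"}
rec_other = {"first_name": "ann", "last_name": "smith",
             "dob": "1992-03-04", "sex": "F"}
```

Here `is_match(rec_tax, rec_birth)` links the agreeing pair, while `rec_other` falls below the threshold despite a matching surname and sex, mirroring the conservative behavior described above.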
Another way of assessing the quality of the data within the IDI is to construct an estimated resident population from administrative IDI records (an ‘IDI-ERP’) and compare this to the estimated resident population derived from the five-yearly Census of Population and Dwellings, in which we attempt to enumerate every person in the country through household survey collection. The methods for selecting the IDI-ERP have been iteratively improved, with the most recent adjustments implemented in the use of the IDI-ERP in the 2018 Census of Population and Dwellings. Comparisons against an interim 2018 benchmark population showed that the IDI-ERP distributions tracked closely to the benchmark for distributions by age and sex nationally and for ethnic groups and territorial authorities (Stats NZ, 2019a; 2019b). While the analysis is yet to be finalized, initial estimates suggest that the total IDI-ERP is likely to be around 1% lower than the official ERP in 2018, suggesting that the probabilistic linkage methods underpinning the IDI are robust.
The success of our integrated data program is entirely dependent on the safe and ethical use of the data it contains. Stats NZ has two frameworks to guide decision-making in this area. The first is known as the Five Safes Framework.5 All applicants and their proposed research must meet the ‘Five Safes’ conditions before access to the IDI is granted:
Safe people—Researchers are vetted and must commit to use data safely before they can access the data. They must pass referee checks, attend confidentiality training, sign a contract where they agree to follow our rules and protocols, have capability to use the data, and sign a declaration of secrecy under the Statistics Act 1975.6 The declaration is a lifetime agreement to keep data confidential. Researchers who break our protocols can be banned, blacklisted, or prosecuted.
Safe projects—To gain access to integrated data, researchers must have a project they can demonstrate is in the public interest. Research projects must focus on finding insights and solutions to issues that are likely to have a wide public benefit. The IDI and LBD cannot be used for individual case management, such as making decisions about a specific person or family.
Safe settings—A range of privacy and security arrangements are in place to keep data physically safe. Data can only be accessed through a secure virtual environment known as a Data Lab, and only in research facilities approved by Stats NZ. A variety of additional security layers also protect the information, including locating the IDI and LBD on a separate server that is not connected to the internet, computers that are not connected to a network, and a lack of USB ports or printing facilities so users cannot take information in or out of the Data Lab without it being checked first by Stats NZ staff.7
Safe data—People’s identity is protected. Data held within our integrated data environment have had identifying information removed, and researchers only get access to the data they need. For example, a researcher granted access to the IDI will only have access to the specific data sets they need for their research project; they cannot see all information in the IDI. Data that is available to researchers is de-identified. Numbers that can be used to identify people are encrypted.
Safe output—All information is checked to ensure it does not contain any identifying results. Researchers must confidentialize their results. We then check all outputs before they can be released from the Data Lab, to ensure information is grouped in a way that makes it impossible to identify individuals. Results that could potentially identify individuals are not released under any circumstances. We provide guidance to give researchers the methods and rules they need to confidentialize their results.
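One widely used confidentialization rule for counts is random rounding to base 3 (RR3), which Stats NZ applies to many Data Lab outputs. The sketch below illustrates the idea only; the full output rules also include suppression of small cells and other checks.

```python
import random

def rr3(count, rng=random):
    """Randomly round a count to a multiple of 3 (illustrative RR3).

    A count already divisible by 3 is unchanged; otherwise it is rounded
    to the nearer multiple of 3 with probability 2/3 and to the further
    multiple with probability 1/3, so the expected value is unbiased
    while exact small counts are never published.
    """
    remainder = count % 3
    if remainder == 0:
        return count
    down = count - remainder  # nearest multiple of 3 below count
    if remainder == 1:
        # nearer multiple is `down`; choose it with probability 2/3
        return down if rng.random() < 2 / 3 else down + 3
    # remainder == 2: nearer multiple is `down + 3`
    return down + 3 if rng.random() < 2 / 3 else down
```

For example, a cell count of 10 is published as either 9 or 12, so a reader can never infer that exactly 10 (or exactly 1, after differencing two tables) people fall in a category.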
The Five Safes Framework has been extremely effective in keeping data safe from privacy breaches and in ensuring Stats NZ maintains the social license to continue operating the program. However, we are also acutely aware that people’s expectations around how their data are used are changing. High-profile cases of unethical use of personal information, such as the Cambridge Analytica scandal, have focused people’s attention on the ethical use of their information, not just the need to keep it private.
In Aotearoa New Zealand we have an additional layer of consideration—to ensure our use of data meets the expectations arising from the agreement the government has with the indigenous Māori people, or tangata whenua (people of the land). For many Māori, data are a taonga (an item of significant cultural value that needs to be treasured). Objects that are taonga are afforded specific protections under Te Tiriti o Waitangi, the Treaty signed by the Queen of England and most Māori Rangatira (tribal chiefs) in 1840. Te Tiriti is the founding document of modern Aotearoa New Zealand and sets out certain expectations of the government along with rights and protections for Māori.8
As with other parts of the world impacted by colonization, there is a growing movement toward data sovereignty in Aotearoa New Zealand. Data sovereignty usually refers to an understanding that data are subject to the laws of the place where they are collected and stored. In our context, it is also increasingly taken to mean that data about Māori are subject to Māori governance.
Working with Associate Prof. Māui Hudson (University of Waikato), Stats NZ has developed a data access framework to move us closer to realizing this ambition. The Ngā Tikanga Paihere framework was developed in 2018 to guide appropriate use of microdata in the IDI, focusing on how data about Māori and other underrepresented populations are used for research purposes. The framework draws on 10 tikanga (Māori world concepts), which align with the Five Safes Framework and guide researchers to understand the cultural competencies and practices they will need to consider when conducting research involving priority populations. Ngā Tikanga Paihere encourages researchers to consider the people behind the data and their experiences and context. Researchers are required to plan how they will engage with relevant community groups and what advice they will seek to ensure their research is culturally sensitive, as well as to ensure that important and useful findings from the research reach the relevant community groups in ways that make sense to the community of interest. The tikanga are not described in detail here, but examples include:
Pūkenga | skills and expertise—this creates an expectation that researchers can demonstrate their intention to work in culturally appropriate ways and that they will establish relationships with relevant communities before undertaking research.
Tapu and noa | sacred and ordinary—this tikanga recognizes that some data are of such sensitivity that risks need to be carefully managed when using those data, while other data may be much more readily accessible. These considerations need to be carefully weighed by researchers in their use of data relating to Māori and other populations of interest.
Stats NZ actively uses Ngā Tikanga Paihere to assess whether research projects have the appropriate cultural safeguards in place to conduct research in a way that will be beneficial to Māori and other priority populations (e.g., Pacific Island communities or people with disabilities). Response to the introduction of Ngā Tikanga Paihere has been generally positive. It has highlighted for some research groups the need to grow their cultural capability as well as expand the types of engagement and involvement that community groups have within the research undertaken in the IDI.
Some research projects require adding new data into the IDI for linking to the broader database. For Stats NZ to consider adding data to the IDI, research applicants must also go through our Data Ethics and Privacy Assessment (DEPA). The DEPA is based on the Ngā Tikanga Paihere framework and the Privacy Act 20209 and is important for every data integration project. It helps Stats NZ to assess and mitigate impacts on privacy, ensure that data are treated in responsible and culturally appropriate ways, and assess if a proposed data integration is consistent with relevant legislation and consistent with Stats NZ’s data integration principles. The DEPA form asks for information on the source of the data and collection method(s), the original intended use, wider benefits of the data, legal and privacy obligations, sensitive variables, cultural responsiveness, and relationship with clients.
Approval to access integrated data rests entirely with the government statistician, who is also the chief executive of Stats NZ. Data suppliers may have individual memorandums of understanding with Stats NZ, whereby they can review and provide advice on project applications seeking to use their data, but they do not have the power to veto that use. Project applications to use the IDI must meet strict data safety criteria. These are reviewed and assessed by the integrated data team, who then provide advice to the decision maker.10
At one level, Stats NZ has been on a data integration journey since annual statistical reports were first published by the government printer in the 1850s. The advent of modern computing has, of course, enabled this at an ever-increasing scale. It is hard to imagine now that in the early 1980s, the Department of Statistics was the only government department with its own mainframe. Each floor had three to five terminals, with analysts allotted 30-minute slots to test their programs before submitting them to the batch system to run overnight. Analysts could sort and merge files of up to 300,000 records overnight.
With the widespread adoption of microprocessor computing in the 1990s came an enormous growth in digital production and storage of data. However, while this increased the capability and capacity to link data, doing so remained incredibly difficult. Any data-heavy processing involved manual steps and very long wait times to process files. The current refresh of the IDI and LBD produces around seven billion rows of data, requiring about 650 GB of storage. By modern standards this is not a large amount of data, but it was certainly well beyond the computing capabilities of the late 1990s. While there was some interest at a policy level in how data integration could help to gain systemwide insights, the tools to make this happen were not yet at our disposal.
There was an acute focus on cross-agency data integration in the mid-to-late 1990s in Aotearoa New Zealand, culminating in a Cabinet directive stating that Stats NZ was the right place to carry out this work. Stats NZ is not an operational agency, and the independence of the government statistician provided the neutrality required to undertake data integration safely. Some small-scale data integration happened in the 1990s and into 2000, but it was the introduction of the Linked Employer-Employee Data (LEED) project in 2005 that really marked the starting point for our current integrated data program.
To establish the LEED data, Aotearoa New Zealand looked to international experience (linked employer-employee data sets were already established in parts of Europe and North America). This culminated in the New Zealand Conference on Database Integration and Linked Employer-Employee Data held in 2002. The conference successfully proved that in the Aotearoa New Zealand context:
Integrated data could be built while maintaining confidence in the linkages and data quality.
The value added by this kind of approach was potentially enormous, with some key questions about the economy likely unable to be addressed without integrated data.
With a careful and deliberate approach, confidentiality and privacy issues could be addressed.
Learning from the international experience provided Aotearoa New Zealand with a solid foundation to take our integrated data program forward.11
The LEED project involved linking employee tax data with a database of Aotearoa New Zealand businesses, as well as with an integrated data set on student loans and allowances. These early integrations were for specific purposes, often linking only two or three data sets, with each kept in a separate environment for security purposes. Linking student information to labor market and business data enabled analysts to quantify the impact of government funding for tertiary education on labor market participation and broader economic outcomes.
In the ensuing years, a range of other bespoke linkages were undertaken, requiring an increase in linkage capability and capacity inside Stats NZ. The IDI prototype was established in 2011, which consolidated the previously separate integration projects. The early prototype included economic, education, social welfare, migration, and business data. The IDI prototype increased the flexibility to respond to changes and development in source agency administrative data sets and to update statistical processes and outputs.
As outlined above, this prototype coincided with a significant push at the political level toward evidence-informed policy development and delivery. The government at that time had recently been elected for a second term. In 2012, they launched the Better Public Services program, setting targets for the public sector to achieve to deliver improved results for New Zealanders. The Better Public Services advisory group encouraged an increase in cross-agency cooperation, particularly on complex issues that span agency boundaries.
In 2013, Cabinet agreed that the delivery of the Better Public Services program would benefit from stronger, more coordinated capability across government. This was to have a dual focus on a central analytics and insights function, which was set up within the New Zealand Treasury, and a data-sharing solution to better share and use existing data sets. The IDI was selected as the data-sharing solution and received funding to further expand the data and access service.
Much of the focus for Better Public Services and Social Investment had been on the social services sector (e.g., Social Investment Agency, 2017). The early driver was to identify and evaluate policies that reduced people’s long-term welfare dependency, which the government of the time saw as key to improving the broader set of social outcomes for this group of vulnerable people. The government saw this in mainly fiscal terms—investing to reduce the liability associated with the benefit system and grow the economy.
The broad intent of the Social Investment approach was continued with the change in government in 2017. A new ministerial portfolio was created to provide evidence-based information, tools, guidance, and products to social agencies, with a stronger emphasis on social wellbeing as opposed to purely fiscal drivers. The IDI has become no less relevant in this context and Stats NZ continues to develop our capability to deliver the information to inform this policy direction.
Around this time, Stats NZ was also formally mandated as the functional leader of the government data system. The Public Service Commission, which employs public service chief executives, nominated the government statistician as the government chief data steward (GCDS). This role has enabled Stats NZ to work in partnership with government agencies to set the strategic direction for government’s management of data and unlock the value of data for all New Zealanders. The GCDS role also leads the public sector’s response to new and emerging data issues. This role was strengthened in September 2018 through a Cabinet mandate empowering the GCDS to set mandatory standards and guidelines for the collection, management, and use of data by government agencies and directing agencies to adopt common data capabilities (e.g., data tools, linking infrastructure, or other sharing platforms).
This mandate has the potential to reshape the way agencies manage and share data to further amplify the integration capabilities that have already been established. While this mandate is an important lever for change, it will take many years to realize benefits. Implementing data standards within existing systems often carries a significant cost that seldom meets the threshold for investment. In addition, there are no hard consequences for failing to adopt mandated standards. The quickest wins have been gained by ensuring new systems adopt the mandatory standards, and Stats NZ works across the system to gain visibility on new investments in order to drive greater adoption of standards. Standards have already been mandated for date of birth, person name, and street address, with additional standards under development for gender and sex, ethnicity, tribal affiliation, Māori descent, and Māori business definitions.12
While we view our integrated data program as generally a great success, we have learned many lessons along the way. We offer seven observations for others embarking on this journey.
A. Start with a use case and let it grow.
Many projects in government and elsewhere start as good ideas but fail to deliver the intended value. There are myriad reasons for this, but such projects are often poorly planned, fall victim to scope creep, or become positioned as the solution to all organizational problems. We started with a simple use case: understanding the relationships between education and training and labor market outcomes. Not only did this provide immediate benefit and constrain the scope, it also enabled supplier agencies to build confidence in the linking process by trying things first. As benefits were achieved and none of their worst fears were realized, supplier agencies became more comfortable with putting their data in the IDI.
The other important aspect to this was letting the researchers identify where value could be found, rather than presupposing what data people might want to include. Early use cases in the health sector gave researchers the opportunity to test processes at their end to understand what they were signing up for. Once agencies got their own researchers using the data, the feedback loops closed more readily, and the comfort levels picked up as well. And so, it grew.
B. Think about future uses up front and design the infrastructure accordingly.
Notwithstanding lesson A, allowing the program to grow organically from a series of use cases can create problems downstream. In our own case, we now have a highly utilized system that is still designed as a prototype. Many of the processes have been extended piecemeal as the program has evolved, and we spend a lot of time cleaning and processing data between each refresh. The infrastructure investment has not kept pace with the growth in utilization, and a major reinvestment in the underpinning processes and technology is required. Three key design principles are worth considering: (i) ensure your design is scalable, (ii) take a modular approach where different components can be slotted in and out of your processing pipeline as required, and (iii) look to partner where you can. The last principle is important to ensure the user community has some ownership in the success of the product as well.
C. A clear authorizing environment matters, and a ministerial champion makes it even better.
The political support from the Minister of Finance, Hon. Bill English, helped some government chief executives to be less risk averse about putting their data in the IDI. Without this support we certainly would not have progressed as fast as we did, and we may not have reached our current state of maturity at all. Having a ministerial champion gave agencies an understanding of why cross-agency linkage was important, and it also helped them to form a plan for how and when they could make it happen. Ministers were putting pressure on agencies to go further and faster, which helped to build momentum. The other key driver was the Better Public Services policy and the associated drive on Analysis for Outcomes. This gave us the funding to grow the IDI, and at the same time the New Zealand Treasury was tasked with helping with the analytical capability across government. Having both of these things happening at the same time was very helpful for our work with supplier agencies.
D. Safe integration requires a strong regulatory environment.
Legislation is a critical enabler of successful data integration. The Statistics Act 1975 and the Privacy Act 1993 worked in combination to enable Stats NZ to develop and use integrated data without the need for further legislation. The Statistics Act 1975 enables anonymized data to be used for bona fide research if the government statistician is satisfied that the researcher has the necessary knowledge, experience, and skills to use that data. Likewise, the Privacy Act (updated in 2020) enables the use of information within a framework that protects New Zealanders’ privacy.
Stats NZ has recently undertaken policy work to modernize our legislative framework. There was little or no digitization of personal information when the 1975 legislation was written, and the act does not contain a single reference to the word ‘data.’ The proposed legislation would modernize the language used and address some key inefficiencies and barriers to effective integration. One key proposal would enable the government statistician to require the collection of information from other agencies for statistical purposes, even if that information is not required for the operations of the agency itself. This proposal recognizes the increasing direction successive governments are taking toward more system-focused governance.
E. Be transparent and consistent in how the rules are applied and be prepared to adapt.
The success of any integrated data program will rise and fall on trust. Data suppliers need to trust that data will not be used in a way that could compromise their operations or in contravention of their legislation. The integrated data team needs to trust that researchers are using data in accordance with the access agreements in place. Members of the public need to trust that their data are used safely and ethically by government.
To maintain this trust, we have a very clear set of rules that govern access to and use of the IDI, and we are careful to hold all users to the same standard. However, these were not in place from the outset. Processes have evolved as uses have changed and as trust relationships with researchers have developed. We were prepared to review our systems and processes, which are often the touch points that frustrate customers. For example, we have made changes to the output checking process as well as the project application process to streamline and strengthen them over time.
F. Data quality is a shared responsibility.
Data quality is an ongoing issue for us, but it has improved over time. In our experience, the best way to get agencies interested in improving their data quality is to stimulate them into becoming users of their own data. That said, we still spend most of the time between refresh cycles going back and forth with agencies to clean data in preparation for integration. Expectations around data quality have matured over time, but upstream accountability for data quality should be made clear with data suppliers up front. System assets need system accountability. A key pain point is metadata (data that describe the data in the IDI). This is a key issue for researchers who use the IDI, and its importance cannot be overstated. Metadata will probably never be perfect, but it does need investment and focus from data suppliers and the integrating entity.
G. Maintain an active network of analysts.
Researchers who came into the IDI environment with a deep understanding of the vagaries of data from their own field (e.g., health) quickly realized that data from other fields (e.g., the labor market) were just as idiosyncratic. They realized they needed to share knowledge across disciplines, and those who collaborated more tended to produce the highest quality research.
Stats NZ has worked to create a user community around the IDI. The Data Lab Community Group, composed largely of technical leads, is facilitated by Stats NZ but chaired by an external member. There is an active mailing list, user forums, and an online community where researchers can discuss their research and share code. Research using the IDI is published to ensure that other researchers can see what has been done in the past and share their experiences with the broader IDI community. Stats NZ also helped to establish a Government Analytics Network (GAN) and supported a wider community of interest known as the NZ Data Analytics Forum (which involves private sector interests as well). The GAN has been most effective when there are concrete cross-agency problems to solve, often involving linking data.
Networks have also sprung up within fields; the Virtual Health Information Network13 being a good example of an active and useful self-generated user group.
This article has presented a case study of how integrated data have been developed and used in Aotearoa New Zealand. We have outlined some of the innovative uses we have found for integrated data in our country, the operating model for the program, the journey we have taken to arrive at our current level of maturity in integrating data, and some of the lessons learned along the way.
The program has been successful due to a confluence of circumstances. We had a motivated National Statistical Organisation that had seen the benefits of integrating data for statistical and research purposes. We had a motivated set of central, social, and economic agencies wanting to answer important policy questions that their own data could not answer in isolation. And, perhaps most importantly, we had support from political leaders to invest in the infrastructure and capability required to deliver these services.
We hope our experiences are helpful for other countries or jurisdictions wanting to go down this path. We see few barriers to other countries adopting a similar approach but acknowledge that context will determine how far and fast countries can go. Aotearoa New Zealand is fortunate to have relatively high wealth per capita, low levels of corruption, highly trusted institutions, and a flat government structure. This has created an environment where investment in integrated data can be made and where data can be integrated in a safe and secure way while maintaining the trust and confidence of citizens in the way their information is being used.
Not all countries will have these foundations in place. Developing countries often lack the financial resources to invest in the base infrastructure needed to collect, manage, and use data safely. Data quality is an issue for any integrated data program but can be a particular challenge where infrastructure investment is low or where there are low levels of trust between citizens and government.
Our understanding of what it takes to build and maintain trust in integrated data is continuing to develop, especially as it relates to our indigenous Māori population. Trust in government is much lower among Māori, on average, than the rest of the population (Stats NZ, 2018). This low trust is driven by a multitude of factors relating to our history, as well as continuing inequities in outcomes for Māori that can be directly traced back to colonization two centuries ago. Among other things, the Treaty of Waitangi guarantees Māori absolute sovereignty over treasured items and many Māori see data as one such item. There is a very active discussion in Aotearoa New Zealand about what this means for collecting, managing, and using Māori data, including for integrating across multiple data sets (Kukutai & Taylor, 2016). Time will tell if frameworks such as Ngā Tikanga Paihere help to build and maintain trust for how government agencies use data to shape services in a way that improves outcomes for Māori, or whether a different governance mechanism is required.
Having a flat government structure has been a particular benefit for our integrated data program and can be a challenge for other jurisdictions. Aotearoa New Zealand is a relatively small country of around five million people, with a parliamentary democracy that has its roots in the British Westminster system. All major social services are delivered nationally, with local government accountable for services such as planning, waste management, water, and other utilities. Integration can be more difficult in jurisdictions where accountabilities are split between federal and state governments (e.g., where population data is held nationally, and social service data is held at the state level). This can be especially acute where states are competing for national funding to support delivery of local services. Structuring an integrated data program in these circumstances may require additional policy or legislative structures to give data suppliers confidence that information will not be used to the (real or perceived) detriment of the people they serve.
Notwithstanding these contextual considerations, many of the basic structures, practices, frameworks, behaviors, and processes developed in Aotearoa New Zealand should be directly relevant for all jurisdictions. We trust that these reflections are useful, as our experience has shown us the huge benefit that can be gained from integrating data for policy and service delivery.
Craig Jones, Anna McDowell, Vince Galvin, and Dorothy Adams have no financial or non-financial disclosures to share for this article.
Allan, C., & Mare, D.C. (2021). Do workers share in firm success? Pass-through estimates for New Zealand (CEU Working Paper 21/03). Ministry of Business, Innovation and Employment. https://www.mbie.govt.nz/dmsdocument/17166-do-workers-share-in-firm-success-pass-through-estimates-for-new-zealand
Allan, C., & Sanderson, L. (2021). Labour market impacts of technology change: Evidence from Linked Employer-Employee Data (CEU Working Paper 21/02). Ministry of Business, Innovation and Employment. https://statsnz.contentdm.oclc.org/digital/collection/p20045coll17/id/1186/rec/1
Chappell, N., & Sin, I. (2016). The effect of trial periods in employment on firm hiring behaviour (Working Paper 16/03). New Zealand Treasury. https://www.motu.nz/our-research/population-and-labour/individual-and-group-outcomes/the-effect-of-trial-periods-in-employment-on-firm-hiring-behaviour/
Gibb, S., Bycroft, C., & Matheson-Dunning, N. (2016). Identifying the New Zealand resident population in the Integrated Data Infrastructure (IDI). Stats NZ. http://www.stats.govt.nz/research/identifying-the-new-zealand-resident-population-in-the-integrated-data-infrastructure
Hendy, S., Steyn, N., James, A., Plank, M., Hannah, K., Binny, R., & Lustig, A. (2021). Mathematical modelling to inform New Zealand’s COVID-19 response. Journal of the Royal Society of New Zealand, 51(sup1), S86–S106. https://doi.org/10.1080/03036758.2021.1876111
Kukutai, T., & Taylor, J. (2016). Indigenous data sovereignty: Toward an agenda (CAEPR Research Monograph No. 38). ANU Press.
Lane, J., & Maloney, T. (2002). Overview: The New Zealand Conference on Database Integration and Linked Employer-Employee Data. New Zealand Economic Papers, 36(1), 3–7.
Ministry of Education. (2019). Accounting for educational disadvantage. He Whakaaro: Education Insights. http://www.educationcounts.govt.nz/publications/series/he-whakaaro/he-whakaaro-accounting-for-educational-disadvantage
Morris, M., & Sullivan, C. (2015). The impact of sentencing on offenders’ future labour market outcomes and re-offending—Community work versus fines (Working Paper 15/04). New Zealand Treasury. https://www.treasury.govt.nz/publications/wp/impact-sentencing-offenders-future-labour-market-outcomes-and-re-offending-community-work-versus-fines-html
New Zealand Government. (2011). Better public services advisory group report. https://www.publicservice.govt.nz/assets/Legacy/resources/bps-report-nov2011_0.pdf
New Zealand Institute of Economic Research. (2016). Defining social investment, Kiwi-style (Working Paper 2016/5). https://nzier.org.nz/static/media/filer_public/e8/56/e8566475-e1c7-4a2c-9f0f-bf65710b039b/wp2016-5_defining_social_investment.pdf
Raubal, H., & Judd, E. (2014). Work and income 2013 benefit system performance report for the year ended 30 June 2013. Ministry of Social Development. https://www.msd.govt.nz/documents/about-msd-and-our-work/publications-resources/evaluation/investment-approach/2013-benefit-system-performance-report.pdf
Raubal, H., & Judd, E. (2018). 2017 benefit system performance report. Ministry of Social Development. https://www.msd.govt.nz/documents/about-msd-and-our-work/publications-resources/evaluation/2017-benefit-system-performance-report-june-2018.pdf
Sin, I., Stillman, S., & Fabling, R. (2017). What drives the gender wage gap? Examining the roles of sorting, productivity differences, and discrimination (Motu Working Paper 17-15). Motu Economic and Public Policy Research. https://motu-www.motu.org.nz/wpapers/17_15.pdf
Social Investment Agency. (2017). Social housing technical report: Measuring the fiscal impact of social housing services (SIA Working Paper 1910.DP.REP.HOU). Social Wellbeing Agency. https://swa.govt.nz/publications/reports/
Social Investment Agency. (2018). Are we making a difference in the lives of New Zealanders—How will we know? A Wellbeing Measurement Approach for Investing for Social Wellbeing in New Zealand (SIA Working Paper 2018-0106). Social Wellbeing Agency. https://swa.govt.nz/assets/Uploads/Are-we-making-a-difference-in-the-lives-of-New-Zealanders-how-will-we-know.pdf
Stats NZ. (2018). Kiwis perceive high political trust but low influence. https://www.stats.govt.nz/news/kiwis-perceive-high-political-trust-but-low-influence
Stats NZ. (2019a). Overview of statistical methods for adding admin records to the 2018 census dataset. http://www.stats.govt.nz/methods/overview-of-statistical-methods-for-adding-admin-records-to-the-2018-census-dataset
Stats NZ. (2019b). Dual system estimation combining census responses and an admin population. http://www.stats.govt.nz/methods/dual-system-estimation-combining-census-responses-and-an-admin-population
©2022 Craig Jones, Anna McDowell, Vince Galvin, and Dorothy Adams. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.