
Data Science for Central Banks and Supervisors: How to Make It Work, Actually

Published on Jan 30, 2025

Abstract

Public authorities, such as central banks and supervisory authorities, are not known for their ability to quickly adopt new techniques in a rapidly changing world. Yet these authorities play a central role in society, for example in safeguarding the financial system. The challenge of keeping the financial system safe is formidable, and data science could potentially help. We discuss how to leverage the potential of data science using our experience at one of these organizations: De Nederlandsche Bank (DNB), the Dutch central bank. The dual role of DNB as central bank and prudential supervisor ensures that the lessons learned are of interest to all stakeholders in the public and financial sector. Furthermore, because we adopted a strategy that prioritizes cloud over on-premises IT infrastructure and established a Data Science Hub (DSH), the knowledge gained has wider applicability.

The goal of our study is two-fold. First, we demonstrate the significant potential of data science in nine lessons, all supported by our own projects. Based on our experience, we highlight the aspects necessary for fruitful data science work. It is a common misconception that getting data science to work for an organization can be achieved by hiring a few smart data geeks and having them develop ‘AI’ in a remote corner of the organization. We will argue that AI should become part of daily work processes to reap the full benefits. Second, we share how we work at the DSH with the intent of providing practical guidance and inspiration to other organizations that are thinking about implementing data science.

Keywords: data science, central banking, supervision, project implementation


1. Introduction

New data sources and new techniques are rapidly providing companies with new possibilities, improving the way they work. In this article, we present our experience of how to effectively employ data science in a central bank, supervisor, and resolution authority (De Nederlandsche Bank [DNB]). As is common for a central bank, DNB is in charge of a variety of tasks, ranging from ensuring financial stability and a smooth payment system to implementing monetary policy. A supervisor is in charge of supervising a wide range of financial entities, such as banks, insurers, and pension funds, whereas the resolution authority focuses on the orderly winding down of failing institutions. DNB’s multiple roles of central bank, supervisor, and resolution authority thus provide an excellent use case for implementing data science in an institution charged with a variety of tasks that are not necessarily related. Following a trial in the Statistics Division, data science was formalized by the establishment of the Data Science Hub (DSH) in 2020. The DSH is tasked with promoting data-driven ways of working and fostering the data science community. It is the hub in a hub-and-spoke model, working with the various divisions throughout the organization. The aim is to complement more traditional analysis approaches with data science techniques, commonly known as machine learning (ML) or artificial intelligence (AI). Following the definition of Nasution et al. (2021), we define data science as the extraction of knowledge from high-volume data, using skills in computer science, statistics, and expert domain knowledge. AI can be loosely defined as any system that perceives its environment and uses learning and intelligence to take actions that maximize its chances of achieving defined goals. In the field of supervision, the tools developed are known as ‘SupTech’ and allow supervision to become more efficient and capture risks more completely.

The goal of our study is two-fold. First, we demonstrate the significant potential of data science in nine lessons, all supported by our own projects. For a complete overview of all the projects we have undertaken in recent years, see the annual reports published on our web page. Second, we share how we work at the DSH, focusing on what works to fruitfully use data science in an organization. It is a misconception that putting data science to work for an organization can be achieved by hiring a few smart data geeks and having them develop ‘AI’ in a remote corner of the organization. We will argue that AI should be fully incorporated into daily work processes to reap the full benefits.

The following summarizes our lessons over the past few years:

  1. The combination of—sometimes novel—granular data offers new insights useful for both supervisors and policymakers.

  2. Combining internal data with external resources increases the information value of the data, allowing supervisors and policymakers to improve the incorporation of external factors and risks. A mature and widely supported framework for data governance is needed to work responsibly with the various data sources.

  3. Automating data processes results in more efficient, accurate, and frequent economic (prediction) models and allows policymakers to focus on modeling rather than preparing the data. Responsible coding is needed to ensure the replication and reproducibility of analyses and policy decisions.

  4. A data science function should be able to combine many different activities, and this requires the right mix of skills. Communication and being able to explain complex matters to the business is one important aspect as the real value of data science lies in the combination of data science techniques and domain knowledge.

  5. Machine learning has great potential and outlier detection models have already proven to be effective innovations in supervision. Complex machine learning approaches, such as neural networks, are, however, not a prerequisite for a successful data science project. In many cases, a simple model is a great place to start.

  6. The value of data science applications depends on the adoption by the business and therefore user-friendly interfaces to integrate the data science solution into the daily workflow are important to be able to reap the benefits.

  7. A well-functioning IT environment is needed to facilitate data scientists, which can only be realized in close cooperation with IT.

  8. Data science has value for the entire organization, including, for example, HR and business operations, and should therefore be in the ‘heart of the organization.’ Embracing new data science methods and using them throughout the organization requires sufficient appetite for experimentation—especially at senior levels—and therefore a common vision is key.

  9. Last but not least, as the prior lessons show, fulfilling the conditions to make data science work is costly and takes time. The crux lies in tempering inflated expectations but also investing in and being transparent about projects that are more likely to pay off in the long run.

Our study is organized as follows. Section 2 introduces the DSH and elaborates on organizational considerations when setting up a data science function. In Section 3, we showcase the value of data science for organizations based on our data science projects. This section relates to the first five lessons (see above). We then discuss the conditions necessary for employing data science with maximum impact in Section 4. Lessons 6–9 are presented here. Lastly, we explain our day-to-day operations in Section 5. This section is intended to provide practical guidance and inspiration to other organizations that are considering implementing data science roles.1 We thus leave out many of the technical details of the—sometimes highly technical—solutions we have provided to our clients.

2. The Data Science Hub

Recognizing that becoming a more data-driven organization requires dedicated resources, DNB set up the DSH in 2020. The DSH is the hub in a hub-and-spoke model that connects data professionals throughout the organization. The main goal is to help colleagues become self-starting data scientists. Depending on the skill level, this can range from simply moving from Excel to a coded solution to training quite advanced machine learning algorithms. As will be discussed in Section 5.3, we have drafted a Data Science Manifest that outlines the most important things to be mindful of in this journey.

The unit has a modest headcount of 7 full-time equivalents (FTEs), sometimes supplemented with secondments from other departments and visiting PhD students. The limited resources imply that we have to make some scoping choices. For one, we do not do projects that are primarily about dashboarding or robotic process automation. Elsewhere, we have excellent teams that focus on these topics.

In the run-up to the establishment of the DSH, the optimal location was the subject of extensive debates. Ideally, the location should facilitate bringing together the business, IT, and data. Currently, it is located in the Economic Policy and Research division. This implies that the DSH is in a business division and close to the ‘Science’ of data science. The alternative locations discussed were the IT and Statistics divisions because of, respectively, the IT and data component. In the end, the board decided that a unit tasked with solving IT and data issues across the entire bank should not be located inside a division that was primarily tasked with IT or data. Although being close to these subjects can have benefits, it might also make it more likely that the DSH would be sucked into the ‘run’ problems of the hosting division. Depending on the institutional structure, the optimal location in other organizations might of course be different.

It is generally accepted that support from senior management is important in driving change. To secure this support, the DSH has a board sponsor with whom we have regular meetings. In addition, we meet with each of the other board members at least once a year. To increase understanding of how new techniques might impact how analysis is done, we have organized sessions with board members and senior management to work through a use case in a prepared Jupyter Notebook. This led to interesting discussions on how analysis of massive data sets can be meaningfully reported to busy managers.

3. The Value of Data Science

Since the start, we have completed close to 70 projects, and in this section, we will discuss a small selection chosen to support the first five of our presented lessons. The first use case covers our projects using large granular data, supporting our first lesson, L-1, on the use of these data to gain new and relevant insights for supervisors and policymakers. Such data are technically challenging to analyze but hold much promise—especially when combined. L-2 is supported by our second use case on a digital twin model for climate risk, discussing the advantages of combining proprietary data with external data. The third use case highlights how automation can lead to efficiency (L-3), especially if it is gradually introduced. Fourth, we present a use case where domain knowledge was crucial for the project’s success, and this supports L-4. Finally, we demonstrate that relatively simple data science models are a great way to start (L-5).

3.1. Granular Data Sets (L-1)

The 2008–2010 crisis revealed that authorities were missing crucial information to accurately identify risks in the financial system. This realization led to a significant increase in the volume and granularity of the data that financial institutions are required to report. In Europe, for example, granular information on credit (AnaCredit), money market transactions (MMSR), securities financing (Securities Financing Transactions Regulation [SFTR]), derivative trades (European Market Infrastructure Regulation [EMIR]), security holdings (Securities Holdings Statistics [SHS]), and trading (Markets in Financial Instruments Directive [MiFID]) is being collected. Although data quality issues remain, these very granular data allow for unprecedented coverage of all major activities in the financial sector (Ullersma & Van Lelyveld, 2022).

Combining the new granular data in a coherent framework would allow for an even better understanding of the dynamics of the European financial system. Here, some challenges remain. We list three of the main challenges we have faced and show how we have coped with them.

First, in many cases, reporting agents are free to submit counterparty names as free-form text. As Figure 1 shows, exposures to the same agent can therefore be labeled slightly—or completely—differently, leading to an underestimate of the concentration of risks. To map differently spelled counterparty names to a single and unique identifier, we have developed a “fuzzy name matching” package, which is available on GitHub. The figure shows how a single institution may show up in data sets under many differently spelled names, providing a clear case for our fuzzy name matching package (Nijhuis, 2022).

Figure 1. The case for “fuzzy name matching.” From Data Science Hub (DSH) project documentation.
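The idea behind fuzzy name matching can be sketched in a few lines. The snippet below is a simplified illustration using only the Python standard library, not the actual package: names are normalized (the suffix list and threshold are illustrative assumptions) and compared with `difflib.SequenceMatcher`, and a match is accepted only above a similarity threshold.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and strip common legal-form suffixes before comparing."""
    name = name.lower().strip()
    for suffix in (" plc", " n.v.", " nv", " b.v.", " bv", " ltd", " inc"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name.strip(" .,")

def match_counterparty(raw_name: str, known_names: list[str],
                       threshold: float = 0.85):
    """Return the best-matching canonical name, or None if below threshold."""
    best, best_score = None, 0.0
    for candidate in known_names:
        score = SequenceMatcher(None, normalize(raw_name),
                                normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None

canonical = ["HSBC Holdings PLC", "Deutsche Bank AG"]
print(match_counterparty("HSBC Holdings", canonical))  # matches despite the missing suffix
```

In practice, production-grade matching would add token-level similarity measures and blocking to keep the comparison tractable for millions of names.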

Second, an issue in merging granular data sets is that not all entities included in the data have a single unique identifier. After the global financial crisis, the Legal Entity Identifier (LEI) was introduced. The potential of the LEI for supervisors, financial markets, and institutions is significant. It not only introduces unique firm identifiers, but also contains information on ownership structures. Thus, one is able to, for example, plot intra-firm networks, as shown in Figure 2. The figure plots the intra-firm network of HSBC Holdings PLC. Every dot represents an entity with an LEI code that belongs to HSBC Holdings PLC. The colors of the nodes represent the country of the reporting entity. Such renditions are relevant for supervisors, as regulation and resolution regimes differ by jurisdiction. An interactive dashboard would allow users to drill down and access further information about entities of interest. Unfortunately, substantial data issues, including the current low coverage of the data, impede their use and underscore calls for a further roll-out of LEI reporting (Rietveld et al., 2023).

Figure 2. Intra-firm networks using LEI data. From Rietveld et al., 2023, p. 10.
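As a minimal illustration of how LEI ownership links can be turned into group structures, the sketch below builds a parent-to-child mapping from hypothetical relationship records and collects every entity that rolls up to a given parent. The LEIs and ownership pairs are invented for the example; real relationship data would come from the GLEIF files.

```python
from collections import defaultdict

# Hypothetical (child_lei, parent_lei) ownership records, as might be
# extracted from LEI relationship data. All identifiers are made up.
ownership = [
    ("LEI-UK-001", "LEI-GROUP"),   # UK subsidiary -> holding company
    ("LEI-FR-002", "LEI-GROUP"),
    ("LEI-FR-003", "LEI-FR-002"),  # indirect ownership via the French entity
]

children = defaultdict(list)
for child, parent in ownership:
    children[parent].append(child)

def group_entities(root: str) -> set[str]:
    """Collect every LEI that ultimately rolls up to the given root."""
    members, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in members:
                members.add(child)
                stack.append(child)
    return members

print(sorted(group_entities("LEI-GROUP")))
```

The resulting membership sets are exactly what a network plot such as Figure 2 visualizes, with each member as a node colored by country.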

Third, by focusing on one specific topic, one may simply not get the complete picture. For example, due to the over-the-counter (OTC) nature of derivative markets, there is no centralized overview of the market. Participants only observe their own volumes and exposure concentrations. In the 2008 crisis, for example, the major U.S. investment banks therefore did not realize that jointly they were massively exposed to a single entity, the lightly regulated insurer AIG. In setting their capital buffers and implementing other risk-mitigating procedures, they were thus ignoring an important yet unobserved concentration risk.

In one of our projects, we measured the degree of interconnectedness of derivative positions between institutions for derivative contracts in which at least one counterparty is established in the Netherlands (Van den Boom et al., 2021). Figure 3 shows a network based on interest rate derivatives. The colors of the dots indicate the different sectors and the sizes reflect the (logarithmic) aggregate size of the derivative positions. The system is highly interconnected, with three large Dutch banks that facilitate most clients and have access to central counterparties (CCPs). The cluster in the middle indicates that Dutch pension funds and insurance companies do not trade exclusively with Dutch banks, but also use large international banks and CCPs for their derivative trading. The policy implication is that in the supervision of local pension funds, the health of particular institutions in international capital markets should also be monitored.

However, this network is based only on interconnections via derivative trades. To get a complete view of how interconnected the financial system is, one should actually consider different types of linkage. Currently, in one of our projects, we have combined several granular data sets with the aim of coming to a comprehensive view of exposures of Dutch banks across all traded instruments.2 The data included come from the AnaCredit credit register, SHS, SFTR, and, for derivatives, EMIR. We have combined these data sets using the aforementioned LEI information and data obtained from the Register of Institutions and Affiliates Database (RIAD). Given the confidentiality of some of these data sources, we do not show the output here. However, we can say that the established network clearly shows that for adequately identifying concentration risk, one should not focus solely on a single source of exposure, but take into account different sources of (preferably) granular data.

Figure 3. Network analysis of interest rate derivatives. From Van den Boom et al., 2021, p. 7.
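The value of combining exposure sources keyed on a common identifier can be illustrated with a toy aggregation. The sources and amounts below are hypothetical stand-ins for extracts from a credit register, security holdings, and derivative data; summing per counterparty across sources is what reveals the concentration that no single source shows.

```python
from collections import Counter

# Hypothetical exposures (counterparty LEI -> EUR millions) from three
# granular sources; in practice these would be AnaCredit, SHS, and EMIR
# extracts already mapped to LEIs. All numbers are illustrative.
credit_register = {"LEI-A": 100.0, "LEI-B": 40.0}
security_holdings = {"LEI-A": 25.0, "LEI-C": 10.0}
derivatives = {"LEI-B": 5.0, "LEI-A": 15.0}

# Counter.update() adds values for shared keys, giving per-counterparty totals.
total_exposure = Counter()
for source in (credit_register, security_holdings, derivatives):
    total_exposure.update(source)

# The concentration only becomes visible once the sources are combined.
print(total_exposure.most_common(1))  # largest aggregate exposure
```

Viewed in any single source, LEI-A's exposure looks moderate; combined, it dominates, which is the concentration-risk point made above.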

The increasing volume and use of data result in greater attention to privacy and data protection. On the one hand, this is covered by (recent) data regulations.3 On the other hand, for organizations with a long history, it requires a new way of working with respect to, for example, data ownership. Data should be carefully managed throughout the organization, with data owners responsible for the use and management of their data. For a data scientist, this implies an important initial step before actually starting to work with the data: request formal approval to use the data, providing the owner with information on the goal, the timeline, the storage of (intermediate) results, and the output of the project for which the data are going to be used. At the same time, there should be enough flexibility in the data governance process for the organization to be able to analyze data in a timely manner and respond to developments and changing business needs. It is therefore important that both the data owner and the data scientist know what they can expect from each other in their respective roles.

Data governance is present in each step of a data science project—also during the modeling phase. For example, one concern related to foundational models4 is that they often require confidential information to be sent to the model (hosted outside of the organization) or, even more concerning, that this information is used to further train the model. To alleviate these concerns, OpenAI—one of the leading companies developing foundational models—is working toward developing trained models that can be used offline. This would allow organizations to implement a foundational model within their own infrastructure without having to share confidential data.

3.2. Combining Internal and External Sources (L-2)

The data you own is much more valuable to you if it is augmented with data owned by others (Mewald, 2023). This is also the case for central banks. The data a central bank receives through regular reporting can become even more informative if we add nontraditional data. For example, Van Dijk and De Winter extract topics from a large corpus of Dutch financial news (spanning January 1985 to January 2021) and investigate whether these topics are useful for monitoring the business cycle and nowcasting GDP growth in the Netherlands (Van Dijk & De Winter, 2023). Their newspaper sentiment indicator has a high concordance with the business cycle and increases the accuracy of DNB’s nowcast of GDP growth, especially in periods of crisis. Therefore, tone-adjusted newspaper topics seem to contain valuable information not included in traditional monthly indicators from statistical offices.

Of course, adding other data is not a new idea. Hedge funds, for example, have been using ‘alternative data’ for decades. One of the first companies to use alternative data such as satellite imagery, web scraping, and other creatively sourced data sets was Renaissance, a hedge fund looking for an edge in trading. A big bank like UBS uses satellite imagery of big retailers’ parking lots and correlates car traffic with quarterly revenue, generating accurate predictions of earnings before they are released. Another great example is that of the Banque de France (Bricongne et al., 2021), which uses freely available daily granular satellite data on air pollution to predict industrial production. These cases clearly show that alternative data, that is, external data without a direct link to one’s own business, can help.

In our context, we combine alternative data with confidential supervisory data on geolocated real estate loans to generate a digital twin of climate risks, building on initial work done in the Bank for International Settlements (BIS) Innovation Network. Here, a digital twin is defined as a digital representation of a real-world entity or system.5

The digital twin can be used to assess the effect of climate events on the financial system through the real estate exposures of financial institutions. For the Netherlands, a flood risk case is included based on existing research on damages caused by specific water depths. Basic housing information, housing price statistics, and zip code maps were combined to map real estate exposures to the flood map and determine loss estimates for the financial industry, that is, banks and insurance companies (see Figure 4). The figure shows estimated losses for the Dutch financial industry in a flood risk scenario (with an estimated probability of once in 100 years). Such analyses have been undertaken by (re)insurers, but with this tool the methodology can be applied to confidential supervisory data.

Figure 4. A Digital Twin pilot for climate risk. From Data Science Hub (DSH) project documentation.

The framework is flexible and we are currently working on including live water level information, satellite feeds, and other risks (e.g., wildfires). The code is open source, and we are actively trying to interest other authorities to adapt this framework to their own data and their most relevant climate risks.
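To make the loss-estimation logic concrete, here is a stylized sketch: map exposures to flood depths by zip code, then apply a depth-damage curve. The flood depths, exposures, and the damage function are all invented for illustration and are not the figures or curves used in the actual pilot.

```python
# Hypothetical inputs: flood depth per zip code (metres) and real estate
# exposure of financial institutions per zip code (EUR millions).
flood_depth = {"1011": 0.0, "3011": 0.5, "2611": 1.5}
exposure = {"1011": 200.0, "3011": 150.0, "2611": 80.0}

def damage_fraction(depth_m: float) -> float:
    """Stylized depth-damage curve: no damage when dry, capped at 60%."""
    if depth_m <= 0:
        return 0.0
    return min(0.6, 0.2 + 0.25 * depth_m)

# Loss = exposure times the damage fraction implied by the local water depth.
estimated_loss = sum(
    exposure[zc] * damage_fraction(depth) for zc, depth in flood_depth.items()
)
print(f"estimated industry loss: EUR {estimated_loss:.2f}m")
```

Swapping in live water levels or a wildfire hazard map only changes the hazard input and damage curve, which is what makes the framework flexible.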

3.3. Automating Data Processes (L-3)

Until relatively recently, the typical workflow for in-depth analysis was to collect data manually from internal or external sources, with data wrangling often a laborious job in Excel. Such manual processes are not only expensive, but also prone to human error. For example, for DNB’s internal inflation prediction model, external data was collected from various sources on a regular basis as input for the model. In fact, multiple processes within the central bank use external data sources, resulting in colleagues collecting (the same) data manually or via ad hoc scripts. This may also result in cases where different (or even outdated) versions of the same data set are used. The left-hand panel of Figure 5 shows this situation.

In the ideal situation, that is, the right-hand panel of Figure 5, colleagues in the same institution have immediate access to the same data, while restrictions dictated by privacy and confidentiality are respected. Therefore, in addition to opening a discussion on how to modernize data workflows, we developed DataFetcher, a Python package that acts as a wrapper on top of publicly available application programming interfaces (APIs) granting access to various data sets (e.g., International Monetary Fund [IMF], Organisation for Economic Co-operation and Development [OECD], and European Central Bank [ECB]). Users no longer need to understand all the separate APIs, but can download data using a unified syntax. Working closely with users during development allowed us to establish trust and leave users in control. Once the DataFetcher package was established, we started on infrastructure to collect the necessary data and fill a database on Azure—our cloud provider. Again, working closely with the modeling department allowed for skill transfer and the establishment of a sense of comfort with this new way of working. This approach is known as BizDevOps (developing and operating close to or by the users) and is especially effective if requirements are fluid or are to be fleshed out in the process.

Figure 5. The DNB DataFetcher. From Data Science Hub (DSH) project documentation.
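To illustrate the unified-syntax idea, the sketch below shows a minimal wrapper in the spirit of DataFetcher. The source names, fetch functions, and method signature are hypothetical placeholders, not the package's actual API: each source-specific client is hidden behind one `get` entry point.

```python
# Placeholder fetchers standing in for real API clients (e.g., IMF, OECD).
# A production wrapper would handle authentication, paging, and caching.

def fetch_imf(series: str) -> list[float]:
    return [1.0, 1.1]          # stand-in for an IMF API call

def fetch_oecd(series: str) -> list[float]:
    return [2.0, 2.1]          # stand-in for an OECD API call

class DataFetcher:
    """One entry point, many sources: users learn a single syntax."""

    _sources = {"imf": fetch_imf, "oecd": fetch_oecd}

    def get(self, source: str, series: str) -> list[float]:
        try:
            return self._sources[source](series)
        except KeyError:
            raise ValueError(f"unknown source: {source!r}") from None

fetcher = DataFetcher()
print(fetcher.get("imf", "NGDP_R"))   # same call shape for every source
```

Because every source is reached through the same call shape, downstream pipelines (and the Azure database loader) do not need source-specific code.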

The next step we are currently working on is to be able to automatically run models in the cloud. The ultimate goal here is to be able to easily interact with forecasts, for example, from a smartphone. It is not our ambition to conquer the market with this app, but the ability to quickly and painlessly change some part of the process allows for more flexibility in the development process. For example, it will be much easier to change the inflation forecasting model to incorporate unanticipated energy crises or pandemics. This, in turn, will improve policymakers’ ability to react to unforeseen events in a timely manner.

Data science projects often require multiple steps and different processes, from data cleaning to merging and analyzing the data, building models, and visualizing results. This requires an IT environment that allows for the variety of tasks and operations and offers the most important integrations, so that the data scientist is offered a seamless experience without having to switch to a different environment. In our experience, this requires quite some effort from IT. The environment, of course, must meet the requirements arising from internal data governance and security policies. However, the second most important thing is that the environment is user-friendly. The complexity lies in the fact that the platform has to serve different users with different desires with respect to the data, software, end products, and so on. Again, here there is a clear difference between fully fledged AI organizations that could build their data science platforms from scratch and the more traditional organizations that start to integrate data science in their existing IT environment. The latter is much more complex but is also much needed to keep data scientists happy.

The value of data science applications depends critically on the adoption by the business, and as such, the move to the production stage is key. Close cooperation with IT is crucial in this step to ensure that the transition to production runs smoothly and that the application meets internal policies and security standards. However, this does not start at the end of a data science project. As noted by Davenport and Malone (2021), rather than thinking of deployment as the last step in a linear set of activities, a data scientist—or at least key members of data science teams—should consider factors that have a strong influence on deployment throughout the data science project. In our experience, this comes with quite a difficult trade-off. On the one hand, you want the data scientist to think about the deployment stage upfront, while on the other hand, you do not want to limit the data scientist in experimenting with various data science solutions. Hence, short lines of communication with IT throughout the data science project are valuable.

Lastly, the production stage is not the final stage: it should be clear where the responsibilities for maintenance and continuous improvement lie. For us, this is not always clear. There is an argument for placing these responsibilities where the application lands, that is, close to the end users. End users are best placed to formulate improvements, but are not always capable of implementing them. In this case, a move toward a BizDevOps way of working—where both development and maintenance are done in the business—could be a solution. However, organizing highly specialized skills in many places might be a challenge, which argues for centralization.

3.4. Incorporating Domain Knowledge (L-4)

An example of a data science project that shows the importance of domain knowledge is our False Unfit Banknotes project. Commercial cash handlers send banknotes that they consider unfit for circulation to DNB. Cash handlers also manage ATMs in the Netherlands. Unfit banknotes are checked again at DNB because DNB has specific authentication sensors to determine whether a banknote is unfit for circulation. During the sorting process at DNB, a surprisingly large percentage of these unfit banknotes are evaluated as fit. We call these banknotes “false unfit.”

Figure 6. Detecting false unfit banknotes. From Data Science Hub (DSH) project documentation.

In cooperation with colleagues from the Payments division, the DSH investigated the high percentage of false unfit banknotes and how this percentage could be reduced. Looking at the data on the matched banknotes, we can see where the classification differs between DNB and the cash handler and pinpoint specific rules that do not add up. Figure 6 shows the percentage of cases in which the DNB and cash handler classifications are in line. For example, in 93% of the cases both DNB and the cash handler decide to classify a banknote as unfit due to a folded corner, and hence in 7% of the cases the cash handler classifies a banknote as unfit due to a folded corner while DNB does not. For fully compatible measurement, the diagonal of the matrix from the bottom left to the top right would be filled with dark squares, as the cash handler’s trigger would be identical to DNB’s trigger. The number of DNB ‘fit’ classifications in cases where the cash handler detects a problem shows the extent of the false unfit problem. Only hole size, tear size, and corner defects are regularly triggered for the same banknote by both the cash handler and DNB. While it is easy to compare the consequences of adjusting just one of the rules (e.g., the tape decision or dirt), it quickly becomes more complicated once multiple rule settings are adjusted simultaneously. Therefore, we applied machine learning (specifically, a genetic algorithm) to arrive at the optimal combination of multiple rule adjustments. Reducing the number of false unfits can save much effort and expense, and this project resulted in a set of recommendations for our Payments division to achieve these cost reductions.
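To give a flavor of how a genetic algorithm can tune multiple rule thresholds at once, here is a stylized sketch. The banknote features, the 'ground truth' rule standing in for DNB's sensors, and all thresholds are invented for the example; the real project optimized the handlers' actual sorting rules.

```python
import random

random.seed(0)

# Synthetic banknotes with two measured defects (tear length, dirt score).
notes = [
    {"tear_mm": random.uniform(0, 10), "dirt": random.uniform(0, 1)}
    for _ in range(200)
]

def dnb_unfit(note):
    """Illustrative stand-in for DNB's sensor-based ground truth."""
    return note["tear_mm"] > 6.0 or note["dirt"] > 0.8

def handler_unfit(note, thresholds):
    """The cash handler's rules, with the two thresholds we want to tune."""
    tear_limit, dirt_limit = thresholds
    return note["tear_mm"] > tear_limit or note["dirt"] > dirt_limit

def fitness(thresholds):
    """How often the handler's rules agree with the ground truth."""
    return sum(dnb_unfit(n) == handler_unfit(n, thresholds) for n in notes)

# Minimal genetic loop: keep the fittest candidates and mutate them slightly.
population = [(random.uniform(0, 10), random.uniform(0, 1)) for _ in range(20)]
for _ in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:5]
    population = parents + [
        (max(0.0, t + random.gauss(0, 0.5)),
         min(1.0, max(0.0, d + random.gauss(0, 0.05))))
        for t, d in parents
        for _ in range(3)
    ]

best = max(population, key=fitness)
print(f"agreement: {fitness(best)}/{len(notes)} at thresholds {best}")
```

The appeal over a grid search is that the same loop scales to many interacting thresholds, which is exactly where manual rule-by-rule comparison breaks down.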

3.5. Starting Simple (L-5)

There is a lot to gain from machine learning, but when setting up the DSH we soon realized that starting with advanced machine learning models would not convince end users to use our products. For people with little knowledge of or experience with data science, complex models can quickly seem like black boxes. During our first year, we therefore put more effort into simpler projects, for example, guiding colleagues to use Python instead of Excel. Here, the main lesson is: start simple. Ideally, you start with projects that, over time, will allow you to answer more complicated questions; for example, first establish a solid pipeline to identify normal patterns in your data, followed by a more complicated outlier detection algorithm. An important part of both the compilation of statistics and supervision is identifying observations that are out of the ordinary. In this section, we cover two of these projects that focus on outliers: first, an approach in which we implement reinforcement learning in granular prudential reporting, and, second, a Know Your Customer (KYC) use case.

The first use case is one in which we implement a reinforcement learning algorithm (Nijhuis & Van Lelyveld, 2023). Outliers are often present in data, and many algorithms exist to find them. In many cases, we can verify outliers to determine whether they are actually data errors. Traditionally, outliers are identified using ‘business rules’—ground truths that are valid by definition or result from experience. Assets should equal liabilities, for example. However, the definition and hard coding of business rules is cumbersome. Also, in some use cases we have not yet established strong priors for what is ‘normal.’ Unfortunately, checking such points is time-consuming, and the underlying issues leading to the data error can change over time. An outlier detection approach should therefore be able to optimally use the knowledge gained from verification of the ground truth and adjust accordingly. With advances in machine learning, this can be achieved by applying reinforcement learning in a statistical outlier detection approach. The approach uses an ensemble of proven outlier detection methods in combination with a reinforcement learning approach to tune the coefficients of the ensemble with each additional bit of data.6 In a reinforcement learning approach, an algorithm is not just trained once and then applied, but in each iteration, the algorithm gets feedback on its performance and adjusts accordingly. In our case, analysts are presented with a set of extreme values identified by the algorithm and then check each of these observations and record their evaluations. An advantage is that the observations with the highest outlier values are shown first. Then, when a new report is submitted, the algorithm takes the human feedback into account and provides the analysts with a new list of outliers.

We are currently working on implementing the reinforcement learning outlier detection approach for granular data reported by Dutch insurers and pension funds under their regulatory frameworks Solvency II and FTK (Financieel Toetsingskader). This application shows that outliers can be identified by the ensemble learner. Moreover, applying the reinforcement learner on top of the ensemble model further improves the results by optimizing the coefficients of the ensemble learner.

The second use case is a KYC project. KYC is a mandatory customer due diligence process that requires financial institutions to verify the identity of the customer and assess and monitor their activities to prevent fraud. Since larger banks often have millions of clients and billions of financial transactions, data science has great potential to help monitor customers and identify potentially fraudulent transactions. In fact, it has already been applied. For example, Anzo (Cambridge Semantics) provides flexible knowledge graphs that allow institutions to connect customer information from structured and unstructured data and thus provides a data-driven solution for KYC processes.

Of course, the use of data science to monitor customers comes with additional challenges, such as discussions on consumer trust in technology and privacy. However, as Elliott et al. (2022) stress, without integrated and innovative contributions from the industry resulting in improved services, it will be impossible to shape a path toward more substantial technological innovations. Financial institutions have to comply with KYC guidelines and regulations, and supervisors, in turn, evaluate the efforts of institutions. Based on samples of client data from supervised entities, the DSH, in cooperation with our colleagues in integrity supervision, has therefore developed an outlier detection model to establish client profiles and map these to the risk classification of the supervised entity. With the model, we were able to effectively select clients with abnormal transaction profiles. More specifically, we applied an isolation forest outlier detection algorithm to millions of profiles (Section 5.4 of the Cambridge State of SupTech Report (Di Castri et al., 2023) provides a more extensive overview of our case). Section 3.5 shows a graph with outlier detection scores for bank clients plotted against two client characteristics: the sum of transactions (in euros) and the number of transactions by the client.
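A minimal version of such a model can be sketched with scikit-learn's `IsolationForest` on synthetic data. The figures and feature choices below are invented for illustration; the actual DNB model, features, and data differ.

```python
# Illustrative sketch with synthetic data (not the actual DNB model):
# an isolation forest flags clients whose transaction profiles are
# unusual relative to the rest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical client profiles: total transaction value (EUR) and count.
normal = np.column_stack([
    rng.normal(20_000, 5_000, 995),   # sum of transactions
    rng.normal(50, 10, 995),          # number of transactions
])
# A few anomalous profiles: very large sums moved in very few transactions.
suspicious = np.column_stack([
    rng.normal(500_000, 50_000, 5),
    rng.normal(3, 1, 5),
])
profiles = np.vstack([normal, suspicious])

model = IsolationForest(contamination=0.01, random_state=0).fit(profiles)
flags = model.predict(profiles)          # -1 = outlier, 1 = inlier
scores = -model.score_samples(profiles)  # higher = more anomalous

print("flagged indices:", np.where(flags == -1)[0])
```

The anomaly scores (here, `scores`) are what would be plotted against the two client characteristics in a dashboard, so that supervisors can inspect the most anomalous profiles first.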

The outlier detection model led to the identification of new risks and to efficiency gains, as supervisors are now able to consider all transactions instead of only small samples. The model and results have been shared with the supervised banks to ensure transparency. This example clearly shows that the real value of data science lies in combining the domain knowledge of the supervisor with the computational power of a computer to analyze millions of client transactions.

Figure 7. Using outlier detection for integrity risk. From DNB DSH project documentation.

As stated above, the KYC project is a good example of the importance of domain knowledge in data science projects. This was also stressed by DNB board member Steven Maijoor in his speech at the Data Science Conference organized by the DSH in 2022 (Maijoor, 2022), using the following example. To detect outliers in client transaction data, we traditionally define tell-tale identifiers such as ‘multiple accounts on a single address’ or ‘a single deposit per month and immediate withdrawal.’ Seen separately, these are relatively innocent. Together, however, they can indicate human trafficking of seasonal workers. The combination identifies subcontractors who organize housing for seasonal workers, which in itself is a perfectly legal activity. But if at the same time the contractor immediately withdraws most of the wages deposited, with only a fraction of the wage paid to the worker, it is an illegal activity. However, the combination could also be consistent with student housing: a large inflow when student grants and loans arrive and a relatively quick withdrawal rate. While these examples are based on just two dimensions, in practice there are many more dimensions, and these can interact in multiple and nonlinear ways. With data science techniques, we can identify observations that are out of the ordinary on many dimensions. Exactly for this reason, data scientists should be in close contact with colleagues with domain knowledge, not only to provide input for the model but also to interpret model outcomes. To facilitate communication with domain experts, we use dashboards such as the one in Section 3.5. Here we plot two dimensions (the sum and the number of transactions, respectively) and add information on the outlier degree based on all dimensions. Interacting with the data in this way allows supervisors to quickly identify client profiles that they want to discuss with the supervised institution.

4. How to Employ Data Science for Maximum Impact

As showcased in the previous section, data science has given us much in the last 4 years. When we started in 2020, DNB did not have a bankwide center of excellence like the Data Science Hub; at the time of writing this article, we have worked on close to 70 data science projects throughout the whole organization. However, setting up a DSH is not a standalone exercise. Integrating it into the organization requires more; in this section, we discuss four lessons that we believe are important for an organization to become a more data-driven one that embraces data science. Note that the section is based on our experience and depends on how a data science team is organized within an organization. Functioning as a hub, the DSH undertakes data science projects in collaboration with other departments throughout the organization (the ‘spokes’). This section explains how we work and is meant as a practical guide and inspiration for other organizations considering setting up a data science department.

Domain and data science knowledge should be matched. However, the job title ‘data scientist’ is relatively new and the function that a data scientist fulfills in an organization can vary quite a lot depending on the type of organization. In fully fledged AI-minded organizations, such as tech start-ups, data scientists may have well-defined responsibilities, such as being responsible for a specific part of a model. In organizations where data science is not (yet) integrated in the organization as a whole, the role of a data scientist may be less clear. In immature organizations, a data scientist is expected to attempt the almost impossible: excel at mathematics, statistics, machine learning, coding, data engineering, marketing, governance, and navigating the policies drafted for a different time.

As mentioned above, setting up the DSH in 2020 was not just about hiring a few data scientists. Of course, a data scientist should have sufficient technical skills. But another skill that turned out to be a need-to-have is the ability to communicate complex subject matter to a broad range of people, many of whom may never have heard of the term data science. The aim of the DSH is to work demand-driven, ensuring that what we do is not merely the result of available data and an eagerness to work on complex coding, but has (potential) value for the organization. However, generating demand takes some effort: policymakers and supervisors need to get a sense of how data science solutions may help them, which first requires a basic understanding of what data science can and cannot do. The next skill that turned out to be important relates to data engineering. Since the DSH works throughout the organization, the type of data we deal with varies from confidential supervisory data to financial market data sourced from a rating agency. This means that the data with which we work are stored at different locations in very different formats, so data engineering skills are necessary. Lastly, as discussed in Section 4.2, for a data science project to land in the organization, adoption by both the IT department and end users is key. In an organization where data science is relatively new, this requires data scientists to quickly find their way through the organization. Needless to say, the ability to integrate data science in the business depends not only on the skill sets of data scientists but may also require additional skills from, for example, managers and business analysts. As stressed by Harris (2012), they must become i) ready and willing to experiment; ii) adept at mathematical reasoning; and iii) capable of seeing the big (data) picture.

4.1. Necessary Conditions: Data Governance and IT (L-6)

Since data science brings together data, IT, and business, it is clear that success depends crucially on the level of maturity of data governance and the IT platform. In terms of data maturity, DNB is on the same journey as most other government organizations. All data sets in DNB have a department head as owner. He or she might delegate day-to-day operations, but remains responsible for approving access and making the data available to users. There is a data catalog with the metadata and contact details. More and more data sets are becoming available on our data science analysis platforms. The IT platforms have been improving markedly in the last few years. DNB has been pursuing a cloud-first strategy and this is bringing a seamless integration of data storage and analysis capability closer (cf. Edmond et al., 2022). As DSH, we have been promoting a vision of “shifting gears without friction” where data can be connected to all available analysis and dashboarding tools with minimal effort. So, depending on the (changing) needs of the business, we might connect to data with Python, MATLAB, or some other language. For the initial analysis, we might use a Jupyter Notebook, while for the final product we might use Power BI or a web app. Ideally, all options will be available without having to sign off for each individual component or project.

4.2. Adoption by IT and the Business (L-7)

The real value of data science lies in its adoption by both IT and the business. L-6 already stressed the importance of data governance and the IT platform. In addition to these necessities, adoption is key: we can work on advanced models, but if they are not used in practice, the business value is not realized. Data science can generate value in multiple ways, and here we distinguish two main types. First, a data science project can be a one-off. The False Unfit Banknotes project discussed in the previous section is a good example: it generated clear value for the end users but will not be implemented in a day-to-day workflow. Although business commitment is still key to making such a one-off project a success, there is less need for the business to be able to write and maintain the product code. The second type of project is one that finds its way into daily processes. This often comes with additional challenges. We have seen many promising proofs of concept (PoCs), received with much enthusiasm, that fail to make their way into production. The success of any algorithm depends crucially on how seamlessly we can integrate the innovation into the existing workflow.

Note that the concept of ‘in production’ is a source of great confusion between IT and the average user. Typically, analysis is used to formulate policy as soon as the policymaker is convinced that the results are sufficiently solid. The analysis is often somewhat of a journey and invariably involves manual steps. Effortless reproducibility is often not the first concern, and reproducibility is ensured because the analyst is closely involved and has intimate knowledge of how to replicate the results. For an IT department that is asked to bring such an analysis into production, the standards need to be much higher: the process needs to run without (much) manual intervention or knowledge of the subject matter. This involves programming to catch all kinds of eventualities and extensive unit testing. The challenge is to find an organizational form that allows abstracting away typical IT housekeeping tasks (e.g., ensuring reliable backups) while allowing the analyst sufficient flexibility in further developing the tool.
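The difference in standards can be made concrete with a small example. The sketch below (the function name and checks are hypothetical) takes the "assets should equal liabilities" business rule mentioned earlier and wraps it in the kind of explicit input validation an unattended production pipeline needs, where an analyst's implicit knowledge must be replaced by clear, catchable errors:

```python
# Illustrative sketch of the gap between an analyst script and production
# code: the same balance check, but validating its inputs explicitly so it
# can run unattended. Names and checks are hypothetical.

def check_balance(report: dict, tolerance: float = 1e-6) -> bool:
    """Return True if assets equal liabilities within a tolerance.

    Raises ValueError with a clear message instead of failing silently,
    so that an unattended pipeline can log and route the error.
    """
    for field in ("assets", "liabilities"):
        if field not in report:
            raise ValueError(f"missing required field: {field!r}")
        if not isinstance(report[field], (int, float)):
            raise ValueError(f"field {field!r} is not numeric: {report[field]!r}")
    return abs(report["assets"] - report["liabilities"]) <= tolerance

print(check_balance({"assets": 100.0, "liabilities": 100.0}))  # True
```

In an analyst's notebook, a malformed report would simply crash the cell and the analyst would know what to do; in production, each such eventuality has to be anticipated in code.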

At this stage—bringing the data science application ‘in production’—frictions do not just result from the collaboration between the data scientists and the IT department. On top of the technical challenges comes the interaction with the end users. In developing a data science application, receiving feedback from the end users to improve an algorithm has sometimes been challenging, since data science environments are separated from what the average user can easily access. In other cases, end users underestimate the considerable effort needed to train and tune a model. Based on smooth experiences with consumer apps, they have unrealistic expectations of what bespoke algorithms can do in the short run.

For a data science application to create value, adoption by both the IT department and end users is key. Therefore, one should be able to align the often divergent perspectives and desires of both sides. In our experience and depending on the size of the project, this takes considerable time. One of our first projects, Dataloop, is a good example of this. Dataloop was initiated as a collaboration between DNB’s Statistics Division and the DSH with the goal of improving data quality in supervisory reporting. Dataloop does so by centralizing and visualizing data from different sources. It also offers several feedback channels, for example, between analysts and data validation tools. The project started in 2018, and a lot has happened since then. Only recently, in September 2023, was Dataloop publicly launched as an application for Solvency II insurers under DNB’s supervision after a successful pilot. This is just to show that big data science projects like Dataloop take time, especially in this case where the end users are both internal clients (i.e., supervisors) and external parties (i.e., the institutions under supervision of DNB).

4.3. Reaching the Entire Organization (L-8)

The largest part of our time is devoted to the execution of data science projects. However, part of our mandate is also to advise and support the data science community throughout the organization. We do so by organizing training courses and a variety of activities.

On the regular program, we have the Open Source Lunches (OSLs) and Open Source Workshops (OSWs). At an OSL, we generally have two presentations about data science–related projects from DSH or other DNB colleagues. Sometimes we have a demo of a software package or an analysis platform. Attendance for the OSL is open to all DNB colleagues; however, a certain basic understanding of coding is needed to fully get the message. Therefore, OSLs typically attract more advanced users of code and modeling.

Depending on the topic, the OSWs may serve a broader group of colleagues. As the name suggests, this activity takes the form of a real workshop. In about 3 hours, participants are introduced to a new data science technique or package. In the past, we have experimented a bit and have offered OSWs ranging from “Getting Started With Pandas in Python” to “Webscraping.” However, setting up a completely new workshop every 3 to 6 months is quite time-consuming. Therefore, we recently changed our focus and now offer three topics in cooperation with the DNB Academy: “Version Control With Git,” “Clean and Responsible Coding,” and “Explainable AI.” The rationale is that third-party providers can offer standard training for Python and other languages, but for some topics, such as Git, there is large demand for training offered by people who are familiar with the DNB IT environment. The workshops are provided by our data scientists and are conducted in small, interactive groups of about 8 people.

Another event that we have hosted twice is the “Become a Datapreneur” initiative. This initiative is part of DNB’s digital ambition and is organized with the help of many other departments. Participants work on their own data science project for 5 months and learn how to improve their work with data science–driven techniques. No prior coding experience is required. The only two conditions are that the project has to be in line with the work of all group members involved and that the participants will do the data science themselves. The program offers support, such as personal coaching and data science workshops during this period.

On top of these regular events, every now and then we organize ad hoc events, varying from a virtual inspiration session (during COVID-19 times) to visits to peer institutions. In 2021, we organized the conference Central Bankers Go Data Driven: Applications of AI and ML for Policy and Prudential Supervision.

The above activities support the data science community throughout the organization. Activities like the OSLs and OSWs are mainly tailored to colleagues with some basic knowledge of data or data science. However, the majority of our colleagues belong to the other target group: policymakers and supervisors with less knowledge of this specific topic. Setting up the DSH, we aimed to work demand-driven, implying that—in the ideal situation—policymakers and supervisors come to us with their data science questions. In practice, this requires a lot of effort. First of all, they are busy with their own work and need to find you at the right time. Second, after they have found you, their expectations often need to be aligned with what is reasonable to expect from a data scientist given the infrastructure. We soon discovered that passively waiting for requests does not work and started actively promoting data science throughout the organization.

The first step is to inspire people: What can data science offer? We give presentations at all levels throughout the organization and actively seek collaborations, for example, with our Innovation Office and HR. Regarding the latter, every 3 months, we organize a Data Party for new employees at DNB. The Data Party is a co-creation with our colleagues from the Data Office (a unit primarily tasked with organizing effective data governance). In under 3 hours, we let them experience the world of data and data science using a realistic case in an escape room setting. For example, a timed task is to order snippets of code provided to perform an analysis task. This has turned out to be an approachable and effective way to promote data science to new colleagues.

The second step is making all project documentation available internally. Finalized data science projects provide a very good—if not the best—example for colleagues. Sometimes, a data science project in one department can even be replicated for use by another department with a different data set. At the DSH we try to maximize the impact of our finalized projects by promoting them internally. We therefore present the project documentation (see Section 5.1) of all finalized and ongoing data science projects in an approachable Power BI dashboard shown in Section 4.3.

Figure 8. Project overview DSH 2023. From Data Science Hub (DSH) project documentation.

Third, we also communicate our successes externally. A project does not end with the celebration of its completion (see Rule 10 of our Manifest in Section 5.3). Taking a moment to celebrate, reflect on what the project has brought and what we have learned, even for less successful projects, helps motivate the team. Furthermore, we actively seek ways to communicate the results to the outside world since projects may also be of interest to other central banks. Each year, we present an overview of our projects in our annual report published on our website. And if confidentiality allows, we share our code via the GitHub of DNB.

4.4. Managing Expectations and the Long Game (L-9)

A few years ago, data science was proclaimed to be the sexiest job around. Many organizations invested significant resources expecting results at short notice. Disillusionment often followed when it turned out that simply hiring a data scientist with a blank check did not deliver structural business value. The necessary conditions for making the most of data-driven approaches include proper data governance, flexible IT infrastructure, and business managers interested in investing in innovation. Fulfilling these conditions is costly and takes time. Consequently, tempering sometimes inflated expectations while avoiding killing off budding enthusiasm is a challenge.

In some cases, we have taken on projects that are more likely to pay off in the long run. For example, we are experimenting with motion sensors in our office building to predict how busy our cafeteria will be. Section 4.4 shows a snapshot of a live dashboard at 12:20 (and thus we only show the predictions for the afternoon). Such forecasts can help our catering service plan capacity and help our staff make a more informed choice about when to have lunch. While we can unsurprisingly predict the lunch crowd, there are still some issues to fix: in our model, people do not seem to leave the restaurant after lunch, a clear flaw in the calibration of the sensors. On a standalone basis, this project was not very successful. However, sensor data have great potential; they can also be used for other purposes, for example, to monitor no-shows for meeting room reservations. The long game here is that through this project we are now involved in setting up the sensor infrastructure for the completely refurbished offices we are moving into shortly. The aim is to store the right data while guaranteeing privacy and complying with other regulations.
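For intuition, a bare-bones version of such a forecast could simply average historical sensor counts per time slot. The data, slot granularity, and method below are made up for illustration; the actual dashboard uses richer sensor data and models.

```python
# Minimal sketch (hypothetical data and method): predict cafeteria visitors
# per time slot as the historical average of sensor counts for that slot.
from collections import defaultdict

history = {  # hypothetical sensor counts per day, keyed by time slot
    "Mon": {"12:00": 80, "12:10": 120, "12:20": 150},
    "Tue": {"12:00": 90, "12:10": 110, "12:20": 140},
    "Wed": {"12:00": 85, "12:10": 115, "12:20": 145},
}

totals = defaultdict(list)
for day_counts in history.values():
    for slot, count in day_counts.items():
        totals[slot].append(count)

# Forecast for each slot = mean of past observations for that slot.
forecast = {slot: sum(c) / len(c) for slot, c in totals.items()}
print(forecast["12:10"])  # prints 115.0
```

Even such a naive baseline makes the calibration flaw mentioned above visible: if the sensors never register people leaving, the averaged counts for afternoon slots stay implausibly high.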

Figure 9. Restaurant visitors prediction using sensor data. From DNB DSH project documentation.

Other departments now getting involved are those that work mainly with text. A large amount of information flows into DNB as text. Natural language processing and recent advances in large language models (LLMs), such as ChatGPT, hold great promise for data science applications in all parts of a central bank. Colleagues in supervision have already experimented with a version tailored to our document infrastructure, called Chat DNB. One hurdle we face is that document storage and retrieval have not evolved at the same pace: documents are scattered across different systems, are not stored in a consistent format, and are difficult to access from our analytics platform. Notwithstanding these hurdles, we see more and more initiatives to make new data science techniques work for less traditional departments.

5. Organizing the Work

Data science is evolving rapidly, putting pressure on the traditional model of generating knowledge and insights from data. Just a few decades ago, much empirical analysis was based on manual manipulation of data points, and developing software to analyze data was slow and costly and required specialized knowledge. Since then, the granularity, timeliness, and volume of data have grown exponentially (Edge Delta, 2024). At the same time, computational power has exploded, compute costs have dropped significantly, and software development has become much easier. This combination opens many opportunities for supervisors and central banks. However, putting data science into practice requires an equal effort from both the end users of data science tooling—supervisors and policymakers—and IT. In our approach, we try to bring these worlds together.

Attracting a team with the right mix of skills is often mentioned as an obstacle, but fortunately this has not been a problem for the DSH. The private sector can offer much better pay packages, so we need to compete on other dimensions. First, we serve the public cause and do not have a profit motive; aligning private motivations with an employer’s aims and values is increasingly important. Second, our data scientists have access to data covering the entire financial system that are not available elsewhere. Third, we tackle a wide range of topics, with the aim of allowing data scientists to select their own projects as much as possible. Especially in larger organizations, data scientists tend to be strictly confined to a single topic or model; since we serve all divisions of DNB, we can offer a wide variety. Finally, some data scientists like to engage in academic research. This is not a regular activity, but if it aligns with our projects, it can be rewarding and sometimes leads to excellent publications (see Jansen, in press). Although we sometimes work with external consultants, we have a strong preference for permanent staffing of the DSH. We see this as a prerequisite for working toward long-term structural change.

It is key to understand each other’s approach, both from the business as well as from the IT side. Often, the users of data science tooling expect a smooth user experience given what they are used to in using consumer apps. However, central banks and supervisors dealing with highly sensitive information need to tread carefully: the sensitivity of the data implies that, for example, sharing information with software solutions hosted elsewhere is not straightforward. On the other side is the IT department: they often know what is and is not feasible in the existing IT landscape, but may sometimes not fully understand what the end users are looking for. Finding the productive middle ground is difficult, but not impossible. Often, the solution is communication. Therefore, for an agile approach to working with data, development should be as close to the business as possible. This implies the willingness to allow activities traditionally reserved for IT departments to take place across the institution. At DNB, we are currently developing a “build-by-business” (3B) policy that details how to enable and maintain applications developed by departments other than our IT department.

An important organizing principle for our projects is that we co-create with the business. An advantage of this approach is that, on completion, the project is more likely to be used because the business understands the application. More importantly, it is a very effective way to prioritize resources: the business is best placed to identify the most pressing problems on the ground. The drawback is that the right mix of skills might not be available in the business area. In addition, this approach is not very well suited to very large projects, which are likely to end up as ‘managed by IT.’

5.1. Our Workflow

Data science projects can take a long time, and moving from the experimentation phase to the implementation phase can take even more time. The aforementioned Dataloop project is a good example of this. Therefore, it is important to clearly define the goals of a data science project in advance. In our onboarding procedure, we first have a scoping session with the business. We then fill out a form outlining the research question, the deliverables, the value for DNB, the assignment of responsibilities, and the planning (see Section 5.1). Once everyone agrees, we have the department head sign off to ensure business ownership. The onboarding document is intentionally lightweight because for most projects the goals are clear, but the actual implementation is something that needs to be figured out along the way. The typical planning horizon is about 3 months, forcing us to define realistic and thus small projects. Once finished, we reconvene and decide if a follow-up project is warranted.

Figure 10. Project onboarding. From Data Science Hub (DSH) project documentation. 

As shown in Section 4.3, we run multiple projects at the same time and the projects are always a collaboration between the DSH and the client (end user). In the ideal situation, the client is involved in the coding part. Especially if the project results in an application that is used in daily business, the client should have a thorough understanding of how to properly code. However, in practice, this is not always the case. This is simply because the client is tasked with a completely different job that does not require any coding skills. In such cases, more support from the DSH is required and provided. In the longer term, however, and to move toward the data-driven organization, one would like to have the data science knowledge embedded in the business in every field of expertise.

Throughout the project, we have regular meetings with the client and always aim to present the ongoing work during department meetings (both within the DSH and in the client department) and other sessions, such as the Open Source Lunches the DSH organizes (see Section 4.3).

At the end of a project, we offboard the project. Like the onboarding, this is done in cooperation with the client. We summarize the project output in the offboarding form (see Section 5.1). At the end of the project, we also send a survey to our client asking for feedback on our work. In addition to this, the survey also provides input to track our key performance indicators (KPIs).

Figure 11. Project offboarding. From Data Science Hub (DSH) project documentation.

As for tooling, we almost exclusively use open source coding languages (mostly Python, some R). These languages are developing extremely fast. More importantly, they are designed to be open and interoperable, which aligns with our aim of sharing the code and encouraging cooperation (see Section 5.4). While for real code development we use an integrated development environment (IDE, e.g., Visual Studio Code), Jupyter Notebooks are invaluable to let people interact and experiment with code and data. In terms of statistical methods, we take a pragmatic approach and try to solve the problem at hand with the simplest possible method rather than the most advanced one (see Chakraborty and Joseph, 2017, for an excellent overview of relevant methods). We are in close contact with teams that focus on business analytics (i.e., dashboarding in Power BI) or robotic process automation (RPA). Although cloud infrastructure is not a prerequisite, it allows much easier development and deployment of tools built.

5.2. Prioritizing Scarce Resources and Measuring Success

The available resources for innovation follow from an institution’s overall strategy. Like most organizations, DNB has defined an enterprise-wide digitalization strategy with an allocated budget. However, tactical use of these funds is delegated down the line of command. As mentioned in the discussion of our onboarding procedure, an important criterion for taking on a project is whether the business is willing to free up resources.

In selecting projects, we also involve other aspects. First, we have translated our mandate into an objective-goals-strategies-measures (OGSM) framework with measurable KPIs. Our mandate is to take care of advising, supporting, and executing data science projects throughout DNB. Translating a clear mandate and position in the organization into the OGSM framework helps focus our limited resources. With measurable KPIs—shown in Section 5.2—we can demonstrate how successful we are in realizing our mandate or where we need to put in more effort.7

Table 1. Data Science Hub goals and KPIs. Taking care of advising, supporting, and executing data science projects. From DSH Annual Report 2022, p. 7.

Projects

  Goals:
  • Broaden the knowledge of data-driven work at DNB by educating clients in projects
  • Providing reusable solutions to clients as stated in the Manifest

  Key performance indicators:
  • All the data science projects keep the Manifest in mind
  • 25% of the projects use commits from non-DSH members
  • At least 20% of our projects lead to reusability

Process

  Goals:
  • We aim for wide usability and the relevance of external stakeholders within our whole working process

  Key performance indicators:
  • At least 20% of the finished projects are relevant for a business unit other than that of the current client
  • At least 50% of the finished projects are shared externally

Relationship

  Goals:
  • We communicate about our proceedings so that our colleagues know where to find us for a collaboration on data science projects
  • We stimulate a data science community

  Key performance indicators:
  • We work on projects in collaboration with at least 8 different divisions
  • The average client satisfaction is at least 8 out of 10
  • 100 unique participants joined an activity organized by the DSH

Team

  Goals:
  • We build on our current knowledge
  • We want to influence relevant internal stakeholders

  Key performance indicators:
  • 50% of the chosen courses are studied and applied
  • The product owner of the DSAP team gives us an 8 out of 10 for our input

5.3. Providing Guidance

To provide ourselves and, more importantly, data scientists across DNB with guidance on best practices in data science, we have drafted a Data Science Hub Manifest. We have formulated 10 core principles, shown in Table 2. With these principles, we primarily target a nontechnical audience. The core principles are explained in more detail in an accompanying document. In addition, we have fleshed out the Manifest on a much more technical level for more adept users. To provide nontechnical data workers with further guidance, we have also developed a workshop where we discuss the spirit of the Manifest’s recommendations.

For data science results to be reliable, we need responsible coding. This not only implies writing code that is clean and readable but also involves things like regular reviews and unit tests. Unit tests are tests that ensure that the code still delivers the same results after a change has been made. Reviews and tests, in turn, are the key to being transparent about policy processes and working consistently over time.
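As a minimal sketch of what such a unit test can look like in Python (the function, the 8% threshold, and the test cases are hypothetical, invented for illustration):

```python
def classify_ratio(capital, assets, threshold=0.08):
    """Flag an institution for review when its capital ratio is below the threshold."""
    if assets <= 0:
        raise ValueError("assets must be positive")
    return "review" if capital / assets < threshold else "ok"

# Unit tests: rerun after every change to confirm the results are unchanged
def test_classify_ratio():
    assert classify_ratio(10, 100) == "ok"      # ratio 0.10 is above the threshold
    assert classify_ratio(5, 100) == "review"   # ratio 0.05 is below the threshold
    try:
        classify_ratio(5, 0)
    except ValueError:
        pass  # invalid input is rejected, as intended
    else:
        raise AssertionError("expected ValueError for non-positive assets")

test_classify_ratio()
print("all tests passed")
```

In practice, tests like these would live in a separate file and be run automatically with a framework such as pytest, so a failing change is caught before it reaches users.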

Table 2. Data Science Hub Manifest. From Data Science Hub (DSH) project documentation.

Version control

1. Track changes in your code. For reproducibility, it is vital to keep track of changes in your code. The standard way to do this is to use a version control system like Git. If you are not familiar with Git, we strongly encourage you to get acquainted with it. At a minimum, archive copies of your code from time to time to keep a rough record of the states the code has taken during development.

Coding practices

2. Stick to language guidelines and practices. Adhere to language-specific guidelines on naming, line length, and other best practices. If you choose a certain style, apply it consistently; do not mix and match. Add explanatory documentation strings and comments to help colleagues understand your code.

3. Give your code some thought. Before starting to write code, try to define its requirements, that is, what it needs to do. Which functions or other structures do you need, and how do they interact? This provides guidance, exposes possible bottlenecks early, and hence saves time later on. Moreover, cleaner, more structured code is easier to reuse.

4. Avoid manual steps. In line with the above, avoid manual steps. Whether it is a manual data manipulation or copy-pasting code to repeat a process, this is not good practice. Instead, write a function to do the data manipulation or to repeat the process: that is what functions are for!

Peer review

5. Code review. Peer reviews are essential in academia, medicine, and other fields, and we are no exception. Code reviews are an important aspect of code development: regular reviews enhance code quality and structure and increase learning between colleagues. Ideally, develop code together with a colleague who can review your work. If you work alone, schedule a review with someone else.

Test

6. Verify that your code is correct. Testing whether the code you wrote behaves as expected is of vital importance to avoid bugs. This is usually (partly) done with (automated) unit tests. If you are not familiar with unit testing, we advise you to familiarize yourself with the concept. At a minimum, manually test your functions with a set of test inputs to see whether the output matches your expectations, and keep track of the results.

Project management

7. Project structure. When starting a project, take some time to think about how to structure your files and folders. A logical structure paves the way to a clean and maintainable code base and makes it easier for others to locate files and contribute to your project.

8. Storing output. Every project has one or more outputs, which can take several forms: trained/fitted models, figures, processed data, reports, and so on. Create dedicated directories to store these outputs.

9. Reproducibility: track software versions. Every project uses certain software. For reproducibility, it is essential to record the exact versions of the software used in your project. By doing this, the requirements to run the code are documented.

Celebrate

10. Celebrate your successes! Achieving change is hard and takes a long time. For motivation, it is therefore important to mark each achievement. The end of a project is a good time to acknowledge everyone’s contribution, reflect on lessons learned, and think about next steps.
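A few of the Manifest’s principles can be sketched in a handful of lines of Python. The example below is hypothetical: a small, documented function replaces a repeated manual step (principles 2–4), and the interpreter version is recorded for reproducibility (principle 9):

```python
import sys

def rescale(values, factor):
    """Rescale a list of numbers by a common factor.

    Principle 4: one function instead of copy-pasting the same
    arithmetic for every dataset.
    """
    return [v * factor for v in values]

# Instead of manually repeating the multiplication per dataset,
# apply the same function to each one (datasets are made up):
datasets = {"loans": [1.0, 2.0], "deposits": [3.0, 4.0]}
rescaled = {name: rescale(vals, 100) for name, vals in datasets.items()}
print(rescaled["loans"])  # -> [100.0, 200.0]

# Principle 9: record the exact software versions used in the project
print(f"python=={sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
```

In a real project, the version information would typically be written to a requirements file or lock file rather than printed, so that the exact environment can be recreated later.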

5.4. Sharing Internally and Externally

Code, once developed, is basically free to share. DNB has decided to treat code as data and apply the existing data sensitivity framework. Internally, the code is thus available through Azure DevOps once the sensitivity is determined. Externally, the DSH operates the DNB GitHub that hosts our publicly available packages.

Since public authorities do not compete with each other, we regularly connect with other agencies to exchange experiences. As part of these discussions, we also offer to share our open source code (mostly Python). Initially, this was not very successful, partly because interests did not converge and partly because coding languages and IT platforms were not aligned. More recently, however, we have had some success with our Digital Twin project, with multiple agencies contributing to its development. In addition, we have co-organized hackathons in Singapore and Florence to foster and explore ways of cooperation in a catalyst setting.

6. Conclusion and Way Forward

New granular data sources combined with new techniques provide unique opportunities for financial and public authorities, such as central bankers and supervisors. We have shared how we take advantage of these opportunities at DNB’s Data Science Hub. A well-defined strategy, mandate, goals, and a structured way of working definitely helped us apply data science in DNB.8

Based on our experience, we present the following lessons learned:

L-1. The combination of—sometimes novel—granular data offers new insights useful for both supervisors and policymakers.

L-2. Combining internal data with external resources increases the information value of the data, allowing supervisors and policymakers to improve the incorporation of external factors and risks. A mature and widely supported framework for data governance is needed to work responsibly with the various data sources.

L-3. Automating data processes results in more efficient, accurate, and higher-frequency economic (prediction) models and allows policymakers to focus on modeling rather than preparing the data. Responsible coding is needed to ensure the replication and reproducibility of analyses and policy decisions.

L-4. A data science function should be able to combine many different activities, and this requires the right mix of skills. Communication, in particular being able to explain complex matters to the business, is one important aspect, as the real value of data science lies in the combination of data science techniques and domain knowledge.

L-5. Machine learning has great potential and outlier detection models have already proven to be effective innovations in supervision. Complex machine learning approaches, such as neural networks, are, however, not a prerequisite for a successful data science project. In many cases, a simple model is a great place to start.

L-6. The value of data science applications depends on their adoption by the business; user-friendly interfaces that integrate the data science solution into the daily workflow are therefore important to reap the benefits.

L-7. A well-functioning IT environment is needed to facilitate data scientists, which can only be realized in close cooperation with IT.

L-8. Data science has value for the entire organization, including, for example, HR and business operations, and should therefore be in the ‘heart of the organization.’ Embracing new data science methods and using them throughout the organization requires sufficient appetite for experimentation—especially at senior levels—and therefore a common vision is key.

L-9. Last but not least, as the prior lessons show, fulfilling the conditions to make data science work is costly and takes time. The crux lies in tempering inflated expectations while also investing in projects that are more likely to pay off in the long run.

To conclude, we have three messages going forward: (i) just start doing it; (ii) next to the technical solutions, do not forget your clients—they are crucial for the transition toward a data-driven organization; and (iii) join forces and share.

First, we have discussed what we discovered and experienced along our journey to set up the DSH. However, this was not purely our journey, but an organization-wide effort. It is therefore not advisable to postpone data science until the necessities we identified (L-6–L-9) are in order: integrating data science into an organization takes two to tango. The desires of data scientists and those of the organization—both the end users and IT—first need to be aligned. While at DNB we are also still working on realizing the essentials discussed in this article, we hope that it inspires you to just start and simply work with what you have. There is certainly still room for improvement in the availability of big data and the tools to process and analyze it, in the implementation of data science solutions, and in the convergence of views on and expectations of data science. These obstacles are also experienced by other central banks and supervisors (Araujo et al., 2023). It may be frustrating to know that the full potential of a data science solution may not (yet) be unlocked due to factors beyond your influence. However, sometimes you only really know what you need once you are actually working on it.

Second, the challenges with respect to the IT environment have evolved over time. At first, most of our efforts were directed toward aligning the divergent desires of different users with internal policies and security standards. More recently, our efforts have shifted to the realization of technically more challenging solutions that can stand the test of time. But even an IT environment that supports an advanced data science platform with boundless opportunities is no guarantee of data science success. People are just as important, and a transition toward a data-driven organization requires equal effort from both the end users of data science tooling—supervisors and policymakers—and IT. At the moment, these worlds are still quite far apart. Especially as more data science solutions find their way to implementation and are used (and preferably maintained) by colleagues throughout the bank, it may be desirable to bring IT knowledge closer to the business and consider a BizDevOps way of working. Put differently, to succeed in embedding data science, organizations need to invest in acquiring these skills throughout the organization. Setting up a data science department is not enough.

Finally, organizations such as central banks and supervisors are all experiencing similar challenges in their journey to becoming more data-driven. In his study on the emergence of SupTech, Avramović (2023) finds that the use of technology to support supervisory processes is also related to competition between regulators. That is, regulators are duplicating each other’s efforts, trying to create the best SupTech solutions to keep up with developments in financial markets. Avramović notes that awareness of such inefficiencies could result in more open information sharing and collaboration between regulators. Open source tooling facilitates the sharing of functionality. In the words of our board member Steven Maijoor during his speech at our data science conference: “instead of sharing shiny PowerPoint presentations, we could share the functionality that allows us to replicate the analyses of others with our own data.” International coordination and collaboration could help accelerate this, and central banks are willing to join forces to reap the benefits of big data, as surveys show (Araujo et al., 2023; Di Castri et al., 2019; Doerr et al., 2021).

Some initiatives, such as the BIS Innovation Hub, have already been initiated to exchange information and collaborate. A quick win is to start sharing code (see DNB GitHub). Similarly, Araujo (2024) made a start by collecting open-sourced macroeconomic models run by central banks and other official sector agencies on GitHub.


Acknowledgments

The authors thank the Data Science Hub colleagues for their valuable work that fills the pages of this paper. The authors are also grateful to Wiebe Oechies, Pepijn Simonetti, and participants at the IFC Conference “Data Science in Central Banking” for their valuable comments.

Disclosure Statement

Patty Duijm and Iman van Lelyveld have no financial or nonfinancial disclosures to share for this article.


References

Araujo, D. (2024). Open-sourced central bank macroeconomic models. SSRN. http://dx.doi.org/10.2139/ssrn.4755247

Araujo, D., Bruno, G., Marcucci, J., Schmidt, R., & Tissot, B. (2023). Machine learning applications in central banking. Journal of AI, Robotics & Workplace Automation, 2(3), 271–293. https://hstalks.com/article/7815/machine-learning-applications-in-central-banking/

Avramović, P. (2023). Digital transformation of financial regulators and the emergence of supervisory technologies (SupTech): A case study of the U.K. Financial Conduct Authority. Harvard Data Science Review, 5(2). https://doi.org/10.1162/99608f92.7a329be7

Bank of England. (2022, October 26). Artificial intelligence and machine learning. Discussion Paper 5/22. https://www.bankofengland.co.uk/prudential-regulation/publication/2022/october/artificial-intelligence

Bricongne, J.-C., Meunier, B., & Pical, T. (2021, November 1). Can satellite data on air pollution predict industrial production? Banque de France Working Paper 847. https://www.banque-france.fr/system/files/2023-03/wp847_0.pdf

Broeders, D., & Prenio, J. (2018, May). Innovative technology in financial supervision (suptech) – the experience of early users. FSI Insights on Policy Implementation 9. https://www.bis.org/fsi/publ/insights9.pdf

Chakraborty, C., & Joseph, A. (2017, September). Machine learning at central banks. Bank of England Working Paper 674. https://www.bankofengland.co.uk/-/media/boe/files/working-paper/2017/machine-learning-at-central-banks.pdf

Chief Executives Board for Coordination. (2023, May 4). International Data Governance – Pathways to Progress. United Nations System. https://unsceb.org/sites/default/files/2023-05/Advance%20Unedited%20-%20International%20Data%20Governance%20%E2%80%93%20Pathways%20to%20Progress_1.pdf

Davenport, T., & Malone, K. (2021). Deployment as a critical business data science discipline. Harvard Data Science Review, 3(1). https://doi.org/10.1162/99608f92.90814c32

Di Castri, S., Grasser, M., & Ongwae, J. (2023). The state of suptech report. University of Cambridge.

Di Castri, S., Hohl, S., Kulenkampff, A., & Prenio, J. (2019, October). The suptech generations. FSI Insights on Policy Implementation 19. https://www.bis.org/fsi/publ/insights19.pdf

Doerr, S., Gambacorta, L., & Serena, J. M. (2021, March). Big data and machine learning in central banking. BIS Working Paper 930. https://www.bis.org/publ/work930.htm

Edge Delta. (2024, March). Breaking down the numbers: How much data does the world create daily in 2024? https://edgedelta.com/company/blog/how-much-data-is-created-per-day

Edmond, D., Prakash, V., Garg, L., & Bawa, S. (2022). Adoption of cloud services in central banks: Hindering factors and the recommendations for way forward. Journal of Central Banking Theory and Practice, 11(2), 123–143. https://doi.org/10.2478/jcbtp-2022-0016

Elliott, K., Coopamootoo, K., Curran, E., Ezhilchelvan, P., Finnigan, S., Horsfall, D., Ma, Z., Ng, M., Spiliotopoulos, T., Wu, H., & Van Moorsel, A. (2022). Know your customer: Balancing innovation and regulation for financial inclusion. Data & Policy, 4, Article e34. https://doi.org/10.1017/dap.2022.23

Harris, J. (2012, September 13). Data is useless without the skills to analyze it. Harvard Business Review. https://hbr.org/2012/09/data-is-useless-without-the-skills

Hüser, A.-C. (2015). Too interconnected to fail: A survey of the interbank networks literature. Journal of Network Theory in Finance, 1(3). http://ideas.repec.org/p/zbw/safewp/91.html

Hüser, A.-C., & Kok, K. (2019, April). Mapping bank securities across euro area sectors: Comparing funding and exposure networks. ECB Working Paper 2273. https://data.europa.eu/doi/10.2866/2648

Jansen, K. A. E. (in press). Long-term investors, demand shifts, and yields. Review of Financial Studies. https://doi.org/10.2139/ssrn.3901466

Jones, D., Snider, C., Nassehi, A., Yon, J., & Hicks, B. (2020). Characterising the Digital Twin: A systematic literature review. CIRP Journal of Manufacturing Science and Technology, 29(Part A), 36–52. https://doi.org/10.1016/j.cirpj.2020.02.002

Maijoor, S. (2022, May 13). Data science for supervision: What’s in it for us? [Speech transcript]. De Nederlandsche Bank. https://www.dnb.nl/algemeen-nieuws/speech-2022/speech-steven-maijoor-data-science-for-supervision-what-s-in-it-for-us/

Mewald, C. (2023, July 13). Why data is *not* the new oil and data marketplaces have failed us. Medium. https://towardsdatascience.com/why-data-is-not-the-new-oil-and-data-marketplaces-have-failed-us-b42dd87a0ba0

Nasution, M. K. M., Sitompul, O. S., Elveny, M., & Syah, R. (2021). Data science: A review towards the big data problems. Journal of Physics: Conference Series, 1898 (1). https://doi.org/10.1088/1742-6596/1898/1/012006

Nijhuis, M. (2022, March 13). Company name matching. Medium. https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334

Nijhuis, M., & Van Lelyveld, I. (2023). Outlier detection with reinforcement learning for costly to verify data. Entropy, 25(6). https://doi.org/10.3390/e25060842

Rietveld, G., Lange, N., & Duijm, P. (2023, September 26). Measuring intra-bank complexity by (not) connecting the dots with LEI. DNB Analysis. https://www.dnb.nl/media/hiad33od/dnb-analyse-measuring-intra-bank-complexity-by-not-connecting-the-dots-with-lei.pdf

Ullersma, C., & Van Lelyveld, I. (2022). Granular data offer new opportunities for stress testing. In D. Farmer, T. Schuermann, A. Kleinnijenhuis, & T. Wetzer (Eds.), Handbook of financial stress testing (pp. 185–207). Cambridge University Press. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3546906

Van den Boom, B., Hofman, R., Jansen, K., & Van Lelyveld, I. (2021, October 7). Estimating initial margins: The COVID-19 market stress as an application. DNB Analysis. https://www.dnb.nl/media/1oul4emp/estimating-initial-margins_dnb-analysis.pdf

Van Dijk, D., & De Winter, J. (2023, March). Nowcasting GDP using tone-adjusted time varying news topics: Evidence from the financial press. DNB Working Paper 766. https://www.ssrn.com/abstract=4382028

World Economic Forum. (2018, August). The new physics of financial services. https://www3.weforum.org/docs/WEF_New_Physics_of_Financial_Services.pdf


©2025 Patty Duijm and Iman van Lelyveld. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
