Abstract

In the past decade, online controlled experimentation, or A/B testing, at scale has proved to be a significant driver of business innovation. The practice was first pioneered by the technology sector and, more recently, has been adopted by traditional companies undergoing a digital transformation. This article provides a primer to business leaders, data scientists, and academic researchers on business experimentation at scale, explaining the benefits, challenges (both operational and methodological), and best practices in creating and scaling an experimentation-driven, decision-making culture.

Keywords: experimentation, A/B testing, product development, data-driven culture

Media Summary

Most technology companies, including Airbnb, Amazon, Booking.com, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, and Uber, regularly run online randomized controlled experiments—commonly referred to as A/B tests—to evaluate new products or changes. More recently, traditional companies undergoing a digital transformation have also started to adopt the practice. However, there are still many misconceptions about the benefits that experimentation brings to an organization, the operational hurdles in adopting and scaling the practice, and the methodological challenges presented by these novel applications. This article addresses these by providing a primer on business experimentation for leaders, managers, and data scientists seeking to understand the practical implications of experimentation adoption and integration.

We begin by explaining the benefits of a single product experiment: measuring causal impact and mitigating risk. Next, we discuss how experimentation at scale accelerates the pace of innovation and enables companies to develop products that delight their customers. We then describe the operational and methodological challenges organizations face when adopting experimentation. While it is easy to run a single experiment, there are considerable operational challenges to developing the culture and technical infrastructure necessary to experiment at scale. On the methodological side, most data scientists and academics assume that business experimentation is identical to the early work on field experiments that began with R. A. Fisher and J. Neyman, among many others (Box, 1980). However, these novel applications present theoretical challenges, which require skilled data scientists and academics to recognize this new area’s nuances and work together to make progress. Finally, we conclude the article with a practical discussion on how to scale experimentation.

1. Benefits of Experimentation

In basic online experiments, often called A/B tests, users are randomly assigned to see either the current version ‘A’ (often called the control) or the new version ‘B’ (often called the treatment). After some time, the winner is determined based on the effect size, as measured by the difference between treatment and control across various business metrics (Bakshy et al., 2014; Gupta et al., 2019; Gupta, Ulanova, et al., 2018; Kohavi et al., 2013; Kohavi & Thomke, 2017; Mao & Bojinov, 2021; Tang et al., 2010; Thomke, 2020; Urban et al., 2016; Xu et al., 2015). The process is similar to medical trials, where human subjects are randomly assigned to receive either the active treatment or the placebo, and the impact on clinical outcomes determines success (Baron, 2012; Brody, 2016; Fleiss, 1999; Piantadosi, 2017; Sommer & Zeger, 1991). Beyond these basic experiments, companies regularly use sequences of A/B tests in which the new version ‘B’ is gradually released to a larger proportion of the user base; these iterative experiments are often called a phased release or a controlled rollout (Xu et al., 2018). Along the way, the new version might be updated to incorporate learnings from earlier iterations or scrapped if it significantly negatively affects business metrics (Mao & Bojinov, 2021). This iterative approach is partially inspired by multi-phased clinical trials, in which the active treatment is offered to more and more people (Friedman et al., 2015; Pocock, 2013). Recently, there has also been an uptake in more complex experimental designs that allow companies to deal with violations of the basic assumptions that underpin the validity of randomized controlled trials; we discuss these in more detail in a subsequent section.

To quantify the value of integrating experimentation into an organization’s operations, we need to compare two hypothetical worlds: one in which the organization adopts an experimentation culture and one in which it does not. Unfortunately, we can never simultaneously observe these two worlds. So instead, we rely on our experience and learnings from multiple organizations to qualitatively describe the benefit of experimentation using a mixture of illustrative examples and personal reflections. We describe the benefits in three parts. First, we explain the value of running a single experiment. Second, we detail the additional value that adopting a culture of experimentation brings to a company. Finally, we flip our perspective to explain the benefits that customers gain from companies that foster an experimentation culture.

1.1. Benefits of a Single Experiment

1.1.1. Direct Causal Measurement of Product Changes

Experimentation is the gold standard for measuring the causal impact of a product change on a business metric (e.g., revenue, sales, or engagement). The random assignment into treatment and control groups ensures that any statistically significant observed differences in business metrics are unlikely to have been caused by factors other than the change (like seasonality or other product changes), so they can be attributed to the change (Imbens & Rubin, 2015). See Figure 1 for an example. Company employees use this information to augment their decision-making regarding which changes to implement (fully release) and which ones to roll back, shifting some of the decision-making power away from the company into customers’ hands. In addition, the causal measurement provides a direct estimate of the return on investment and allows managers to quantify what proportion of growth is driven by innovation.

**Figure 1. A product A/B test at Microsoft (Townsend, 2020).** Immediately after a release, there was a large decrease in the value of the outcomes, but there was no significant difference between the treatment and control group. Without experimentation, the analyst would have incorrectly attributed the drop in business metrics to the changes instead of external factors.
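The comparison behind such a causal measurement can be sketched in a few lines. The following is a minimal illustration, not the procedure of any company cited here: it estimates the effect as a difference in means and attaches a normal-approximation confidence interval and p-value. The function name and the revenue figures are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def difference_in_means(treatment, control, alpha=0.05):
    """Estimate the treatment effect as a difference in means.

    Returns the point estimate, a (1 - alpha) confidence interval, and a
    two-sided p-value, all based on a normal approximation. The approximation
    is only reasonable for large samples; the tiny lists below are purely
    illustrative.
    """
    effect = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal variances per arm.
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(effect) / se))
    return effect, (effect - z * se, effect + z * se), p_value


# Hypothetical per-user revenue in each arm (made-up numbers).
control = [10.0, 12.0, 9.0, 11.0, 10.5, 9.5, 10.0, 11.5]
treatment = [10.5, 12.5, 9.0, 11.5, 11.0, 10.0, 10.5, 12.0]

effect, ci, p_value = difference_in_means(treatment, control)
print(f"effect={effect:.4f}, CI=({ci[0]:.2f}, {ci[1]:.2f}), p={p_value:.2f}")
```

Because both arms experience the same external shocks, a drop that hits treatment and control alike cancels out of the difference, which is the situation Figure 1 describes.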

1.1.2. Risk Mitigation

Despite rigorous testing, a product change can still cause dissatisfaction among customers and have unintended consequences for the business. Experimentation allows companies to reduce the risk associated with new product launches and updates by initially decreasing the number of people who have access to the change (Mao & Bojinov, 2021; Xia et al., 2019; Xu et al., 2018). The alternative to A/B testing changes would be to release the final version to everyone without accurately measuring its causal impact. On the other hand, experimentation always comes with a simple ‘on/off’ switch and measurement capabilities that allow the company to know when something is wrong and quickly revert to an earlier version, thereby avoiding significant adverse effects. For example, in a recent study at LinkedIn, 29% of changes were deemed too risky and abandoned before they reached even 50% traffic (Xu et al., 2018).

Example (Dmitriev et al., 2017): The Microsoft News team changed the Outlook.com button on the top of the msn.com page to launch the Windows 10 mail app instead of navigating to the Outlook.com website. Initially, the change increased clicks on the mail app button—a strong positive result. However, after the initial surge, the number of clicks quickly decreased. After a careful analysis, the team realized that the initial increase was because users in the treatment arm did not know that the outlook.com button had changed; when a single click did not navigate a user to the outlook.com page, they repeatedly clicked until they discovered the new behavior. As a result, the experiment was shut down. Although it might appear that this experiment was a ‘failure,’ it successfully protected millions of customers and improved Microsoft’s understanding of its user base.

1.2. Benefits of an Experimentation Culture

Developing a culture of experimentation creates many additional benefits beyond that of a single experiment (Thomke, 2020). For an organization to fully realize these benefits, it needs a high level of humility, recognizing that we are often wrong about our customers’ wants (for example, in well-optimized domains like Bing and Google, only about 10% to 20% of changes have a positive effect on the target metrics [Kohavi et al., 2020]), and embracing the growth mindset to encourage continuous learning (Dweck, 2006).

1.2.1. Democratic, Consistent, and Streamlined Decision-Making: Data > Opinion

An experimentation culture enables a shift toward a more objective, democratic, and inclusive decision-making process in which senior leaders define the high-level metrics and strategy but devolve the product-specific decisions to the product teams. Transitioning authority from a small group of executives to more diverse teams can make organizations more innovative, intelligent, and profitable while reducing the likelihood of cognitive biases like groupthink and confirmation bias (Hunt et al., 2015; Phillips, 2014). Further, it allows for more streamlined, consistent, and fast decision-making, directly informed by the feedback gathered from real users during the experiment. Finally, defining standardized centralized metrics encourages senior leaders to identify and develop measurement capabilities that capture and direct business progress; this brings significant benefits beyond experimentation.

Example (Kohavi et al., 2020): At Bing, a software engineer had an idea to append the ads results title with the first sentence describing the result. It was one of the hundreds of ideas proposed, and the program managers deemed it a low priority. After six months of deprioritizing, the engineer decided to implement his idea in an experiment that demonstrated the change increased revenue without hurting key user-experience metrics. Without the capacity to experiment, this change would have never been made.

1.2.2. Faster, Systematic, and Confident Product Improvement: Iterative and Hypothesis Driven

Experimentation allows companies to systematically divide complex product changes into smaller modifications that can be built, tested, and released sequentially. Breaking down product changes improves organizational learning, which helps companies develop better products and quickly recover from a wrong hypothesis. The increased learning rate reduces the risk of innovation and instills more confidence and agility, increasing the speed of innovation. The iterative approach also allows companies to incorporate learnings from earlier experiments into the subsequent changes (Mao & Bojinov, 2021). The key is to determine the implementation order, which requires an understanding of which aspects of the product are independently valuable to customers and contain the most critical assumptions to test (Ries, 2011).

Example (McKinley, 2015): An Etsy product team experimented with replacing the multiple search result pages with a single page with infinite scroll. Unfortunately, the results were negative: users viewed fewer items and spent less. To understand what drove these results, the team identified the two assumptions behind the infinite scroll experiment and tested them separately. First, they tried showing more results per page and found almost no impact on engagement or sales. Then the team tried adding an artificial delay of 200 ms to test if user engagement was sensitive to performance. Again, the test did not find any effect. In the end, the team realized that these two experiments were much lower cost than building a functional infinite scroll experience and provided better insight into product direction.

1.2.3. Creating Institutional Knowledge: Leveraging Past Experiments Effectively

Another significant benefit is the buildup of institutional knowledge, collating experiments’ hypotheses and results. Having access to a repository of experiments allows employees to strengthen their intuition and emulate past successes while avoiding failures. In addition, there are several other benefits of creating strong institutional knowledge (Kohavi et al., 2020). First, it helps companies strengthen their experimentation culture by highlighting success, measuring the growth enabled by innovation, and quantifying the savings created by not releasing poor products. Second, by studying past experiments, companies can build best practice guides (Xu et al., 2018). Third, through meta-analysis (the practice of combining results of multiple experiments), data scientists can learn about the sensitivity of metrics and the relationships between various metrics—enabling a deeper understanding of the ecosystem (Bojinov, Chen, & Liu, 2020).

Example: Companies like Google, LinkedIn, and Microsoft have created institutional knowledge about the impact of change in performance on business metrics by designing a slowdown experiment. For instance, engineers at Bing created an experiment that added an artificial delay of 100 ms to the page loading times. The change resulted in a 0.6% drop in total revenue (Kohavi et al., 2014, 2020) and fundamentally changed the cost-benefit analysis of deploying new features that harmed page loading times.

1.2.4. Better Discovery of High Value Ideas: Explore the Long-Tailed Distribution of Returns Safely

Most changes a company makes to its products or services have a negligible effect on business metrics (see Figure 2 for a hypothetical illustration). For example, in a study at Microsoft, around two-thirds of all changes were found to have a negative or neutral effect on the metric they were designed to improve (Kohavi et al., 2020); in well-optimized domains, the share of changes with a positive effect is even lower. Without experimentation, most of these changes would be deployed, creating a bloated product that is constantly changing without adding real customer value. Experimentation allows managers to sieve through and identify the small fraction of genuinely impactful ideas that deliver customer value while guarding against changes that lead to adverse outcomes (Gupta, Kohavi, et al., 2018). This efficient exploration of ideas can significantly increase the pace of innovation.

**Figure 2. Hypothetical distribution of return on investment for ideas generated in a company.**

1.3. Delighting the Customer

Beyond the operational benefits, experimentation enables companies to create products and services that delight their customers by using real-time feedback to augment decision-making around deployment and investment opportunities. And unlike focus groups—the traditional way of obtaining customer feedback—experimentation gathers representative feedback at scale, making it the most customer-centric tool at a company’s disposal. In addition, the feedback allows product teams to listen to the voice of a diverse customer base, enabling companies to develop more inclusive products (Saint-Jacques et al., 2020). See Figure 3 for an example.

**Figure 3. The Word product team at Microsoft conducted a controlled rollout on a feature that changed the default view of Word on Android devices from the “print layout view” (left) to the “mobile view” (right) (Xia et al., 2019).**

The change allocated more screen space to the document and zoomed in to make it easier to edit. However, the experiment had mixed results: the engagement metrics were only positive for users who had recently modified a document. This group was more active, but they only accounted for about 30% of the user base. In addition, there was a 13% increase in how often users switched to the non-default view within 5 s of opening the document. The feedback was incorporated into the feature redesign, resulting in a final product that displayed mobile view by default only on those documents that users previously opened and closed while in the mobile view.

2. Unique Challenges in Online Experimentation

Two types of challenges arise when using business experimentation: operational and methodological. Operational refers to the process by which experimentation is integrated into the company and used to augment decision-making. Methodological refers to areas where new theoretical research needs to be developed to progress. Note that these challenges do not need to be solved before running experiments but are some of the hurdles organizations encounter as they grow experimentation.

2.1. Operational Challenges

2.1.1. Developing the Culture and Capacity for Experimentation-Driven Product Development

Building the culture and capacity to adopt experimentation at scale requires a data platform, an experimentation platform, and an investment in people and processes. See Figure 4.

**Figure 4. Components of a mature experiment organization.**

2.1.1.1. The Data Platform

The data platform meets the end-to-end data needs of an organization in a compliant, efficient, reliable, secure, and scalable manner. Its primary function is to collect telemetry and process it into a standard log containing all vital data. These data are used to define standardized metrics that measure all operational aspects of the company. Specific to experimentation, the platform also needs to collect, record, and track users’ treatment assignments. Most data platforms also include analytic, reporting, and monitoring functions built on top of the centralized repository of metric definitions. As reporting and monitoring require the (costly) regular computation of metrics, the platform should be efficient and intelligent, automatically managing computing resources to meet its service-level agreement for the availability of reports while limiting the computational costs. This requires balancing the frequency and quantity of data consumed and metrics computed for different business areas and product experiments. For instance, the platform may use data sampling to reduce cost, specialized pipelines with low latency for critical metrics, or limit the number of metrics and computation frequency for a stable product area with few changes (Kohavi et al., 2020). The platform also has to be compliant, respecting the organization’s policy around data usage and legal requirements. Developing a robust data platform is frequently catalyzed by the desire to run valid experiments on stable and standardized metrics (Bojinov & Lakhani, 2020).

2.1.1.2. Experimentation Platform

The experimentation platform has three subcomponents: design, execution, and analysis (Bojinov & Lakhani, 2020; Gupta, Ulanova, et al., 2018). The design component has four steps: defining a hypothesis for an experiment, determining the primary success and guardrail metrics, selecting the target audience, and running a power analysis to determine the size of the traffic allocation (Kohavi et al., 2020). The experiment execution component ensures that the randomization scheme is executed correctly, efficiently, and persistently (a unit has the same assignment for the entire experiment duration); this can be checked by comparing the expected numbers of units in treatment with the realized numbers (Fabijan et al., 2019). A flexible experimentation platform that supports various hypotheses should handle units of randomization beyond customers, for example, browser cookies, devices, or network clusters (Kohavi et al., 2020; S. Liu et al., 2019). The experiment platform should also be able to exclude some units from experiments (Gupta, Ulanova, et al., 2018). A smart execution engine protects customers from egregious treatments by safely ramping up traffic while building confidence that the treatment is not harmful and quickly shutting down a negative experience using near-real-time data (Xu et al., 2018). The analysis component requires standardized metric definitions (as described in the data platform section) to ensure consistency (Boucher et al., 2019), an automatic statistics engine for results computation, and result reporting and monitoring capabilities. Typically, the statistics engine includes variance reduction techniques (Deng et al., 2013; H. Xie & Aurisset, 2016), multiple hypothesis testing corrections (Deng & Alex, 2015; Y. Xie et al., 2018), and automatic detection of interaction between experiments (Gupta et al., 2019). Some experimentation platforms may also include methods like multi-armed bandits for experiments with many treatments, sequential testing for early degradation detection, and causal inference methods based on observational data (Bojinov, Chen, et al., 2020; Deb et al., 2018). Finally, the reporting and monitoring should provide results in a standardized, easily digestible manner (Gupta, Ulanova, et al., 2018) and proactively send alerts flagging quality issues or regressions.
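Two of the steps above lend themselves to a short illustration: the power analysis in the design component and persistent assignment in the execution component. The sketch below is one common way to implement them, assuming a two-sided, two-sample z-test and hash-based bucketing. The function names, the salt format, and the parameter values are hypothetical and not taken from any platform cited here.

```python
import hashlib
import math
from statistics import NormalDist


def assign(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a unit to an arm.

    Hashing the experiment name together with the unit id gives the same
    answer on every call, so a unit keeps its assignment for the whole
    experiment without any stored state.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # approximately uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def units_per_arm(sigma: float, min_effect: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate units needed per arm to detect `min_effect`.

    Uses the textbook formula n = 2 * sigma^2 * (z_{1-alpha/2} + z_power)^2
    / min_effect^2 for equally sized arms with a common standard deviation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * (z_alpha + z_power) / min_effect) ** 2)


# Hypothetical usage: a metric with standard deviation 1.0 and a minimum
# effect of interest of 0.1.
print(units_per_arm(sigma=1.0, min_effect=0.1))
print(assign("user-42", "new-checkout"))
```

The unit id can equally be a cookie, device, or cluster identifier, which is how the same assignment function would support the other units of randomization mentioned above.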

2.1.1.3. Build vs. Buy

An early and important decision that leaders have to make when adopting experimentation is to determine if they should build or buy the necessary tools. Several factors shape this debate: costs, urgency (or time), and flexibility. The costs associated with building an experimentation platform come primarily from assembling a team of software engineers, data scientists, and user experience designers to develop and maintain the tool. On the flip side, buying an external platform has an upfront cost of integrating with the company’s existing process and a monthly fee that can depend on the number of experiments and user traffic. In terms of urgency, most external experimentation platforms can be integrated within a few weeks or months, whereas building an internal platform could take many months. The main drawback of an external platform is the level of flexibility and customization offered. For example, most platforms only offer basic A/B testing capabilities but have little support for estimating long-term impact from short-term data, handling interference among experimental units, and estimating heterogeneous treatment effects (see Section 2.2). Our experience suggests that most small companies and traditional companies undergoing a digital transformation opt to purchase external platforms, whereas established technology companies build these in-house. As covered in the next section, we recommend that an organization invest in in-house expertise for creating an experimentation culture; this expertise lets the organization evaluate its current and future experimentation needs and revisit the build vs. buy decision as those needs evolve.

2.1.1.4. People and Process

Having a data and experimentation platform is not enough to ensure the development of an experiment-driven culture. Companies need to invest in people and processes to create and sustain a culture where every change is tested before it is deployed (Fabijan et al., 2017, 2018; Gupta et al., 2019; Thomke, 2020). Leadership plays a crucial role in facilitating the transformation by defining product strategy and the overall evaluation criterion for a product and creating an environment where teams and individuals are empowered and incentivized to use experimentation. Initially, data scientists are needed across the entire experimentation lifecycle to ensure reliable designs and that product decision makers understand the results. As the operation scales, leadership needs to educate general employees to increase knowledge and understanding of experimentation, automate processes to reduce switching costs, and enhance trust to drive the adoption and self-service of everyday use cases. At this stage, experimentation becomes a self-service tool where data scientists provide documentation, support channels for immediate help, and offer office hours for general questions. This shift frees data scientists to focus on developing specialized experimental designs and improving the different parts of the experiment and data platform. Throughout, the company should focus on shortening the build-measure-learn cycle (Ries, 2011) by testing the most important assumptions first (based on past experiment results), holding regular experiment reviews to ensure quality, and treating all experiment results (both negative and positive) as learnings that are disseminated throughout the organization.

2.1.2. Decision Criterion

The results of experiments are used to augment decision-making around deployment, product improvement, and investment opportunities. To make the process more efficient, scalable, and transparent, we suggest that each experiment have a predefined decision criterion (a set of rules and guidelines that describe how to translate the results into decisions). As the decision criterion guides product development, it should be defined ahead of time for an entire product area but regularly reviewed and updated to ensure that it aligns with the company’s strategy (Dmitriev & Wu, 2016). In addition, a good decision criterion ensures overall alignment between stakeholders, and holds decision makers accountable for product success. Finally, it facilitates consistent and fast decision-making at scale in an organization regardless of who is in the room making the decision. A good decision criterion has five main properties, as shown in Figure 5, and relies on four categories of metrics described in Table 1 (Deng et al., 2017; Dmitriev et al., 2017; Shi et al., 2019).

**Figure 5. Properties of a good decision criterion (Gupta, Kohavi, et al., 2018; Gupta & Machmouchi, 2022; Kohavi et al., 2020).**

Table 1. A taxonomy of metrics: The example column is for a hypothetical experiment introducing a new product.

| Metric Category | Question Answered | Description | Example |
| --- | --- | --- | --- |
| Data/experiment quality | Can the experiment results be trusted? | Check if the experiment was correctly implemented. | Difference between the expected number of units in the treatment and the realized. |
| Overall evaluation criterion (OEC) | Did the change meet the intended goal? | The primary outcome of the experiment. | Revenue generated from the new product. |
| Guardrail | Did the change have any adverse effects? | The ‘do no harm’ metrics. | Total company revenue, daily active users, page loading times. |
| Diagnostic or feature | Why did the OEC move? | The secondary outcomes of the experiment. | Engagement with the new product, number of complaints. |
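To make the idea of a predefined decision criterion concrete, the sketch below encodes one possible set of rules over the metric categories in Table 1. It is an illustration under stated assumptions rather than a recommended policy: the thresholds, the function names, and the convention that every metric is oriented so that higher is better (page loading time would be sign-flipped) are all hypothetical.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


def srm_p_value(n_treatment: int, n_control: int,
                expected_share: float = 0.5) -> float:
    """Two-sided z-test that the realized treatment share matches the design."""
    n = n_treatment + n_control
    se = math.sqrt(expected_share * (1 - expected_share) / n)
    z = (n_treatment / n - expected_share) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


@dataclass
class MetricResult:
    effect: float   # treatment minus control, oriented so higher is better
    ci_low: float   # lower bound of the confidence interval
    ci_high: float  # upper bound of the confidence interval


def decide(n_treatment: int, n_control: int,
           oec: MetricResult, guardrails: dict) -> str:
    """Translate experiment results into a decision, checked in a fixed order."""
    # 1. Data/experiment quality: a mismatch between expected and realized
    #    unit counts means the results cannot be trusted.
    #    The 0.001 threshold is an illustrative choice.
    if srm_p_value(n_treatment, n_control) < 0.001:
        return "invalid: investigate sample ratio mismatch"
    # 2. Guardrails: block the launch if any 'do no harm' metric clearly regressed.
    harmed = sorted(name for name, m in guardrails.items() if m.ci_high < 0)
    if harmed:
        return "do not ship: guardrail regression in " + ", ".join(harmed)
    # 3. OEC: ship only if the primary outcome clearly improved.
    if oec.ci_low > 0:
        return "ship"
    # Diagnostic metrics are not part of the rule; they help explain the outcome.
    return "iterate: no detectable OEC improvement"


# Hypothetical results for a new-product experiment.
print(decide(5000, 5000, MetricResult(1.0, 0.2, 1.8),
             {"daily_active_users": MetricResult(0.0, -0.4, 0.4)}))
```

Writing the rules down ahead of time in this form is what makes the decision independent of who is in the room.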

There is usually an evolution of a decision criterion as the product and experimentation mature—with the addition of more data-quality, guardrail, and diagnostic metrics that provide richer insights over time about the intended and unintended impact of a change and alert against any degradations. The overall evaluation criterion (OEC) metrics, on the other hand, tend to progress through three stages, as depicted in Figure 6.

**Figure 6. Evolution of the overall evaluation criterion (OEC) metrics as experimentation matures.**

2.2. Unique Methodology Challenges

The foundations of the design and analysis of experiments were developed over a century ago, mostly motivated by applications in agriculture and later in medical studies (Box, 1980). Companies’ initial applications of experimentation are fairly aligned with these theoretical foundations. However, as companies run thousands of experiments on millions of connected users, tracking hundreds of outcomes to augment product decisions, new challenges start emerging (Kohavi et al., 2020). Below we detail what we consider the three most prominent and explain the latest research that has attempted to tackle these open questions (Bojinov, Saint-Jacques, et al., 2020; Gupta et al., 2019). Throughout, our focus will be on building intuition; the details can be found in the citations. Of course, companies continue to face classical problems associated with null hypothesis testing like increasing power (Deng et al., 2013; Gupta et al., 2019; Kharitonov et al., 2017; H. Xie & Aurisset, 2016), multiple hypothesis testing (Deng & Alex, 2015; Y. Xie et al., 2018), and early stopping (Deng et al., 2016; Johari et al., 2015), but we will not focus on these.

2.2.1. Estimating Long-Term Impact From Short-Term Data

Most companies run experiments for less than a month, making it impossible to observe the long-term impact of changes; nevertheless, managers and leaders must make long-term strategic decisions with data collected from these short-lived experiments (Gupta et al., 2019). There are two reasons why this presents a significant challenge for organizations. First, many companies have success metrics that take months to materialize; for example, the impact of updating the ranking results in an online marketplace for accommodation (hotels/rentals) on customer satisfaction is not fully understood until customers stay in their accommodation, months after booking. Second, the user preferences and behavior may change over time as the user learns how to better interact with a new feature and an initial novelty effect wears off. For example, an experiment that increased the number of ads on a search results page may increase revenue in the first couple of weeks—the typical duration of an experiment—but might have the opposite effect months later due to user attrition and the growth of ad blindness (Dmitriev et al., 2016; Hohnhold et al., 2015). So how do we capture long-term impact from brief experiments? This open question requires new methodological contributions and advancement from the data science community.

To tackle the first challenge, we can develop surrogate metrics built on top of the principal stratification framework (Frangakis & Rubin, 2002) and require that the outcome (in our case, a long-term outcome like future revenue) is directly affected by the experiment only through the change in the surrogate. That is, if the experiment registers an effect on the surrogate, then there will be an effect on the outcome. For example, an internet company may be interested in the causal effect of a change in the user experience on long-term engagement with a specific website; possible surrogates could include detailed measures of medium-term engagement, including which of many webpages were visited and how long a user spent on each one (Athey et al., 2019). Although considerable progress has been made in this area, there are still many open questions. For example, it is hard to validate that the entire treatment effect is through a surrogate—this could be improved by making the metric more complex, but it runs the risk of reducing the interpretability, making it harder to debug and track.
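One simple way to operationalize a surrogate is sketched below. It is a deliberately simplified illustration and not the method of the cited papers. It assumes a single numeric surrogate, a linear relationship between the surrogate and the long-term outcome learned from historical data, and that the treatment affects the outcome only through the surrogate. All names and numbers are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope


def predicted_long_term_effect(hist_surrogate, hist_outcome,
                               treatment_surrogate, control_surrogate):
    """Predict the long-term effect from short-term surrogate measurements.

    Step 1: learn how the long-term outcome depends on the surrogate from
            historical data where both were observed.
    Step 2: predict the long-term outcome for every unit in the experiment
            and compare the average prediction across arms.
    """
    intercept, slope = fit_line(hist_surrogate, hist_outcome)
    predicted_treatment = [intercept + slope * s for s in treatment_surrogate]
    predicted_control = [intercept + slope * s for s in control_surrogate]
    return mean(predicted_treatment) - mean(predicted_control)


# Hypothetical data: medium-term visits (surrogate) and later revenue (outcome).
hist_visits = [1.0, 2.0, 3.0, 4.0]
hist_revenue = [3.0, 5.0, 7.0, 9.0]
print(predicted_long_term_effect(hist_visits, hist_revenue,
                                 treatment_surrogate=[3.0, 4.0],
                                 control_surrogate=[2.0, 3.0]))
```

The sketch also shows why validation is hard: if the treatment moves the long-term outcome through a channel the surrogate does not capture, the prediction is biased and nothing in the short-term data reveals it.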

To tackle user learning and the wearing-off of a novelty effect, researchers have made progress using a stepped-wedge design to approximate the long-term effects of an experiment (Basse et al., 2022; Hohnhold et al., 2015). In a stepped-wedge design, the number of experimental units in the treatment arm is gradually increased over time (e.g., 1% → 2% → 5% → 10% → ... → 50%). The users assigned to the treatment on day one are tracked throughout the experiment separately from the users assigned on subsequent days; contrasting how the effect on these early users has changed with the effect on users who received the treatment at a later date provides an approximation of the long-term impact of the intervention. Basse et al. (2022) provide an optimal minimax design for these classes of experiments. However, this approach requires a special experiment design and needs a longer time to measure user learning. Methods are also being developed to detect and estimate the novelty effect in a simple A/B test (Chen et al., 2018; Sadeghi et al., 2021). All of these approaches require assumptions about the user learning model to estimate the long-term impact. Fundamentally, to make progress, researchers must recognize that time is an integral component of some business experiments and must focus on the adoption of experimental methodologies that leverage dynamic treatment effects (Bojinov et al., 2021).
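The cohort bookkeeping behind a stepped-wedge ramp can be sketched as follows. This is a naive illustration, not the minimax design of Basse et al. (2022): it assumes hash-based bucketing, one ramp step per day, and a simple same-day contrast between the longest-treated and the newest-treated cohorts. The ramp list keeps only the percentages named in the text, and all function names are hypothetical.

```python
import hashlib
from statistics import mean

# Cumulative treatment percentages per step. Only the values given in the
# text are used; the intermediate steps it elides are not filled in.
RAMP = [1, 2, 5, 10, 50]


def bucket(unit_id: str, experiment: str) -> float:
    """Map a unit to a stable position in [0, 100)."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return 100 * int(digest[:8], 16) / 16**8


def first_treated_step(unit_id: str, experiment: str, ramp=RAMP):
    """Return the ramp step at which the unit enters treatment, or None.

    Because the percentages only grow, a unit that enters treatment at one
    step stays treated at every later step, so cohorts are nested.
    """
    position = bucket(unit_id, experiment)
    for step, percent in enumerate(ramp):
        if position < percent:
            return step
    return None  # the unit remains in control throughout


def learning_contrast(outcomes, day: int) -> float:
    """Same-day contrast between the oldest and the newest treated cohort.

    `outcomes` maps (cohort_step, day) to a list of outcomes. A nonzero
    contrast suggests that the effect changes with time since exposure.
    """
    return mean(outcomes[(0, day)]) - mean(outcomes[(day, day)])


# Hypothetical outcomes on day 2 for the day-0 cohort and the day-2 cohort.
outcomes = {(0, 2): [1.0, 3.0], (2, 2): [1.0, 2.0]}
print(first_treated_step("user-42", "new-ranking"))
print(learning_contrast(outcomes, day=2))
```

The contrast is only interpretable because bucket positions are effectively random, so cohorts that enter at different steps are comparable apart from their time in treatment.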

In rare cases, an organization may run a longer experiment (e.g., holdout experiments) to measure long-term outcomes directly. This approach, however, is rarely the preferred option due to fairness considerations and the complexity of maintaining product versions and tracking user units (Bojinov, Saint-Jacques, et al., 2020; Dmitriev et al., 2016).

2.2.2. Interference Among Units

There are many products and services where one person’s actions (such as posting a tweet, liking a Facebook post, or sending an email) are potentially seen by hundreds and sometimes even thousands of people. That means that if a company runs an experiment that changes some users’ behavior, those users could, in turn, have a knock-on effect on other units (even ones assigned to control); this is known as interference (Hudgens & Halloran, 2008). The challenge is exacerbated because the units of randomization are not actual people but cookies, devices, or log-in IDs that do not have a one-to-one correspondence with a person; one person can have many such IDs and vice versa (Coey & Bailey, 2016; Gupta et al., 2019).

The interference structure may be visible in the data (e.g., the Facebook friendship graph) or not visible (e.g., a person having multiple cookies for a website). We further distinguish between three types of interference: partial interference, marketplace interference, and arbitrary interference. In partial interference, each unit has a subset of other units whose treatment assignments directly affect its outcome (Aronow & Samii, 2017). For example, in a social network, we can assume that the subset for each unit is the set of all direct connections. In marketplace interference, units compete for the same resource. For instance, advertising auctions can be modeled as exhibiting marketplace interference because a set of companies bids for one advertising slot in each auction (M. Liu et al., 2020). Finally, arbitrary interference is the setting in which each unit’s outcome may be affected by every other unit’s treatment assignment. For instance, a promotion for drivers in a ride-sharing app to pick up passengers from an underserved area may impact the experience of other drivers and passengers across the city (Bojinov, Simchi-Levi, et al., 2020).

A common way to handle partial interference among units when the interference is well understood is to leverage a design that alleviates or accounts for interference by creating groups that contain most of the interference within them and limit (or altogether remove) interference across them (Eckles et al., 2017; Ugander et al., 2013). For example, we could define groups by grouping users with high interaction rates or common characteristics (Kohavi et al., 2020). Once the groups are defined, we can vary the treatment across and within them (Saint-Jacques et al., 2019). Typically, the cross-group variation allows us to measure the primary treatment effect, whereas the within-group variation allows for estimating the spillover effect. The main challenge of this approach is finding groups that limit the interference while still having enough power to detect business-relevant effects.
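A minimal Python sketch of the cross-group part of such a design follows. It assumes the groups are already given (here, a made-up mapping of users to clusters), randomizes whole clusters, and compares cluster-level averages. The names `assign_clusters` and `cluster_level_effect` and the synthetic outcomes are hypothetical; within-group variation for spillover estimation is not shown.

```python
import random
from statistics import mean


def assign_clusters(cluster_ids, seed=0):
    """Randomize whole clusters: half go to treatment (True), half to control (False)."""
    rng = random.Random(seed)
    order = sorted(set(cluster_ids))
    rng.shuffle(order)
    half = len(order) // 2
    return {c: (i < half) for i, c in enumerate(order)}


def cluster_level_effect(outcome, cluster_of, arm_of_cluster):
    """Difference in means of per-cluster average outcomes (the cluster is the analysis unit)."""
    by_cluster = {}
    for user, y in outcome.items():
        by_cluster.setdefault(cluster_of[user], []).append(y)
    treated = [mean(ys) for c, ys in by_cluster.items() if arm_of_cluster[c]]
    control = [mean(ys) for c, ys in by_cluster.items() if not arm_of_cluster[c]]
    return mean(treated) - mean(control)


# Hypothetical data: 200 users in 20 clusters of 10 closely connected users.
cluster_of = {user: user // 10 for user in range(200)}
arm_of_cluster = assign_clusters(cluster_of.values())
# Synthetic outcome with a constant lift of 2 for users in treated clusters.
outcome = {u: 5.0 + 2.0 * arm_of_cluster[c] for u, c in cluster_of.items()}
estimate = cluster_level_effect(outcome, cluster_of, arm_of_cluster)
print(estimate)
```

Because every user in a cluster shares an arm, interactions inside a cluster never mix treatment and control. The price is that the effective sample size is the number of clusters rather than the number of users, which is the power trade-off noted above.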

One popular method for experimenting in the presence of marketplace interference in online advertising markets is the budget-splitting design (Basse et al., 2016; M. Liu et al., 2020). In this design, each advertiser’s budget is split in half, with one half assigned to the control version and the other to the treatment. The experimental users who can see the adverts are also divided between the two versions, creating two parallel marketplaces with the same size and budget. This design removes the bias, but it limits the number of experiments that can be conducted simultaneously on the same population and does not allow for a gradual rollout of new features. In addition, changes to marketplaces may take a long time to reach an equilibrium, creating an important connection between interference and measuring long-term effects from short-term data. This is an important open problem and requires new methodological approaches.
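The split itself can be sketched in a few lines of Python. This is only an illustration of the design as described above, with made-up advertiser budgets and a hypothetical function name `budget_split`; it does not model the auctions or the analysis in Basse et al. (2016) or M. Liu et al. (2020).

```python
import random


def budget_split(advertiser_budgets, user_ids, seed=0):
    """Create two parallel marketplaces, each with half of every budget and half of the users."""
    rng = random.Random(seed)
    users = list(user_ids)
    rng.shuffle(users)  # users are divided at random between the two versions
    half = len(users) // 2
    half_budgets = {adv: budget / 2 for adv, budget in advertiser_budgets.items()}
    return {
        "control": {"budgets": dict(half_budgets), "users": set(users[:half])},
        "treatment": {"budgets": dict(half_budgets), "users": set(users[half:])},
    }


# Hypothetical inputs.
budgets = {"advertiser_a": 1000.0, "advertiser_b": 400.0}
markets = budget_split(budgets, range(10_000))
print({name: (len(m["users"]), sum(m["budgets"].values())) for name, m in markets.items()})
```

Each advertiser competes in both marketplaces with an independent half budget, so spending in the treatment marketplace cannot deplete the budget available to control users.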

For arbitrary interference, one promising strategy is to transform the problem into temporal interference and use a panel experiment structure (Bojinov et al., 2021). For instance, consider a food delivery service that is a three-sided platform connecting restaurants to consumers through third-party delivery agents in a city, say Boston. In this example, evaluating how a matching algorithm impacts business metrics in Boston is made difficult because small changes to any side of the platform could propagate to untreated units, biasing the results of standard experiments. Instead, researchers have suggested treating Boston as the experimental unit (i.e., all users in Boston are aggregated) and randomly alternating treatment exposure over time. This class of experiments is referred to as time series experiments, of which the special class of switchback experiments (Bojinov, Simchi-Levi, et al., 2020; Bojinov & Shephard, 2019), where treatment assignment is independent of past observed data, has received the most attention.
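As a rough Python illustration of a switchback experiment, the sketch below assigns each time period to treatment by an independent coin flip and estimates the average effect with a Horvitz-Thompson-style estimator. It makes a strong simplifying assumption that is labeled in the code: no carryover from one period to the next. The cited papers address carryover explicitly; the function names and the numbers here are hypothetical.

```python
import random


def switchback_assignment(n_periods, p=0.5, seed=0):
    """Assign each period to treatment with probability p, independently of past data."""
    rng = random.Random(seed)
    return [rng.random() < p for _ in range(n_periods)]


def horvitz_thompson_effect(outcomes, assignment, p=0.5):
    """Inverse-probability-weighted estimate of the average per-period effect.

    ASSUMPTION: no carryover, i.e., a period's outcome depends only on that
    period's own assignment. Real switchback analyses must handle carryover.
    """
    total = 0.0
    for y, treated in zip(outcomes, assignment):
        total += y / p if treated else -y / (1 - p)
    return total / len(outcomes)


# Aggregated city-level metric per period (hypothetical numbers).
assignment = [True, False, True, False]
outcomes = [12.0, 10.0, 12.0, 10.0]
effect = horvitz_thompson_effect(outcomes, assignment)
print(effect)

# In a live experiment the schedule would be drawn at random instead:
schedule = switchback_assignment(n_periods=48)
```

Treating the whole city as one unit sidesteps interference between users, at the cost of having only as many observations as there are time periods.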

Experiment designs that account for interference are more costly to run because of the more complicated design. Experimenter judgment is needed to understand if interference will change a deployment decision, and better a priori estimation techniques are needed for detecting interference from standard experiments. If the effect of interactions (second-order effect) is small and does not change the decision outcome, an organization may choose designs that ignore the interference.

2.2.3. Measuring Heterogeneous Treatment Effects

Most experiment analyses by companies focus on estimating the average treatment effect: the average impact of the proposed change (the treatment) compared with the current offering (the control). However, this estimand fails to capture the whole story when the treatment effect varies across individuals in the study, a phenomenon known as heterogeneous treatment effects. For instance, consider an experiment in which a professional social network started notifying its users about a job opening (Saint-Jacques et al., 2020). Suppose that users (on average) submitted one extra job application in the treatment arm—a significant win for the company if the effect was the same for everyone. However, it is possible to see the same average result if a handful of users applied to many more jobs and everyone else stopped applying—a somewhat less desirable outcome for the company. Understanding which of these scenarios is true requires looking at estimands beyond the mean. However, it is virtually impossible to assess the effect of the intervention on each unit since we only observe them either receiving treatment or control at a given time. So, instead, we focus on estimating the effect of the intervention on subgroups of units that share similar characteristics (Athey & Imbens, 2016; Sepehri & DiCiccio, 2020; Wager & Athey, 2018). For example, companies use predefined features (e.g., geographical regions, operating systems, or other user-specific features) or apply algorithms to identify such groups. Below we provide two examples where examining heterogeneous effects is critical in making progress; we then describe strategies to select appropriate groups using algorithms.
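The job-application example can be made concrete with a small Python sketch. The data are invented so that two scenarios share the same average treatment effect of one extra application while differing sharply across two hypothetical segments ("heavy" and "light" users); the function names are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def effect(rows):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for _, is_treated, y in rows if is_treated]
    control = [y for _, is_treated, y in rows if not is_treated]
    return mean(treated) - mean(control)


def subgroup_effects(rows):
    """Estimate the effect separately within each predefined segment."""
    by_segment = defaultdict(list)
    for row in rows:
        by_segment[row[0]].append(row)
    return {segment: effect(seg_rows) for segment, seg_rows in by_segment.items()}


# Rows are (segment, is_treated, job applications). Control users all submit 2.
control = [("heavy", False, 2)] * 2 + [("light", False, 2)] * 8
# Scenario A: every treated user submits one extra application.
uniform = control + [("heavy", True, 3)] * 2 + [("light", True, 3)] * 8
# Scenario B: a handful of users apply far more, and everyone else stops.
skewed = control + [("heavy", True, 15)] * 2 + [("light", True, 0)] * 8

print(effect(uniform), effect(skewed))  # identical averages
print(subgroup_effects(uniform))
print(subgroup_effects(skewed))
```

Both scenarios report the same average effect, but the segment-level estimates reveal that in the second scenario the gain comes entirely from heavy users while light users are harmed.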

2.2.3.1. Algorithmic Fairness

Algorithmic fairness is often defined by requiring algorithms to provide comparable performance across different demographic groups (Corbett-Davies et al., 2017; Saxena et al., 2019). Most researchers have focused on delivering these guarantees through offline evaluation (by looking at holdout data sets); however, the real impact of the algorithm in a real-world environment with multiple actors can only be measured through experimentation. In practice, the evaluation requires measuring the effect across many combinations of demographic features to identify groups that react differently. For many companies, this is challenging because they do not collect demographic information from their users, and even when they do, it is often incomplete and creates a fairness-privacy trade-off. Companies want to protect user privacy while providing fair services, but it is challenging to do both well simultaneously; it is therefore particularly important to extend experimental methods to work with differential privacy (Ding et al., 2018; Dwork et al., 2006). One particularly useful approach is to track the Atkinson index, a well-known measure of inequality, which quantifies the extent to which any outcome favored only the most engaged users rather than all users equally (Saint-Jacques et al., 2020).
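For readers unfamiliar with the Atkinson index, the following Python sketch computes its standard textbook form for strictly positive outcomes. The inequality-aversion parameter `epsilon`, its default value, and the example numbers are assumptions for illustration; how Saint-Jacques et al. (2020) apply the index within an experiment is not reproduced here.

```python
import math


def atkinson_index(values, epsilon=1.0):
    """Atkinson inequality index: 0 means perfect equality, values near 1 mean high inequality.

    Requires strictly positive values and epsilon >= 0 (a simplification
    adopted for this sketch).
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    n = len(values)
    mu = sum(values) / n
    if epsilon == 1.0:
        # Equally distributed equivalent is the geometric mean.
        ede = math.exp(sum(math.log(v) for v in values) / n)
    else:
        power = 1.0 - epsilon
        ede = (sum(v ** power for v in values) / n) ** (1.0 / power)
    return 1.0 - ede / mu


# Hypothetical per-user outcomes with the same mean (25) in both arms.
control_outcomes = [20, 25, 25, 30]
treatment_outcomes = [1, 1, 1, 97]
print(atkinson_index(control_outcomes), atkinson_index(treatment_outcomes))
```

Comparing the index between treatment and control indicates whether a change concentrated its benefit on a few users even when the average outcome is unchanged.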

2.2.3.2. Heavy User Bias

Experimentation results can be disproportionately driven by the actions of more engaged users (or larger customers) at the expense of less active users (or smaller customers). As a result, companies can achieve rapid growth in the short term by focusing on this core set of heavy users (or customers). In the long term, however, these practices may disenfranchise less engaged users and remove a significant growth opportunity, as today’s less active users are tomorrow’s heavy users. Metric definitions that weigh each user equally (e.g., percentage of users with at least one action or average actions per session instead of average actions per user) (Gupta & Machmouchi, 2022) and careful sampling techniques can help mitigate this issue (Wang et al., 2019). Further, a better understanding of missing or sparse data is required to understand the segments of the target population that do not use a product or are less engaged today but present ample growth opportunities (Bojinov, Pillai, et al., 2020; Rubin, 1976).
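A small Python sketch with invented numbers shows how the choice of metric changes the conclusion. The two metric functions and the per-user action counts are hypothetical; the point is only that a volume-based average can rise while a metric that weighs each user equally falls.

```python
def actions_per_user(actions):
    """Average actions per user: dominated by the heaviest users."""
    return sum(actions.values()) / len(actions)


def share_of_active_users(actions):
    """Fraction of users with at least one action: every user counts equally."""
    return sum(1 for count in actions.values() if count > 0) / len(actions)


# Hypothetical per-user action counts: one heavy user and nine light users per arm.
control = {"heavy": 100, **{f"light_{i}": 1 for i in range(9)}}
treatment = {"heavy": 150,
             **{f"light_{i}": 1 for i in range(4)},
             **{f"light_{i}": 0 for i in range(4, 9)}}

print(actions_per_user(control), actions_per_user(treatment))
print(share_of_active_users(control), share_of_active_users(treatment))
```

Here the treatment looks like a win on average actions per user, yet half of the users became inactive, which the equally weighted metric exposes.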

Group Identification. Regardless of the motivation to detect heterogeneous treatment effects, there are two intuitive approaches to identify groups with different results. The first begins with a single group and sequentially splits the sample into subgroups that reacted similarly to the intervention but differently from everyone else (Athey & Imbens, 2016; Wager & Athey, 2018). The second begins by grouping everyone into small buckets based on their characteristics and then combines buckets that responded similarly to the intervention until there are a reasonable number of distinct groups such that, within the groups, the units responded similarly, but across groups, there are significant differences in the effect (McFowland et al., 2021; Saint-Jacques et al., 2020). Developing new methods that scale to extremely large data sets is an important problem.
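The second, bottom-up approach can be caricatured in a few lines of Python. The sketch assumes that a treatment effect has already been estimated for each small bucket and greedily merges buckets whose estimates lie within a tolerance of each other. It deliberately ignores statistical uncertainty and is not the procedure of McFowland et al. (2021) or Saint-Jacques et al. (2020); the function name, the tolerance, and the bucket labels are hypothetical.

```python
def merge_similar_buckets(bucket_effects, tolerance):
    """Group buckets whose estimated effects are within `tolerance` of the group's smallest effect.

    bucket_effects: {bucket name: estimated treatment effect}
    Returns a list of groups, each a list of bucket names, ordered by effect.
    """
    groups = []
    anchor = None  # smallest effect in the group currently being built
    for name, estimate in sorted(bucket_effects.items(), key=lambda item: item[1]):
        if anchor is not None and estimate - anchor <= tolerance:
            groups[-1].append(name)
        else:
            groups.append([name])
            anchor = estimate
    return groups


# Hypothetical per-bucket effect estimates.
bucket_effects = {"android_us": 0.10, "android_eu": 0.15, "ios_us": 1.00, "ios_eu": 1.05}
groups = merge_similar_buckets(bucket_effects, tolerance=0.2)
print(groups)
```

A production method would replace the fixed tolerance with a statistical test or regularized criterion so that buckets are merged only when their difference is not distinguishable from noise.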

Note that even though these methods allow us to detect differences, they are usually symptoms that rarely help explain the cause (Gupta et al., 2019). For example, differences in average treatment effect across geographies may be caused by differences in engagement, language, content, income levels, or infrastructure. Thus, further research and investigation are needed to understand the causes that manifest as heterogeneous treatment effects and their impact on fairness.

3. Conclusion: How to Scale Experimentation Culture

The previous sections make a case for an organization to adopt an experimentation culture by explaining the benefits to the organization and customers. Further, we have shared the critical operational and methodology challenges that leadership and data scientists will likely face and general recommendations on dealing with them. Getting to an experimentation culture at scale in an organization is a long and valuable journey. This section briefly shares a roadmap for growing the experimentation culture in an organization and getting maximum value from experimentation.

**Figure 7. Increasing the value of experimentation through quality, scale, and maturity.**

The value of experimentation culture at scale is greater than the sum of its parts. Value realization depends on three major factors, depicted in Figure 7: scale, quality, and maturity. Before being evaluated, each idea has a small chance of success. Testing more ideas quickly at scale ensures that we find more ideas that will successfully extract value. The quality of experiments ensures that we have the highest chance of detecting actionable insights from an experiment demonstrating the value of an idea. Maturity is the capacity to run quality experiments at scale (Fabijan et al., 2017).

As an organization evolves to create an experimentation culture at scale, it is vital to balance these three factors to get the most value and turn the flywheel; as the flywheel starts turning and more experiments deliver insights, the value of experimentation becomes clearer, leading to more investment in the platform, people, and process (Fabijan et al., 2020). There are three main phases in this evolution. First is the Zero-to-One problem: establish the fundamentals of running the first experiment. Next is the One-to-Many problem: creating the ability to run multiple experiments regularly. The final stage is the Many-to-Everything, where experimentation-enabled product development is the default in the organization’s culture. Table 2 summarizes the key results each component of experimentation operations (described in Figure 4) should focus on in each stage. Throughout the transformation, most companies will go through four stages (Gupta, Kohavi, et al., 2018). The first stage is characterized by hubris: we know everything about our product and do not need to run an experiment. Second, we transition to doubt as insights through experimentation start challenging our conventional understanding. Third, as these insights grow to challenge the conventional wisdom, we fall into a Semmelweis reflex (reflex-like rejection of new knowledge because it contradicts entrenched norms). It is essential when navigating these stages to use empathy to keep the product teams engaged. In the final stage, the transformation is complete, and experimentation is used in product development for continuous improvement of ideas and implementation rather than being overconfident in the concept before it is developed and deployed.

Table 2. Experimentation scaling guide.

	Crawl: Zero-to-One Problem (fundamentals of running an experiment)	Walk: One-to-Many Problem (capability for running many experiments regularly)	Run: Many-to-Everything Problem (experimentation-enabled product development)
Data Platform	A few key product metrics and feature success metrics are defined, standardized, and automatically computed.	Standard metrics are computed automatically (including guardrail, quality, and feature success metrics).	Product-wide overall evaluation criterion is used for making decisions, and metrics cover most of the product features, usage, and segments.
Experimentation Platform	First null experiment (where both the treatment and control group receive the same version) and first few standard experiments are run.	Functionality to run multiple experiments. Automatic diagnostic test for experimentation quality. Self-service option for basic experiments.	Automation and monitoring provide the ability to test every change to the product. Full self-service is enabled.
People and Process	Every experiment is well documented and reviewed. The core team responsible for experimentation operations is formed. The feature teams to run the first experiments are identified and onboarded.	Standard process in place for experiment setup, review, metric authoring, and experimentation evolution. Expertise spreads across teams and many new feature teams are onboarded. Some teams begin piloting advanced topics like good metric authoring.	Decentralized process for experimentation review, metric definition, and decision making. Experiments are conducted on all product changes, and learnings from many experiments steer future direction.

Acknowledgments

We thank Paul Hamilton, Ron Kohavi, Daniel Yue, and Scott MacMillan for providing feedback on this article. We also thank all members of the experimentation community for their active participation through the publication of blogs, articles, and papers that we have referenced in this article. Finally, we are immensely grateful to the HDSR reviewers and editorial team for their comments on earlier versions of the article.

Disclosure Statement

The views and opinions expressed in this article are those of the authors only and do not represent the views, policies, and opinions of any institution or agency, any of their affiliates or employees, or any individuals recognized in the acknowledgments.

References

Aronow, P. M., & Samii, C. (2017). Estimating average causal effects under general interference, with application to a social network experiment. Annals of Applied Statistics, 11(4), 1912–1947. https://doi.org/10.1214/16-AOAS1005

Athey, S., Chetty, R., Imbens, G. W., & Kang, H. (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Working Paper 26463. NBER. https://doi.org/10.3386/W26463

Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. PNAS, 113(27), 7353–7360. https://doi.org/10.1073/pnas.1510489113

Bakshy, E., Eckles, D., & Bernstein, M. S. (2014). Designing and deploying online field experiments. In C. W. Chung (Ed.), WWW '14: Proceedings of the International World Wide Web Conference (pp. 283–292). ACM. https://doi.org/10.1145/2566486.2567967

Baron, J. (2012). Evolution of clinical research: A history before and beyond James Lind. Perspectives in Clinical Research, 3(4), 149. https://doi.org/10.4103/2229-3485.103599

Basse, G. W., Ding, Y., & Toulis, P. (2022). Minimax designs for causal effects in temporal experiments with treatment habituation. Biometrika, Article asac024. https://doi.org/10.1093/biomet/asac024

Basse, G. W., Soufiani, H. A., & Lambert, D. (2016). Randomization and the pernicious effects of limited budgets on auction experiments. PMLR, 51, 1412–1420. https://proceedings.mlr.press/v51/basse16b.html

Bojinov, I., Chen, A., & Liu, M. (2020). The importance of being causal. Harvard Data Science Review, 2(3). https://doi.org/10.1162/99608f92.3b87b6b0

Bojinov, I. I., Pillai, N. S., & Rubin, D. B. (2020). Diagnosing missing always at random in multivariate data. Biometrika, 107(1), 246–253. https://doi.org/10.1093/biomet/asz061

Bojinov, I., & Lakhani, K. (2020, October). Experimentation at Yelp. Harvard Business School Case 621–064.

Bojinov, I., Rambachan, A., & Shephard, N. (2021). Panel experiments and dynamic causal effects: A finite population perspective. Quantitative Economics, 12(4), 1171–1196. https://doi.org/10.3982/QE1744

Bojinov, I., Saint-Jacques, G., & Tingley, M. (2020, March–April). Avoid the pitfalls of A/B testing. Harvard Business Review, 48–53. https://hbr.org/2020/03/avoid-the-pitfalls-of-a-b-testing

Bojinov, I., & Shephard, N. (2019). Time series experiments and causal estimands: Exact randomization tests and trading. Journal of the American Statistical Association, 114(528), 1665–1682. https://doi.org/10.1080/01621459.2018.1527225

Bojinov, I., Simchi-Levi, D., & Zhao, J. (2020). Design and analysis of switchback experiments. SSRN. https://doi.org/10.2139/SSRN.3684168

Boucher, C., Knoblich, U., Miller, D., Patotski, S., Saied, A., & Venkateshaiah, V. (2019). Automated metrics calculation in a dynamic heterogeneous environment. arXiv. https://doi.org/10.48550/arXiv.1912.00913

Box, J. F. (1980). R. A. Fisher and the design of experiments, 1922–1926. The American Statistician, 34(1), 1–7. https://doi.org/10.1080/00031305.1980.10482701

Brody, T. (2016). Clinical trials: Study design, endpoints and biomarkers, drug safety, and FDA and ICH guidelines (2nd ed.). Academic Press.

Chen, N., Liu, M., & Xu, Y. (2018). Automatic detection and diagnosis of biased online experiments. arXiv. https://doi.org/10.48550/arXiv.1808.00114

Coey, D., & Bailey, M. (2016). People and cookies: Imperfect treatment assignment in online experiments. In J. Bourdeau & J. A. Hendler (Eds.), Proceedings of the 25th International Conference on World Wide Web - WWW ’16 (pp. 1103–1111). ACM. https://doi.org/10.1145/2872427.2882984

Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., & Huq, A. (2017). Algorithmic decision making and the cost of fairness. In S. Matwin, S. Yu, & F. Farooq (Eds.), Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Part F129685 (pp. 797–806). https://doi.org/10.1145/3097983.3098095

Deb, A., Bhattacharya, S., Gu, J., Zhou, T., Feng, E., & Liu, M. (2018, August 28). Under the hood of Uber’s experimentation platform. Uber. https://eng.uber.com/xp/

Deng, A. (2015). Objective Bayesian two sample hypothesis testing for online controlled experiments. In A. Gangemi & S. Leonardi (Eds.), Proceedings of the 24th International Conference on World Wide Web - WWW ’15 Companion (pp. 923–928). ACM. https://doi.org/10.1145/2740908.2742563

Deng, A., Dmitriev, P., Gupta, S., Kohavi, R., Raff, P., & Vermeer, L. (2017). A/B testing at scale: Accelerating software innovation. In N. Kando, T. Sekai, & H. Joho (Eds.), SIGIR 2017 - Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1395–1397). ACM. https://doi.org/10.1145/3077136.3082060

Deng, A., Lu, J., & Chen, S. (2016). Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing. In Proceedings - 3rd IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016 (pp. 243–252). IEEE. https://doi.org/10.1109/DSAA.2016.33

Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In S. Leonardi & A. Panconesi (Eds.), Proceedings of the Sixth ACM International Conference on Web Search and Data Mining - WSDM ’13 (pp. 123–132). ACM. https://doi.org/10.1145/2433396.2433413

Ding, B., Nori, H., Li, P., & Allen, J. (2018). Comparing population means under local differential privacy: With significance and power. arXiv. http://arxiv.org/abs/1803.09027

Dmitriev, P., Frasca, B., Gupta, S., Kohavi, R., & Vaz, G. (2016). Pitfalls of long-term online controlled experiments. In 2016 IEEE International Conference on Big Data (Big Data) (pp. 1367–1376). IEEE. https://doi.org/10.1109/BigData.2016.7840744

Dmitriev, P., Gupta, S., Kim, D. W., & Vaz, G. (2017). A dirty dozen. In S. Matwin, S. Yu, & F. Farooq (Eds.), Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1427–1436). ACM. https://doi.org/10.1145/3097983.3098024

Dmitriev, P., & Wu, X. (2016). Measuring metrics. In S. Mukhopadhyay & C. Zhai (Eds.), Proceedings of the 25th ACM International on Conference on Information and Knowledge Management - CIKM ’16 (pp. 429–437). ACM. https://doi.org/10.1145/2983323.2983356

Dweck, C. S. (2006). Mindset: The psychology of success. Random House.

Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., & Naor, M. (2006). Our data, ourselves: Privacy via distributed noise generation. In S. Vaudenay (Ed.), Lecture notes in computer science: Vol 4004. Advances in cryptology - EUROCRYPT 2006 (pp. 486–503). Springer. https://doi.org/10.1007/11761679_29

Eckles, D., Karrer, B., & Ugander, J. (2017). Design and analysis of experiments in networks: Reducing bias from interference. Journal of Causal Inference, 5(1). https://doi.org/10.1515/jci-2015-0021

Fabijan, A., Arai, B., Dmitriev, P., & Vermeer, L. (2020, December 28). It takes a flywheel to fly: Kickstarting and keeping the A/B testing momentum. Microsoft Research. https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/it-takes-a-flywheel-to-fly-kickstarting-and-keeping-the-a-b-testing-momentum/

Fabijan, A., Dmitriev, P., McFarland, C., Vermeer, L., Holmström Olsson, H., & Bosch, J. (2018). Experimentation growth: Evolving trustworthy A/B testing capabilities in online software companies. Journal of Software: Evolution and Process, 30(12), Article e2113. https://doi.org/10.1002/smr.2113

Fabijan, A., Dmitriev, P., Olsson, H. H., & Bosch, J. (2017). The evolution of continuous experimentation in software product development: From data to a data-driven organization at scale. In Proceedings - 2017 IEEE/ACM 39th International Conference on Software Engineering, ICSE 2017 (pp. 770–780). IEEE. https://doi.org/10.1109/ICSE.2017.76

Fabijan, A., Gupchup, J., Gupta, S., Omhover, J., Qin, W., Vermeer, L., & Dmitriev, P. (2019). Diagnosing sample ratio mismatch in online controlled experiments. In A. Teredesai & V. Kumar (Eds.), Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD ’19 (pp. 2156–2164). ACM. https://doi.org/10.1145/3292500.3330722

Fleiss, J. L. (1999). The design and analysis of clinical experiments. John Wiley & Sons. https://doi.org/10.1002/9781118032923

Frangakis, C. E., & Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics, 58(1), 21–29. https://doi.org/10.1111/J.0006-341X.2002.00021.X

Friedman, L., Furberg, C., DeMets, D., Reboussin, D., & Granger, C. (2015). Fundamentals of clinical trials. Springer. https://doi.org/10.1007/978-3-319-18539-2

Gupta, S., Kohavi, R., Deng, A., & Raff, P. (2018). A/B testing at scale tutorial. Strata 20. https://exp-platform.com/2018StrataABtutorial/

Gupta, S., Kohavi, R., Tang, D., & Xu, Y. (2019). Top challenges from the first Practical Online Controlled Experiments Summit. ACM SIGKDD Explorations Newsletter, 21(1), 20–35. https://doi.org/10.1145/3331651.3331655

Gupta, S., & Machmouchi, W. (2022, April 6). STEDII properties of a good metric. Microsoft Research. https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/stedii-properties-of-a-good-metric/

Gupta, S., Ulanova, L., Bhardwaj, S., Dmitriev, P., Raff, P., & Fabijan, A. (2018). The anatomy of a large-scale experimentation platform. In 2018 IEEE International Conference on Software Architecture (ICSA) (pp. 1–109). IEEE. https://doi.org/10.1109/ICSA.2018.00009

Hohnhold, H., O’Brien, D., & Tang, D. (2015). Focusing on the long-term: It’s good for users and business. In L. Cao & C. Zhang (Eds.), Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1849–1858). ACM. https://doi.org/10.1145/2783258.2788583

Hudgens, M. G., & Halloran, M. E. (2008). Toward causal inference with interference. Journal of the American Statistical Association, 103(482), Article 832. https://doi.org/10.1198/016214508000000292

Hunt, V., Layton, D., & Prince, S. (2015, January 1). Diversity matters. McKinsey & Company, 15–29. https://www.mckinsey.com/business-functions/people-and-organizational-performance/our-insights/why-diversity-matters

Imbens, G. W., & Rubin, D. B. (2015). Causal inference: For statistics, social, and biomedical sciences an introduction. Cambridge University Press. https://doi.org/10.1017/CBO9781139025751

Johari, R., Pekelis, L., & Walsh, D. J. (2015). Always valid inference: Bringing sequential analysis to A/B testing. arXiv. https://doi.org/10.48550/arXiv.1512.04922

Kharitonov, E., Drutsa, A., & Serdyukov, P. (2017). Learning sensitive combinations of A/B test metrics. In M. de Rijke & M. Shokouhi (Eds.), Proceedings of the Tenth ACM International Conference on Web Search and Data Mining - WSDM ’17 (pp. 651–659). ACM. https://doi.org/10.1145/3018661.3018708

Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., & Pohlmann, N. (2013). Online controlled experiments at large scale. In R. Ghani, T. E. Senator, & P. Bradley (Eds.), Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’13 (pp. 1168–1176). ACM. https://doi.org/10.1145/2487575.2488217

Kohavi, R., Deng, A., Longbotham, R., & Xu, Y. (2014). Seven rules of thumb for web site experimenters. In S. Macskassy & C. Perlich (Eds.), Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’14 (pp. 1857–1866). ACM. https://doi.org/10.1145/2623330.2623341

Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy online controlled experiments: A practical guide to A/B testing. Cambridge University Press.

Kohavi, R., & Thomke, S. (2017). The surprising power of online experiments. Harvard Business Review, 95(5), 74–82. https://hbr.org/2017/09/the-surprising-power-of-online-experiments

Liu, M., Mao, J., & Kang, K. (2020). Trustworthy online marketplace experimentation with budget-split design. arXiv. https://doi.org/10.48550/arXiv.2012.08724

Liu, S., Fabijan, A., Furchtgott, M., Gupta, S., Janowski, P., Qin, W., & Dmitriev, P. (2019). Enterprise-level controlled experiments at scale: Challenges and solutions. In Proceedings - 45th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2019 (pp. 29–37). IEEE. https://doi.org/10.1109/SEAA.2019.00013

Mao, J., & Bojinov, I. (2021). Quantifying the value of iterative experimentation. arXiv. https://doi.org/10.48550/arXiv.2111.02334

McFowland III, E., Gangarapu, S., Bapna, R., & Sun, T. (2021). A prescriptive analytics framework for optimal policy deployment using heterogeneous treatment effects. MIS Quarterly, 45(4), 1807–1832. https://doi.org/10.25300/MISQ/2021/15684

McKinley, D. (2015). Data-driven products: Correcting for human nature in product development. O’Reilly Media.

Phillips, K. W. (2014). How diversity works. Scientific American, 311(4), 42–47. https://doi.org/10.1038/scientificamerican1014-42

Piantadosi, S. (2017). Clinical trials: A methodologic perspective. John Wiley & Sons.

Pocock, S. J. (2013). Clinical trials. John Wiley & Sons. https://doi.org/10.1002/9781118793916

Ries, E. (2011). The lean startup: How today’s entrepreneurs use continuous innovation to create radically successful businesses. Crown Business.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. https://doi.org/10.2307/2335739

Sadeghi, S., Gupta, S., Gramatovici, S., Lu, J., Ai, H., & Zhang, R. (2021). Novelty and primacy: A long-term estimator for online experiments. arXiv. https://doi.org/10.48550/arXiv.2102.12893

Saint-Jacques, G., Sepehri, A., Li, N., & Perisic, I. (2020). Fairness through experimentation: Inequality in A/B testing as an approach to responsible design. arXiv. https://doi.org/10.48550/arXiv.2002.05819

Saint-Jacques, G., Varshney, M., Simpson, J., & Xu, Y. (2019). Using ego-clusters to measure network effects at LinkedIn. arXiv. https://doi.org/10.48550/arXiv.1903.08755

Saxena, N. A., Radanovic, G., Huang, K., Parkes, D. C., DeFilippis, E., & Liu, Y. (2019). How do fairness definitions fare? Examining public attitudes towards algorithmic definitions of fairness. In V. Conitzer, G. Hadfield, & S. Vallor (Eds.), AIES 2019 - Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (pp. 99–106). ACM. https://doi.org/10.1145/3306618.3314248

Sepehri, A., & DiCiccio, C. (2020). Interpretable assessment of fairness during model evaluation. arXiv. https://doi.org/10.48550/arXiv.2010.13782

Shi, X., Dmitriev, P., Gupta, S., & Fu, X. (2019). Challenges, best practices and pitfalls in evaluating results of online controlled experiments. In A. Teredesai & V. Kumar (Eds.), Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD ’19 (pp. 3189–3190). ACM. https://doi.org/10.1145/3292500.3332297

Sommer, A., & Zeger, S. L. (1991). On estimating efficacy from clinical trials. Statistics in Medicine, 10(1), 45–52. https://doi.org/10.1002/sim.4780100110

Tang, D., Agarwal, A., O’Brien, D., & Meyer, M. (2010). Overlapping experiment infrastructure. In B. Rao & B. Krishnapuram (Eds.), Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’10 (pp. 17–26). ACM. https://doi.org/10.1145/1835804.1835810

Thomke, S. (2020). Experimentation works: The surprising power of business experiments. Harvard Business Review Press.

Townsend, J. (2020, June 26). A/B testing and Covid-19: Data-driven decisions in times of uncertainty. Microsoft. https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/a-b-testing-and-covid-19-data-driven-decisions-in-times-of-uncertainty/

Ugander, J., Karrer, B., Backstrom, L., & Kleinberg, J. (2013). Graph cluster randomization: Network exposure to multiple universes. In R. Ghani, T. E. Senator, & P. Bradley (Eds.), Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’13 (pp. 329–337). ACM. https://doi.org/10.1145/2487575.2487695

Urban, S., Sreenivasan, R., & Kannan, V. (2016). It is all A/Bout testing: The Netflix experimentation platform. Medium. https://medium.com/netflix-techblog/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15

Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242. https://doi.org/10.1080/01621459.2017.1319839

Wang, Y., Gupta, S., Lu, J., Mahmoudzadeh, A., & Liu, S. (2019). On heavy-user bias in A/B testing. In W. Zhu, D. Tao, & X. Cheng (Eds.), CIKM ’19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management (pp. 2425–2428). ACM. https://doi.org/10.1145/3357384.3358143

Xia, T., Bhardwaj, S., Dmitriev, P., & Fabijan, A. (2019). Safe velocity: A practical guide to software deployment at scale using controlled rollout. In Proceedings - 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP 2019 (pp. 11–20). IEEE. https://doi.org/10.1109/ICSE-SEIP.2019.00010

Xie, H., & Aurisset, J. (2016). Improving the sensitivity of online controlled experiments. In B. Krishnapuram & M. Shah (Eds.), Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16 (pp. 645–654). ACM. https://doi.org/10.1145/2939672.2939733

Xie, Y., Chen, N., & Shi, X. (2018). False discovery rate controlled heterogeneous treatment effect detection for online controlled experiments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD ’18 (pp. 876–885). ACM. https://doi.org/10.1145/3219819.3219860

Xu, Y., Chen, N., Fernandez, A., Sinno, O., & Bhasin, A. (2015). From infrastructure to culture: A/B testing challenges in large scale social networks. In L. Cao & C. Zhang (Eds.), Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2227–2236). ACM. https://doi.org/10.1145/2783258.2788602

Xu, Y., Duan, W., & Huang, S. (2018). SQR: Balancing speed, quality and risk in online experiments. In Y. Guo & F. Farooq (Eds.), Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD ’18 (pp. 895–904). ACM. https://doi.org/10.1145/3219819.3219875

©2022 Iavor Bojinov and Somit Gupta. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.

Online Experimentation: Benefits, Operational and Methodological Challenges, and Scaling Guide