
Analytics Challenges and Their Challenges

Published on May 24, 2024

Professor Donoho (2024) has written a comprehensive and insightful article filled with practicality and optimism for the future of data science. He does a magnificent job of ‘connecting many dots’ across the many dimensions of computing, data science, and all things ‘data.’ I would like to highlight some aspects of the article that are particularly appealing to me, while also raising some real-world difficulties with his thesis.

If I had to summarize this article for someone without a background in modern trends in data science (or without the perseverance to work through the article's vast array of interesting examples and emerging trends), it would be this:

Public analytical competitions are a mechanism for advancing the science of analyzing data and making predictions. Furthermore, such competitions need common data sets to work from and shared code for the analytical method(s). If all of this is done in the public forum, there will be an acceleration of innovation, since using common data provides a collective framework, sharing code allows for immediate modifications to test for improved predictions, and the competition is an objective way to ‘score’ which methods provide the best predictive results.

The author describes this as “frictionless reproducibility” with three key elements aligned to the summarized thesis above:

[FR-1: Data] Datafication of everything, with a culture of research data sharing;

[FR-2: Re-execution] Research code sharing including the ability to exactly re-execute the same complete workflow by different researchers;

[FR-3: Challenges] Adopting challenge problems as a new paradigm powering methodological research.

If the author were to agree with this distillation of his article, then I am firmly aligned with this thinking, though there are some notable hurdles for such an environment, and I harbor some skepticism about its broad applicability.

Donoho provides many data science examples of these elements—in whole or in part—and how their use has led to rapid and substantial progress in the application of machine learning (ML) in various areas. Some notable examples are machine vision, natural language processing, and protein folding. As a career statistician in the pharmaceutical industry who has ‘dipped my toes in the data science waters,’ I will use my comments to focus on my sweet spot—the medical/health care/drug development arena.

Frictionless Reproducibility and Statistical Science

The notion of using online competitions for scientific and computing problems is a fairly recent phenomenon. One very early example (the earliest?) is the launch in 2001 of InnoCentive (now known as Wazoku), as cited by the author, which was designed to accept a very broad range of challenges from science to manufacturing. With the launch of Kaggle in 2010, this challenge approach to advancing algorithms to solve societal problems has become more mainstream in the data science community. This approach has not been pursued in the statistical community to any great extent. I am thankful to the author for citing the research I have done with my former Lilly colleagues that we hosted on the InnoCentive platform for finding better statistical methods for subgroup identification (Ruberg, Zhang, et al., 2023). By subgroup identification, I mean searching across the many baseline covariates that describe a patient in the hope of finding covariates that predict exceptional response, be it a very good efficacy response, a very bad efficacy response, or an adverse event.

This is a very important problem in drug development and clinical research in general—perhaps even the Holy Grail. If we could measure the right set of baseline covariates on a patient and predict whether they will respond to a treatment or not, or have a serious adverse event or not, the medical community would be well on its way to so-called personalized medicine. Of course, there are treatments for which we know that a certain genotype or biomarker predisposes a patient to better or worse response, but that is the exception rather than the rule.

There are several points to make from this InnoCentive challenge. Perhaps the most notable distinction from other data science challenges is that a large number of clinical data sets were generated from a model with known parameters. Thus, ‘ground truth’ was known and submitted algorithms for predicting subgroups of exceptional responders using baseline covariates could be compared and scored versus the true subgroup defined by the generative model.

The second notable distinction was the generation of 1,200 data sets, which allowed for an assessment of reproducibility. An algorithmic solution might get the right predictive subgroup in one data set, but that same algorithm may not be generalizable to other data sets with different generative models, even generative models in the same class with a simple change in one parameter. That is, a data scientist could use a highly parameterized ML model to fit (overfit?) a single data set very well (i.e., predict precisely which patients are in the subgroup of exceptional responders and which patients are not), but that model could perform poorly on other data sets provided in this challenge.

Lastly, the 1,200 generated data sets also included data sets for which there was no subgroup; that is, all patients came from the same generative model. Including such data sets allowed us to identify submitted solutions that were overfitting the data or making false positive subgroup predictions (Type I errors, if you will).
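The design described above (data sets generated from a known model, some with a true subgroup and some without, scored against ground truth) can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than the generative models or scoring used in the actual challenge: the subgroup rule (first covariate above zero), the effect size, the sample sizes, and the two toy "submissions" are assumptions chosen only to show the idea.

```python
import random


def generate_dataset(n_patients, n_covariates, has_subgroup, rng):
    """Simulate one trial from a known model.

    truth[i] is True when patient i belongs to the true subgroup of
    exceptional responders. The rule x[0] > 0 and the effect size of 1.0
    are illustrative assumptions, not the challenge's actual model.
    """
    rows, truth = [], []
    for _ in range(n_patients):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_covariates)]
        treated = rng.random() < 0.5
        in_subgroup = has_subgroup and x[0] > 0.0
        effect = 1.0 if (treated and in_subgroup) else 0.0
        y = effect + rng.gauss(0.0, 1.0)
        rows.append((x, treated, y))
        truth.append(in_subgroup)
    return rows, truth


def score(predicted, truth):
    """Fraction of patients whose subgroup membership is classified correctly."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def oracle_rule(rows):
    """Toy submission that happens to know the true subgroup rule."""
    return [x[0] > 0.0 for x, _, _ in rows]


def always_subgroup_rule(rows):
    """Toy submission that places every patient in a subgroup."""
    return [True for _ in rows]


rng = random.Random(0)
signal_rows, signal_truth = generate_dataset(200, 10, True, rng)
null_rows, null_truth = generate_dataset(200, 10, False, rng)

print("oracle on signal data:", score(oracle_rule(signal_rows), signal_truth))
print("always-subgroup on null data:", score(always_subgroup_rule(null_rows), null_truth))
```

Because the null data set contains no true subgroup, every patient that a submission assigns to a subgroup there is a false positive, which is what makes such data sets useful for detecting overfitting.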

It is worth noting that the vast majority of the 700+ submissions to our InnoCentive challenge presented models that were either unevaluable or no better than flipping a coin at identifying whether a patient was in the true subgroup. Such submissions appeared to suffer from gross overfitting or a lack of attention to the multiplicity problem: with many baseline covariates to choose from, the chance of a spurious correlation with the clinical outcome is quite large. Misidentification of patients can have dire consequences, and this challenge points to the importance of evaluating various methodologies systematically and objectively.
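The multiplicity problem can be made concrete with a small calculation. As a simplifying assumption that is not taken from the challenge itself, suppose each baseline covariate is tested independently at a nominal significance level alpha and none is truly related to the outcome. The chance of at least one spurious finding is then 1 - (1 - alpha)^k for k covariates. The function name and the values of k below are hypothetical.

```python
def familywise_error_rate(n_covariates, alpha=0.05):
    """Probability of at least one false positive across independent null tests.

    Assumes the tests are independent and each is run at level alpha;
    correlated covariates would change the exact value.
    """
    return 1.0 - (1.0 - alpha) ** n_covariates


for k in (1, 10, 50):
    print(k, "covariates:", round(familywise_error_rate(k), 3))
```

Under these assumptions, the probability of a spurious finding grows quickly with the number of covariates screened, which is why an uncorrected search can perform no better than chance on new data.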

I believe this is playing out on a larger scale in clinical medicine when it comes to clinical predictive algorithms, that is, ML/AI algorithms used to diagnose a patient’s condition or predict whether a patient is at risk of progressing to a disease state (i.e., prognosis). Quite often, data science ‘laboratories’ select specific clinical data sets based on convenience or availability from their clinical collaborators and apply their favorite ML/AI technology to search for ‘digital biomarkers.’ Findings that are published in very notable journals, but are not reproducible, are rampant, with few successful digital medical products advancing into broad clinical use (Rudin et al., 2023; Sperrin et al., 2020; Wynants et al., 2020). In this sense, I do not believe data science to be at a singularity but rather at a fork in the road, with one fork representing business as usual and the other a more rigorous approach to developing and validating algorithms for patient care (Ruberg, Menon, & Demanuele, 2023). In an editorial in The Lancet, the editors noted, “Without a clear framework to differentiate efficacious digital products from commercial opportunism, companies, clinicians, and policy makers will struggle to provide the required level of evidence to realise the potential of digital medicine” (“Is Digital Medicine Different?” 2018). The challenge paradigm is one such framework to develop, test, and improve algorithms for health care.

I highlight these fundamental distinctions from most data science challenges because in some situations, we are merely trying to describe the data and do model-fitting. In Breiman’s (2001) mindset, it does not matter if the model is correct or incorrect (i.e., these terms do not even have meaning); what matters is only whether the model fits and is useful for predictions. In other situations (say medicine, biology, toxicology), there is some underlying truth of nature, some cause-and-effect relationships. In that case it is important, if not imperative, to discover/uncover/describe that state of nature as best we can. I think the data science community could benefit from more extensive and rigorous challenges that use generative models, including control groups and multiple data sets that include both separate realizations of the same generative model and realizations of generative models with different parameterizations.

With that said, I believe that the statistical community has not engaged nearly enough with the notion of challenges as a mechanism to advance statistical methodology, let alone fully engaged in frictionless reproducibility. Perhaps because InnoCentive was born through the efforts of some researchers and entrepreneurs at Lilly, our statistics group took that approach not only for a few external challenges but also for internal challenges in which we posed problems to our statistics department and other quantitative scientists to see who might have the best approach toward solving a problem. The statistics and data science group at Novartis has also embraced the challenge approach for internal methodology development and assessment with some notable successes (Bornkamp et al., 2024; Bretz & Greenhouse, 2023, section 5.5).

On a grander scale, I would like to see the statistics community embrace challenges (FR-3) at a minimum and frictionless reproducibility in general. This idea has been recognized by some members of the statistical community as evidenced by the recent special collection in the Biometrical Journal titled “Towards Neutral Comparison Studies in Methodological Research” (Boulesteix et al., 2024). In the editorial introducing the special collection, the editors state, “Consequently, meaningful comparisons of competing approaches (especially reproducible studies including publicly available code and data) are rarely available and evidence-supported state of the art guidance is largely missing, often resulting in the use of suboptimal methods in practice.” The special collection is a step in rectifying this situation.

Furthermore, there has been a protracted outcry for more transparency of clinical trial data, and the pharmaceutical industry has responded with some solutions such as Vivli and Project Data Sphere, among others. My perspective is that the response has been tepid. My personal experience is that it is a cumbersome process to get the data from such not-for-profit organizations, and there are restrictive requirements for use and further sharing of the data. Commercial aggregators of clinical trial data abound but, of course, are far from the frictionless ideal of free and open data sharing.

In summary, data science has embraced the challenge paradigm and even in some cases full frictionless reproducibility (the author calls this “frictionless exchange”), leading to beneficial outcomes. The medical world, at least in the arena of predictive algorithms for diagnosis and prognosis, has had far less success. Perhaps a more frictionless approach would serve that community well, coupled with the use of generative models so that any proposed algorithm can be judged or scored versus a ground truth. While the statistics profession may lay claim to doing many simulation studies using ground truth as a reference, it is most certainly lagging in adoption of broader frictionless approaches, at least in the drug development or medical realm. There are some emerging signs of embracing the frictionless paradigm (e.g., Biometrical Journal special collection on neutral comparisons), but much more could be done.

Barriers to Frictionless Exchange

I applaud the ideal of the frictionless exchange in which data, code, and results measured or scored in a consistent manner create an open environment for the free exchange of ideas to create better and better solutions for societal problems. While the ideal seems to be playing out in some areas, I am skeptical about the general applicability of the frictionless exchange.

We live in an information society, and more and more value in business and academia is being driven by ideas and intellectual property in general. In my home field of drug development, pharmaceutical companies exist and thrive on patents of new molecular entities, but many argue that the real value is not the molecule, which can be quite simple to make, but the intellectual endeavors that go into describing the properties of that molecule for medical use through extensive preclinical and clinical research. In fact, when a company does a $100-million clinical trial, at the end of the trial, the patients return to their homes, the investigators return to their practices, and all that is left is a pile of data. So, in a very real sense, that data cost $100 million and the efficacy and safety information contained therein is the basis for regulatory approval and marketing. It is understandable that a company wants to guard that data carefully and share it sparingly. Although the costs are less in preclinical or laboratory research, research data can be even more closely guarded because it contains the seeds of current and future research that can lead to patented new molecular entities—the lifeblood of pharmaceutical companies.

An interesting example that the author uses as a success story in the pharmaceutical industry is the exceptionally rapid development of COVID-19 vaccines, a feat that was on par with the building of the Hoover Dam or landing a man on the moon, only done in a fraction of the time. The author notes, “Key enablers of this rapidity were [FR-1] data sharing (of the virus’s RNA sequence) and widespread [FR-2] code sharing (of algorithms that could analyze and translate that sequence in various ways).” Perhaps it is worth noting that FR-3 (challenges) was implicitly in play as there was a very public competition (race) to see who could get to the market first and thereby reap enormous rewards (financial, societal, reputational, scientific). That public competition included each company involved knowing where the other companies stood as they published preliminary clinical trial results, posted their clinical trial designs publicly, and so on. Also, public meetings, FDA panels, and so on openly discussed criteria for success of a vaccine (minimum 30% vaccine efficacy, which is a precisely defined statistical metric based on the clinical trial data). Again, albeit not a formal challenge as described in the article, it was certainly an informal challenge environment that was created through the free market. Note that those who decided not to participate in this open market, notably China and Russia, developed far inferior vaccines.

The global COVID-19 pandemic was an exceptional circumstance. I do not know in detail what other industries face, but I suspect that there are concerns about sharing data under normal circumstances or on a routine basis. Data and information are the new currency of business and even academia. The same might be said about code. Code is the proprietary product of many companies and giving it away for free may not serve the best interest of their investors and employees.

Finally, the author argues that in a frictionless exchange where data, code, and results are open, free, and transparent, rapid advances can be made. But our social constructs are not necessarily aligned with such an environment. When new solutions emerge from the frictionless exchange, who gets the credit? Who gets the financial reward for commercializing a solution? Who gets the promotion? Who gets the next round of research funding? Who gets the prestigious invite to speak at a scientific conference? Who wins the award? I perceive that our present (and perhaps future) social constructs are a significant barrier to widespread adoption of a frictionless exchange. However, I do not say that to be fatalistic. I encourage implementing any parts of FR-1, FR-2, and FR-3 wherever possible (and even in places where it might appear impossible) and stretching our mental and social constructs to embrace more of the ideas that Professor Donoho espouses in his article.

Disclosure Statement

Stephen J. Ruberg has no financial or non-financial disclosures to share for this article.

References


Bornkamp, B., Zaoli, S., Azzarito, M., Martin, R., Müller, C. S., Moloney, C., Capestro, G., Ohlssen, D., & Baillie, M. (2024). Predicting subgroup treatment effects for a new study: Motivations, results and learnings from running a data challenge in a pharmaceutical corporation. Pharmaceutical Statistics. Advance online publication.

Boulesteix, A.-L., Baillie, M., Edelmann, D., Held, L., Morris, T. P., & Sauerbrei, W. (2024). Editorial for the special collection “Towards neutral comparison studies in methodological research.” Biometrical Journal, 66(2), Article 2400031.

Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199–231.

Bretz, F., & Greenhouse, J. B. (2023). The role of statistical thinking in biopharmaceutical research. Statistics in Biopharmaceutical Research, 15(3), 458–467.

Donoho, D. (2024). Data science at the singularity. Harvard Data Science Review, 6(1).

Is digital medicine different? [Editorial]. (2018). The Lancet, 392(10142), 95.

Ruberg, S., Menon, S., & Demanuele, C. (2023). Resolving the credibility crisis: Recommendations for improving predictive algorithms for clinical utility. Harvard Data Science Review, 5(3).

Ruberg, S., Zhang, Y., Showalter, H., & Shen, L. (2023). A platform for comparing subgroup identification methodologies. Biometrical Journal, 66(1), Article 2200164.

Rudin, C., Guo, Z., Ding, C., & Hu, X. (2023, October 11). How good are AI health technologies? We have no idea. STAT+ News.

Sperrin, M., Grant, S. W., & Peek, N. (2020). Prediction models for diagnosis and prognosis in Covid-19. BMJ, 369, Article m1464.

Wynants, L., Van Calster, B., Collins, G. S., Riley, R. D., Heinze, G., Schuit, E., Bonten, M. M. J., Dahly, D. L., Damen, J. A., Debray, T. P. A., de Jong, V. M. T., De Vos, M., Dhiman, P., Haller, M. C., Harhay, M. O., Henckaerts, L., Heus, P., Kammer, M., Kreuzberger, N., … van Smeden, M. (2020). Prediction models for diagnosis and prognosis of covid-19: Systematic review and critical appraisal. BMJ, 369, Article m1328.

©2024 Stephen Ruberg. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
