
Resolving the Credibility Crisis: Recommendations for Improving Predictive Algorithms for Clinical Utility

Published on Oct 27, 2023

Abstract

The promise of artificial intelligence (AI) and machine learning (ML) for improving clinical outcomes has been marked by modest successes, but also many failures. Access to data and to advanced algorithms makes developing AI/ML tools relatively easy—but the real challenge is demonstrating their clinical utility. Clinical predictive algorithms (CPAs)—algorithms for diagnosis or prognosis—are failing to meet their potential for two key reasons. First, when developers assess the quality of CPAs, they tend to rely too heavily on area under the receiver operating characteristic curve (AUC), ignoring the prevalence of the characteristic of interest, which is critical to understanding the positive and negative predictive value of the CPA. We propose a more holistic approach to optimizing the development of a CPA, as has been taken with bioanalytical diagnostic tests and physiologic biomarkers for decades. Second, insufficient emphasis has been placed on performing rigorous clinical trials to quantify the benefits and risks of implementing a CPA. We propose a fit-for-purpose, sequential clinical development approach analogous to development of new medicinal products, biomarkers, and other complex interventions, in some cases overseen by regulatory agencies. With these two recommendations met, the benefits and risks of a CPA will be clearer, resulting in greater credibility, transparency, and utility.

Keywords: clinical predictive algorithms, machine learning, artificial intelligence, clinical outcomes, drug development, biomarker development


1. Introduction

Health care providers are increasingly making diagnoses and prognoses using artificial intelligence (AI) and machine learning (ML) algorithms. For the past 15 years, the use of these clinical predictive algorithms (CPAs) has been widely reviewed and debated in scientific literature (Engelhard et al., 2021; “Is Digital Medicine Different?” 2017; Kappen et al., 2018; Keane & Topol, 2018; Kelly et al., 2019; Landers et al., 2021; Nagendran et al., 2020). Unfortunately, many of these predictive algorithms do not work well. Though there has been progress and some success in medical image processing (Choy et al., 2018; McKinney et al., 2020; Ouyang et al., 2020; Shen et al., 2016), reports of the shortcomings or outright failure of CPAs abound (Christodoulou et al., 2019; DeMasi et al., 2017; Kim et al., 2019; Michelessi et al., 2017; Nagendran et al., 2020; Panch et al., 2019; Pouwels et al., 2016). For example, a recent review of 232 diagnostic or prognostic algorithms developed in the rush of the COVID-19 pandemic deemed none of them fit for clinical use (Sperrin et al., 2020; Wynants et al., 2020). We see two solutions to this problem, both of which rely on time-honored scientific and statistical principles and practices that are too often forgotten or ignored in the modern digital world.

The first solution is to develop, implement, interpret, and optimize CPAs using the same principles that have long governed bioanalytical/laboratory assays and physiologic biomarkers. We argue that CPA developers rely too heavily on area under the receiver operating characteristic curve (AUC) (Ghaderzadeh & Asadi, 2021; Roberts et al., 2021) when evaluating the quality of their algorithms, and that they must consider prevalence (P) of the disease and the resulting positive and negative predictive values (PPV, NPV) to assess the utility and value of the CPA. We also recommend quantifying the value of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) decisions to ascertain the total value (TV) of the CPA.

The second solution is for CPA development to follow a fit-for-purpose phased pathway similar to that of biomarker development, drug development, and complex interventions—from development of the intervention itself through larger confirmatory trials and, ultimately, real-world implementation (“Artificial Intelligence in Health Care,” 2018; Kelly et al., 2019; Park et al., 2020).

Note that this article will not debate the most suitable predictive modeling approach for CPA development—whether it be a traditional statistical method such as logistic regression or any of the variety of ML approaches. The systematic approach we propose for developing CPAs will undoubtedly lead to more useful predictive models, regardless of the chosen analytical method underlying the CPA and the optimization criteria for that model. We also will not directly address fundamental issues regarding the underlying data on which CPAs are built. Most notably, these include the widely reported bias that can occur with CPAs due to inadequate data inputs used to build and ‘validate’ a CPA (Obermeyer et al., 2019; Schwartz et al., 2022), and the emerging recognition of data shift, that is, that the data used to build a CPA may differ from the data on which it is deployed (e.g., selection bias in the training data set), as well as the inevitable drift in data that occurs as medical practice evolves (Dockès et al., 2021; Zhang et al., 2022). However, we believe our proposal for more rigorous clinical trials will help in this regard by involving more clinical centers when testing the deployment of a CPA.

We believe in the potential of CPAs to improve health and reduce costs, and we fear that the current profusion of low-quality CPAs may drive an entire generation of clinicians to lose confidence in these promising tools. While the processes we are recommending may be expensive and time-consuming, we believe they are essential to creating CPAs that clinicians and patients can trust and use appropriately.

2. Optimizing CPAs for Clinical Utility

AUC has long been used as a measure of the performance of a diagnostic test, with higher AUCs indicating better model performance. AUC for a diagnostic test is derived from a graphical display of sensitivity versus (1-specificity) for all possible cut-off values of the CPA output, which is usually some sort of score or probability for having the disease or prognosis of interest (Fawcett, 2006). Figure 1 displays a simple set of data depicting the CPA probability or score for two sets of patients—one set with the diagnosis or prognosis of interest (blue circles) and one set without that diagnosis or prognosis (orange circles)—which is used for illustrative purposes in this Panorama. The plausible cut-off values range from CL, which achieves a sensitivity of 1 while minimizing false positives, to CU, which achieves a specificity of 1 while minimizing false negatives. Moving the cut-off from CL to CU will create a new sensitivity and specificity at each patient observation, resulting in 21 possible (sensitivity, specificity) pairs. When implementing a CPA, a single predictive cut-off value C must be chosen, which in this case produces a sensitivity = 0.75 and specificity = 0.80. Supplementary Table 1, CPA Performance as a Function of the Cut-off, C, displays the full range of sensitivities and specificities for all possible cut-offs in Figure 1.

Figure 1. A schematic illustrating different cut-off values for a hypothetical diagnostic test. The choice of cut-off determines the sensitivity and specificity of the diagnostic test, and we use C for this illustration. Blue circles represent patients who have the characteristic of interest. Orange circles represent patients who do not have the characteristic of interest. TP (true positive), FP (false positive), TN (true negative), and FN (false negative) are based on cut-off C. CL and CU are the smallest and largest logical cut-off values, respectively. That is, CL maximizes the specificity for a sensitivity of 1, and CU maximizes the sensitivity for a specificity of 1.
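For readers who want to reproduce this kind of cut-off sweep, the following is a minimal sketch in Python. The scores are hypothetical placeholders (they are not the data behind Figure 1 or Supplementary Table 1); the point is only to show how each candidate cut-off yields a (sensitivity, specificity) pair and how AUC summarizes the full sweep.

```python
import numpy as np

# Hypothetical CPA scores for illustration only (not the data behind Figure 1);
# higher scores suggest the characteristic of interest is present.
scores_pos = np.array([3, 5, 6, 7, 8, 9, 10, 11, 12, 13])  # patients WITH the characteristic
scores_neg = np.array([1, 2, 3, 4, 4, 5, 6, 7, 8, 10])     # patients WITHOUT the characteristic

def sens_spec(cutoff):
    """Sensitivity and specificity when 'score >= cutoff' is called positive."""
    sensitivity = np.mean(scores_pos >= cutoff)
    specificity = np.mean(scores_neg < cutoff)
    return sensitivity, specificity

# Sweep every observed score as a candidate cut-off.
for c in np.unique(np.concatenate([scores_pos, scores_neg])):
    se, sp = sens_spec(c)
    print(f"cut-off = {c:>2}: sensitivity = {se:.2f}, specificity = {sp:.2f}")

# AUC via the Mann-Whitney interpretation: the probability that a randomly chosen
# positive patient scores higher than a randomly chosen negative patient.
auc = np.mean(scores_pos[:, None] > scores_neg[None, :]) + \
      0.5 * np.mean(scores_pos[:, None] == scores_neg[None, :])
print(f"AUC = {auc:.2f}")
```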

When implementing a CPA, its predictive ability is directly dependent on the prevalence (P) of the disease or prognosis in question. The prevalence determines the positive predictive value (PPV) and negative predictive value (NPV). Sensitivity and specificity are determined by the developer’s algorithm/model and choice of the cut-off value and are essential elements of the construction of the CPA for a patient population of interest. These metrics are inherently retrospective. PPV and NPV depend on prevalence, which is nature’s choice, or at least something beyond the developer’s control, and are inherently prospective. Prevalence and the resulting PPV and NPV are essential for interpreting a CPA result for an individual patient.

These relationships are depicted in Table 1 in the familiar 2 x 2 confusion matrix, where, for illustrative purposes, we have used P = .10. It is a well-known characteristic of diagnostic tests that as prevalence decreases, PPV decreases (i.e., false positives increasingly outnumber true positives). Thus, for rare conditions, it is easier to predict who does not have the characteristic of interest and more difficult to predict who has it. These concepts have also been described in the context of the COVID-19 pandemic (Waller & Levi, 2021) and, as noted earlier, might have helped the developers of the many failed COVID-19 CPAs create better and more useful algorithms for clinical use had they been applied.

Table 1. The confusion matrix for the example data depicted in Figure 1, using the cut-off C and assuming a prevalence of 10%.

| Diagnostic Test Result | Truth: Positive (Prevalence = 0.1) | Truth: Negative (1-Prevalence = 0.9) | Predictive Value |
| --- | --- | --- | --- |
| Positive | Sensitivity = 0.75; TP = Se*P*N | (1-Specificity) = 0.20; FP = (1-Sp)*(1-P)*N | PPV = 0.29 |
| Negative | (1-Sensitivity) = 0.25; FN = (1-Se)*P*N | Specificity = 0.80; TN = Sp*(1-P)*N | NPV = 0.97 |

Note. N = total number of cases evaluated; P = prevalence; Se = sensitivity; Sp = specificity; TP = number of true positives; FP = number of false positives; FN = number of false negatives; TN = number of true negatives; PPV = positive predictive value = TP / (TP + FP); NPV = negative predictive value = TN / (TN + FN).
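As a quick check on the arithmetic in Table 1, the sketch below applies Bayes’ rule to turn a (sensitivity, specificity) pair and an assumed prevalence into PPV and NPV. The function name and the second prevalence value are our own illustrative choices.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV via Bayes' rule, as in Table 1."""
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
    return ppv, npv

# Values from the worked example: cut-off C gives Se = 0.75, Sp = 0.80; assume P = 0.10.
ppv, npv = predictive_values(0.75, 0.80, 0.10)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")   # -> PPV = 0.29, NPV = 0.97

# The same CPA deployed where prevalence is higher tells a different story.
print(predictive_values(0.75, 0.80, 0.50))   # -> PPV ≈ 0.79, NPV ≈ 0.76
```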

Understanding prevalence is critical, as it defines PPV and NPV. Different care settings will undoubtedly have different underlying prevalence for the characteristic of interest. The prevalence of patients with sepsis in a major metropolitan hospital, for example, will differ from that in a smaller community hospital, necessitating a different cut-off to maximize the value of implementing the CPA. Because prevalence plays such an important role, we believe PPV and NPV should be routinely reported, and displays of PPV and NPV are particularly revealing regarding the implementation of the CPA (Figure 2). Accuracy, the prevalence-weighted average of sensitivity and specificity (Fawcett, 2006), is plotted in Figure 2 and lies between PPV and NPV for a given cut-off. Because accuracy also depends on prevalence, reporting it without the underlying prevalence, as is sometimes done, is not particularly helpful.

Figure 2. Measures of diagnostic or prognostic performance for various cut-off values and prevalence of the characteristic of interest. Positive predictive value (PPV), negative predictive value (NPV) and accuracy are derived based on the example data depicted in Figure 1. Integer cut-off values are defined by the patient observations in Figure 1 and in practice would be an explicit probability or score that is an output of the clinical predictive algorithms (CPA). See Supplementary Table 2, CPA Total Value as a Function of Cut-off Values and Prevalence, for details.

For optimizing the utility of a diagnostic test, we believe it is not only necessary to identify the probability of various test outcomes, but also to identify the value and costs of the decisions based on these outcomes and then to incorporate those values and costs into the assessment of a CPA. Thus, the goal should be to optimize

Total Value (TV) = PPV*VTP + (1-PPV)*VFP + NPV*VTN + (1-NPV)*VFN,

where Vi is the value associated with each outcome. The correct weights to use in this utility function are the probabilities related to the predicted patient characteristics/outcomes from the diagnostic test or CPA. Quite often, reference is made to the cost of FP and FN diagnoses, but we argue that the benefit of TP and TN diagnoses must also be considered. Thus, Vi may be a positive value (savings) or negative value (cost) to the health care system where the diagnostic test is employed. This is the general approach taken by the U.S. Preventive Services Task Force, which, for example, recommends routine mammography screening to begin at age 50, but rightly notes, “Women who place a higher value on the potential benefit than the potential harms may choose to begin biennial screening between the ages of 40 and 49 years” (Siu, 2016).

For illustration only, suppose the following values have been defined (ignoring units, but presumably expressed in a relevant metric such as cost, time saved, or improved survival):

VTP = 10; VFP = −15; VTN = 5; VFN = −20.

A VTP = 10 reflects that correctly identifying a patient who truly has the characteristic of interest (the probability of which is the PPV) is valuable, since proper and early identification can lead to appropriate treatments or other interventions for the benefit of the patient. A VFP = −15 reflects the cost of falsely identifying a patient as having the characteristic (with probability 1-PPV) and the unnecessary procedures or harms that may come to such a patient. A VTN = 5 reflects correctly identifying a patient without the characteristic of interest and the benefits of allowing that patient to forego interventions. A VFN = −20 reflects the cost of missing a patient who truly has the characteristic and allowing that patient to proceed untreated, perhaps suffering worsening disease and disability (see Figure 3 and Supplementary Table 2, CPA Total Value as a Function of Cut-off Values and Prevalence, for more results derived from varying the cut-off, C).
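To make the interplay of cut-off, prevalence, and the Vi concrete, here is a minimal sketch that evaluates the Total Value formula from the text for the (sensitivity, specificity) profiles quoted for CL, C, and CU, across several prevalences, and reports which cut-off maximizes TV at each prevalence. The profile values are taken from the text and Supplementary Table 1; the code structure itself is our own illustration.

```python
TV_WEIGHTS = {"TP": 10, "FP": -15, "TN": 5, "FN": -20}  # illustrative values from the text

def total_value(sensitivity, specificity, prevalence, v=TV_WEIGHTS):
    """TV = PPV*VTP + (1-PPV)*VFP + NPV*VTN + (1-NPV)*VFN, as defined in the text."""
    ppv = sensitivity * prevalence / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = specificity * (1 - prevalence) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
    return ppv * v["TP"] + (1 - ppv) * v["FP"] + npv * v["TN"] + (1 - npv) * v["FN"]

# (sensitivity, specificity) quoted in the text for the extreme and middle cut-offs.
cutoff_profiles = {"CL": (1.00, 0.60), "C": (0.75, 0.80), "CU": (0.50, 1.00)}

for prevalence in (0.10, 0.50, 0.75):
    tv = {name: total_value(se, sp, prevalence) for name, (se, sp) in cutoff_profiles.items()}
    best = max(tv, key=tv.get)
    summary = ", ".join(f"{name}: TV = {val:5.1f}" for name, val in tv.items())
    print(f"P = {prevalence:.2f} -> {summary}; best cut-off: {best}")
```

Running this reproduces the pattern described below: the sensitivity = 1 extreme (CL) is most valuable when prevalence is high, while the specificity = 1 extreme (CU) wins when prevalence is low.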

Figure 3. Total value as a function of cut-off values and prevalence. The cut-off values are discrete steps based on the patients depicted in Figure 1. Each cut-off defines a different sensitivity and specificity of the clinical predictive algorithms (CPA). When combined with prevalence, this determines the positive predictive value (PPV) and negative predictive value (NPV) of the CPA as shown in Table 1. Total Value is then computed for each cut-off assuming the values of Vi noted in the text.

Figure 3 illustrates some important considerations that, while specific to the hypothetical data from Figure 1 and the assumed Vi, apply generally to this problem. It is worth noting that many CPAs, and diagnostic tests in general, select cut-off values that balance good sensitivity and good specificity (e.g., sensitivity of 0.85 and specificity of 0.80). However, as our illustration shows, for the same cut-off (C), TV can differ dramatically depending on P. At the lowest cut-off value (CL in Figure 1), there are no FN decisions (sensitivity = 1), and the only incorrect decisions are FP (specificity = 0.60). When prevalence is high (e.g., 0.75), using such an extreme cut-off value can be quite valuable (TV = 12); when prevalence is low (e.g., 0.1), there is a cost to the system (TV = −5). At the other extreme cut-off (CU in Figure 1), the situation is reversed. In fact, when using a cut-off value of 8 (see Supplementary Table 1), sensitivity = 0.85 and specificity = 0.76 (accuracy = 0.80), yet the TV is quite negative for P = 0.1. For P = 0.1, it is only when sensitivity = 0.5 and specificity = 1.0 (i.e., at CU) that the TV is maximized. It is worth noting that in many clinical situations, CPAs are targeted at less common, difficult-to-diagnose conditions, and therefore prevalence is inherently low. It behooves the developer to plot TV as in Figure 3 and select a cut-off that could be most useful for initiating clinical trials to test the CPA.

The values (Vi) for each outcome can be clinical in nature, such as improved survival or reduced complications, or economic in nature, such as reduced costs or resource utilization, or they can incorporate patient preferences, such as quality of life (Moons et al., 1997; Pauker & Kassirer, 1980; Vickers & Elkin, 2006). They are unique to the disease state, the patient population, and the clinical context for which the CPA is being considered. These are difficult quantities to estimate, but clinicians generally make decisions balancing their intuitive understanding of the risks and benefits of their actions. Pauker and Kassirer (1980, p. 1112) note that making these quantities “explicit should be viewed as a strength and not as a weakness of this analytic approach.”

It is not helpful to discuss the omnibus measure of AUC for model performance, the right cut-off probability for creating an alert, the sensitivity associated with that cut-off, or even the PPV associated with a given prevalence without understanding the benefits of a correct diagnosis and the costs of an incorrect one. If the cost of a false negative is very high (e.g., mortality, severe morbidity, lengthy hospital stay) relative to a false positive diagnosis (e.g., unnecessary treatments or procedures), then a hospital system may actually benefit from a higher false positive rate, even at the risk of alert fatigue in a clinical setting. As noted previously, the largest TV can occur at the extremes of sensitivity = 1 or specificity = 1, depending on prevalence (Figure 3). This is why it is imperative to study clinical outcomes under the conditions of use (i.e., within the system workflow) when determining the utility of a CPA.

The Total Value approach taken here is a generalization and extension of some approaches taken in the past (Moons et al., 1997; Pauker & Kassirer, 1980; Vickers & Elkin, 2006). Pauker and Kassirer provide specific examples in renal vasculitis/malignant hypertension and gastric cancer using clinical utilities (similar to what we describe as TV) when deciding to treat a patient, perform additional diagnostic testing, or to leave the patient untreated. Moons et al. (1997) describe such an approach using logistic regression for diagnosing pulmonary embolism, and they note that defining the benefits-costs ratio “could be based on experience of practicing physicians or on general medical knowledge derived from clinical trials and studies of cost-effectiveness” (p. 452). Creating some measure of clinical utility or value can be difficult and may require some assumptions or judgments, but the concept has been available for many decades in the medical decision sciences literature related to diagnostic testing. These concepts do not appear to have found footing among CPA developers, who tend to claim success and utility with a sufficiently high AUC. As Vickers and Elkin (2006) emphasize, “The AUC metric focuses solely on the predictive accuracy of a model. As such, it cannot tell us whether the model is worth using at all or which of 2 or more models is preferable. This is because metrics that concern accuracy do not incorporate information on consequences” (p. 565).

The optimization process we propose is depicted in Figure 4, where the goal is to maximize the TV of the CPA. AUC is only the starting point of the process. We believe full optimization requires both an iterative and sequential approach through the appropriate selection of C for given values of P and Vi. Thus, we recommend that the process of optimizing a CPA follow a fit-for-purpose, sequential path that is akin to the process of developing biomarkers, pharmaceutical products, or other complex interventions. We will explore this concept in detail in the next section.

Figure 4. Schematic for the iterative process of optimizing a clinical predictive algorithm (CPA). AUC = area under the receiver operating characteristic curve; C = cut-off value from the algorithm output for declaring a patient as positive or negative for the diagnostic or prognostic characteristic of interest; P = prevalence; PPV = positive predictive value; NPV = negative predictive value; Vi are the values/costs assigned to a true positive decision, a false positive decision, a true negative decision, and a false negative decision.

3. A Phased Approach for CPA Development

We do not have to reinvent the wheel to make CPAs successful. The key is to recognize that CPAs are clinical interventions that are meant to improve patient outcomes. Thus, we can improve the development, validation, and implementation of CPAs by adopting the processes and principles of other medical interventions. The development of new pharmaceutical treatments, the process for validating predictive biomarkers as outlined by the Early Detection Research Network (EDRN) (Feng & Pepe, 2020; Pepe et al., 2001), and the framework for studying complex interventions (Skivington et al., 2021) all rely on phased approaches that can serve as a template for CPA developers. The procedures established for these interventions generally follow five incremental phases—identification, development, feasibility, evaluation, and implementation—with each phase requiring greater investment in time, rigor, and cost, with the benefit of increased validity, reliability, and generalizability of the final product (Park et al., 2020). In many circumstances, these processes are embedded in a regulatory framework that provides independent assessment and an imprimatur of credibility (“Artificial Intelligence in Health Care,” 2018; DeMasi et al., 2017; Nagendran et al., 2020; Park et al., 2020; Wu et al., 2021). Ultimately, they culminate in a final ‘product’ (e.g., a regulatory drug label, a validated biomarker, a new health care policy) that clearly describes benefits and risks under specific conditions of use.

While clinical research trials for therapeutics, biomarkers, and complex interventions can be lengthy due to the recruitment of patients and extended follow-up times to assess clinical efficacy and safety outcomes, we envision rigorous clinical trials that can be completed more quickly for CPAs due to plentiful patients in hospital settings and generally shorter follow-up times to assess patient outcomes of interest. Table 2 aligns drug development, biomarker development, and complex intervention research with analogous elements of CPA development that would produce credible results commensurate with these other medical interventions. What follows is a brief overview of those five phases for CPAs.

Table 2. A proposed framework for CPA development, with analogies to the development processes of other medical interventions.

| Phase | Pharmaceutical Drug Development | Predictive Biomarker Development (e.g., EDRN) (Pepe et al., 2001) | Complex Intervention Research (Skivington et al., 2021) | Clinical Predictive Algorithms (CPAs) |
| --- | --- | --- | --- | --- |
| Identification | Identify unmet medical need | Identify diagnostic/prognostic need | Identify existing/planned health policy | Identify diagnosis/prognosis need (e.g., automation benefit) |
| | Identify biological pathways/targets | Identify potentially useful biomarkers | Identify data sources to support evaluation | Identify data sources & explore modeling approaches |
| | Create & test candidate molecules in laboratory models | Prioritize lead candidates | Identify contextual factors, outcome measures | Define revised workflow in patient care |
| Development | Select best candidate | Select best bioanalytical assay approach | Develop system map to guide evaluation | Select appropriate data sets and modeling approach |
| | Determine formulation/delivery | Conduct clinical assay validation | Develop program theory to assess uncertainties & outcomes | Develop model on retrospective databases |
| | Perform PK/PD modeling to predict drug action | Classify patients with and without cancer on retrospective databases | Assess/borrow features from other successful interventions | Test algorithm in healthcare setting but outside patient care |
| | Conduct Phase 1 safety study in healthy volunteers | Estimate AUC, TP, false reporting rate | Estimate social benefits | Explore cut-off (C); estimate AUC; identify safety concerns |
| | Estimate best doses for Phase 2; identify safety concerns | | | |
| Feasibility | Conduct multicenter Phase 2 RCT assessing efficacy & safety in patients | Conduct retrospective analysis of cases and controls | Conduct feasibility study, including evaluation criteria | Conduct multicenter CRTs in patient care, including clinical workflow |
| | Create rigorous, specific protocol & SAP | Create well-defined protocol & SAP | Define rigorous investigational plan with relevant outcomes | Define protocol & SAP with surrogate or clinical outcomes |
| | Examine safety signals; explore dose response | Evaluate predictive ability in patient samples | Evaluate cost-benefit | Estimate TV given prevalence by varying cut-off C (see Figure 3) |
| | Characterize ‘the right patient’ for Evaluation phase; update Investigator Brochure | Define ‘screen positive’ rule for Evaluation phase | Refine intervention for Evaluation phase | Fine-tune algorithm for Evaluation phase; draft Model Facts Sheet |
| Evaluation | Conduct AWCT Phase 3 confirmatory trials, using rigorous protocol & SAP | Conduct prospective screening study with defined outcomes and SAP | Generate evidence using pre-specified investigational plan and statistical analysis | Conduct prospective, broad CRTs with clear protocol (workflow) & SAP |
| | Account for global regulatory requirements & local practice | Account for cancer prevalence and adherence to screening | Account for context or confounding factors | Account for local prevalence |
| | Establish benefit-risk | Quantify detection rate and false referral rate | Measure intervention impact on individuals & health system | Establish appropriate clinical outcomes (e.g., survival, AEs), benefit-risk, and TV |
| | Define product label | Recommend process for implementation | Influence policy recommendations & decisions | Finalize Model Facts Sheet |
| Implementation | Marketing to physicians/patients/payers as approved by regulators | Conduct prospective study of screened and unscreened patients | Drive to increase uptake and impact of intervention | Embed CPA in EHS for appropriate setting(s) |
| | Conduct Phase 4 real-world studies | Evaluate real-world clinical outcome (e.g., mortality) | Perform cluster randomized trials or pragmatic clinical trials | Monitor clinical outcomes from implementation of CPA |
| | Evaluate new populations/indications | Evaluate cost of screening and compliance with screening plan | Evaluate intervention transferability to other settings | Conduct studies for utility/TV in other populations or settings |
| | Validate utility/CER; health economic assessment | Explore other screen protocols or treatments for screen-positive patients | Estimate use/uptake; evaluate economic impacts | Perform comparative algorithm research using the TV metric |

Note. EDRN = Early (Cancer) Detection Research Network; CPA = clinical predictive algorithm; PK/PD = pharmacokinetic / pharmacodynamic; AUC = area under the curve; TP = true positives; RCT = randomized controlled trial; CRT = cluster randomized trials; SAP = statistical analysis plan; TV = total value; AWCT = adequate well-controlled trial; AEs = adverse events; EHS = electronic health system; CER = comparative effectiveness research.

3.1. Identification

Regardless of the intervention, the development process starts with identifying an area of need. In pharmaceutical research, that amounts to identifying a disease state that is inadequately treated or for which there is no treatment. For example, Type 2 diabetes has many pharmaceutical treatment options but remains a prevalent disease for which researchers continue to seek better treatments to control the disease and its deleterious downstream consequences. On the other hand, numerous diseases, especially rare genetic diseases, simply have no treatment options. For predictive biomarker research in oncology, the EDRN conducts ongoing research to improve biomarkers and their assays for earlier detection of cancers of many types. Health policy research continues to address areas of poor public health or emerging public health issues that may benefit from new or better health policy interventions.

Similarly, CPA development begins with identifying areas for improvement in medical diagnosis or prognosis, presumably for conditions with the largest health consequences. This includes assessing where the benefits of automation and algorithms can improve patient outcomes and corresponding financial benefits. In this phase, priorities can be established for the most promising workflows in patient care, and a fit-for-purpose strategy for CPA development selected. This is akin to pharmaceutical companies working with regulatory agencies to define a clinical development program—that is, a sequence of clinical trials—such that the totality of evidence at the end of the development program could demonstrate the efficacy and safety of the pharmaceutical treatment for a specific disease and specific patient population. At this stage, it is important to identify data sources and modeling approaches to consider for the diagnostic/prognostic situation.

3.2. Development

The development of any intervention includes an early phase of testing and assessment. In pharmaceutical drug development, this consists of creating candidate molecules in the research lab and identifying those with the best efficacy and safety characteristics in animal models. The best candidate is selected, and initial human testing is done in small-scale Phase 1 trials that are generally done outside the context of clinical care. Similarly, biomarker research usually evaluates existing biological samples or data sources and performs retrospective analysis using newly developed assays. This work allows for the validation of assays while assessing their performance on past data. Analogous features exist in complex intervention research. In each case, this early development phase produces data to inform subsequent refinements and testing of the intervention. In pharmaceutical drug development, it is the selection of appropriate doses and patient populations (e.g., mild versus severe patients) for Phase 2 studies; in biomarker assay validation, it is the selection of appropriate cut-off values to define relevant true positive and false positive rates as well as PPV and NPV; in health policy research, it involves assessing other complex interventions and appropriate features in order to estimate potential social benefits.

In this stage of CPA development, developers would explore and select relevant data sources, with careful consideration for hidden biases or other limitations. For example, a CPA for acute kidney injury was developed using the US Veterans Affairs database (Tomašev et al., 2019) and was, not surprisingly, unsuccessful in its implementation in a hospital in the UK health system (Connell et al., 2019). Furthermore, a widely used algorithm underlying pulse oximeters was created using a sample of patients that lacked racial diversity (Sjoding et al., 2020). This led to racial disparity in the delivery of care for suspected COVID-19 patients where blood oxygen levels play a crucial role in determining care (Fawzy et al., 2022). Pulse oximetry has come under intense scrutiny, including FDA meetings to address the issue and define a regulatory pathway for approval of pulse oximeters for clinical care (U.S. Food and Drug Administration, 2022b).

As with biomarker research, CPA developers would then apply the selected algorithmic approach (i.e., digital assays) retrospectively, often refining the algorithms using various tuning parameters in the underlying models to achieve maximal accuracy or AUC. Next, developers would assess the CPA in a small number of clinical settings (e.g., hospitals), independent of patient care decisions, to ensure its application is safe, both in the context of erroneous decisions (i.e., FP, FN) and human implementation. Researchers could set procedures in place to identify safety issues resulting from CPA use (i.e., any system malfunction such as repeated false alarms or null outputs, or stark deviations of CPA output from standard clinical judgments). Since clinical context matters, various cut-off values, analogous to dose escalation for a medicinal product, may be used to analyze data to learn what might have happened had it been used in patient care. Based on the results of this stage, refinements to the algorithm and the clinical workflow can be established for further investigation.

3.3. Feasibility

In other areas of interventional research, feasibility studies are conducted in the practical setting of interest. In pharmaceuticals, Phase 2 clinical trials evaluate the new treatment in patients with the disease of interest (usually double-blind with a control group) to evaluate dosage, clinical measures of effectiveness, and, of course, safety. In biomarker research, this phase involves evaluating cases and matched control patients to evaluate the predictive ability of the biomarker. In complex interventions involving health policy, feasibility studies explore evaluation criteria or suitable metrics of social outcomes. In all of these, prespecified and rigorous statistical analysis plans (SAPs) are devised to account for missing data, multiplicity of measurements, or other data anomalies that undoubtedly arise in such research. The goal of this stage is to refine the understanding of the intervention and the populations to which it will be applied to optimize the intervention as a precursor to more definitive or confirmatory testing. Importantly, the primary outcome measures in this stage of investigation may be a surrogate endpoint or a clinical outcome, the selection of which can depend on time and cost of the feasibility study.

The same approach for CPAs fits into the cycle of optimization depicted in Figure 4. At this early stage, the primary outcome measure may be a surrogate outcome (e.g., time to antibiotic administration) or a meaningful clinical outcome (e.g., survival or a measure of reduced morbidity). The financial and human costs of misdiagnosis or misuse of the CPA are important aspects of learning and refining its use in subsequent clinical trials. To investigate and quantify the benefits and risks of the CPA on such outcomes, cluster randomized trials (Campbell & Walters, 2014; Murray et al., 2020) may be useful (Textbox 1), in which hospitals are randomized to either implement the new CPA within their workflow or to use their existing workflows and decision-making process. This is also the initial opportunity to identify and remediate any issues related to data shift when assessing the deployment of the CPA in clinical care. Furthermore, rigorous clinical trials include the notion of suitable inclusion/exclusion criteria for the hospitals in a cluster randomized trial. Such hospitals would include community, private, research, rural, and inner-city hospitals. By the nature of using a CPA within the hospital setting, all relevant patients in that hospital will be included in the evaluation of the CPA. Thus, such trials will naturally include a broad and representative population of patients and allow for assessment of bias in various populations.

Textbox 1. Definition of cluster randomized trials.

In cluster (or group) randomized trials (CRTs), clinics, hospitals, or physicians (i.e., clusters), are randomized rather than individual participants, and patient-level outcomes are then measured within each cluster (Campbell & Walters, 2014; Murray et al., 2020). Randomizing at the hospital/physician level, where each hospital/physician is exclusively using one system, reduces the risk of contamination (e.g., participants sharing the same physician may be treated similarly because of the influence of the clinical predictive algorithm system on the physician) (Heagerty & DeLong, 2017). In these studies, inclusion/exclusion criteria are set for both clusters (e.g., hospital size, hospital setting, and geographical location) and patients. The inclusion/exclusion criteria at the cluster level determine the external validity of the study (Pladevall et al., 2008). Such trials require different statistical analysis methods to account for the cluster effects and correlations within clusters (Campbell & Walters, 2014; Murray et al., 2020).
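As a sketch of the kind of analysis such a trial implies, the following simulates a hypothetical CRT (hospital-level randomization, a continuous patient outcome, and a shared hospital effect) and fits a random-intercept mixed model so the treatment effect is judged against between-cluster as well as within-cluster variability. All variable names and effect sizes are invented for illustration; a real CRT analysis would be prespecified in the SAP and might use different methods.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulate a hypothetical CRT: 20 hospitals randomized to a CPA-supported workflow
# versus usual care, with a continuous patient outcome (e.g., length of stay).
rows = []
for hospital in range(20):
    arm = hospital % 2                      # 1 = CPA workflow, 0 = usual care
    hospital_effect = rng.normal(0, 0.5)    # between-hospital variation (the cluster effect)
    for _ in range(rng.integers(30, 60)):   # unequal cluster sizes
        outcome = 6.0 - 0.4 * arm + hospital_effect + rng.normal(0, 2.0)
        rows.append({"hospital": hospital, "arm": arm, "los": outcome})
df = pd.DataFrame(rows)

# Random intercept per hospital accounts for within-cluster correlation.
fit = smf.mixedlm("los ~ arm", data=df, groups=df["hospital"]).fit()
print(fit.summary())
```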

Different cut-off values would be implemented according to the prevalence of the disease at each hospital. Well-defined protocols carefully describe the implementation and the collection of data, and hospital personnel would be trained in the use of the CPA. Analyses based on a detailed and well-defined SAP (International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use Steering Committee, 1998) are helpful in creating credible and interpretable results. At this early stage in development, the choice of C and observed estimates of P would be evaluated, and the resulting PPV and NPV used in models of TV based on assessments of Vi that are obtained from this clinical trial and other data sources. Ideally, the CPA would be updated or refined for use in confirmatory trials so that it is broadly applicable across as many hospital environments as envisioned for the CPA. With clinical outcome data available, developers would be able to create the initial version of a Model Facts Sheet (Sendak et al., 2020), which is a brief, standardized description of when and how the model would be used in practice. With a fit-for-purpose approach, which is detailed in the Discussion section, this may be sufficient for clinical testing and moving to the implementation phase.

3.4. Evaluation

With the intervention successfully tested in smaller-scale settings and fine-tuned based on initial data, it is ready for full-scale testing. For other interventions, this evaluation most often involves prospective application of the intervention according to an exacting protocol, data collection plan, and SAP with a suitable control group for making unbiased estimates of the intervention effect. The SAP often includes analyses of secondary outcomes and sensitivity analyses to evaluate the impact of assumptions on the results. In pharmaceuticals, these Phase 3 trials are the basis for regulatory submissions worldwide, and if sufficient efficacy and safety are demonstrated, for regulatory approval that includes a product label defining the conditions of use for the treatment. Similar rigor is used for biomarker validation and complex interventions resulting in formal recommendations for the process of implementation for biomarkers and policy recommendations for health policy interventions. Adverse events are always a primary consideration in pharmaceutical research, and unintended negative consequences or undesirable outcomes are studied in other interventional research.

The evaluation of a CPA in practice may benefit from being conducted in larger multicenter trials to appropriately assess its efficacy (correct diagnosis/prognosis), safety (incorrect diagnoses/prognoses), and TV when compared to existing systems or human decision-making processes (i.e., the control group). Zhang et al. (2022) have noted that to address data shift, “large, well-designed, well-labelled, diverse and multi-institutional datasets … are critical” particularly for “mitigating racial and socioeconomic biases.” Of critical importance is the selection of an appropriate clinical outcome that represents a potential improvement in patient care. As with clinical trials of pharmaceutical agents, there can and should be secondary measures of patient benefit as well as benefits to the hospital and health care system. Furthermore, negative outcomes, adverse events, and misdiagnoses are important elements of assessing benefit-risk of CPA implementation. Using randomization, blinding, and a suitable control group to make cause-and-effect inference about the CPA plus existing systems versus existing systems alone adds to the scientific credibility of any findings.

The selected cut-off (C) would be defined for each hospital/setting based on prevalence (P) and the relationship to Total Value (TV) determined from the feasibility phase. This adds some complexity to the definition of such trials but can be incorporated into the protocol. Furthermore, each institution may have different workflows that require some adaptability of the protocol. The flexible definition of C and customization of the workflow provides the best opportunity to optimize the CPA and also demonstrates how successful implementation can be generalized to other institutions not involved in the clinical trial. Consistent, comprehensive training of investigators and site staff on any new clinical workflow processes or decision-support systems helps to evaluate the practical use of the CPAs. A well-defined SAP established before the initiation of the study is beneficial for ensuring that the design of the trial meets its objectives as well as reducing post hoc data analysis bias. Again, cluster randomized trials could be considered. These CPA confirmatory studies might be conducted in different types of hospitals to ensure the CPA tool is generalizable and configurable and can be widely used in geographically and socioeconomically diverse clinical care systems. Analogous to Phase 3 trials in drug development, we envision this to be the basis for a regulatory submission, review, and, if deemed clinically effective under the conditions of use, approval. As with a regulatory approved drug label, the Model Facts Sheet would be updated with relevant clinical trial information (i.e., context) and estimated benefits and risks on clinical outcomes.

3.5. Implementation

Research and evaluation continue when an intervention is adopted in real-world settings. Medical interventions are compared to other standards of care, and safety in the broader patient population is tracked. Interventions are also investigated for use in other populations or related diseases, or in the case of devices or surgical procedures, for updated technology. Predictive biomarker implementation evaluates the uptake and cost of screening, as well as enhanced screening protocols to maximize the benefit of the biomarker in clinical practice. Complex interventions also evaluate transferability of the policy to other settings, estimate uptake of the recommended policy, and, importantly, the economic impact of the intervention.

CPAs that are deployed at scale, whether approved formally by regulators or not, require ongoing monitoring in real-world settings, as the assumptions and the data used in the initial development of the algorithms may change over time or regionally. For example, data shift clearly occurred during the COVID-19 pandemic as disease prevalence continued to change due to natural, behavioral, societal, and environmental causes such as the emergence of variants, pandemic fatigue, and the introduction of vaccines (US Centers for Disease Control and Prevention, 2022). Furthermore, ongoing monitoring and observational research can formally assess biases that may have crept unknowingly into the process leading to inequitable care delivery. Lastly, researchers can expand a CPA’s utility to other populations or other clinical environments or compare utility and TV of different CPAs using clinical trials or observational studies. For example, assessing whether a CPA designed for mild disease works in severe cases and addressing what must be adjusted in its implementation may be important (e.g., using different cut-off values based on different prevalence rates for mild disease versus severe disease). The same can be said for extrapolation from adult to pediatric populations or extension from an intensive care unit to other hospital settings. In each case, defining clinical outcomes that represent a benefit to patients or the health care system is appropriate, as is defining a process for collecting undesirable, harmful, and more costly outcomes resulting from the use of the CPA in real-world settings.

As with drug development, biomarker qualification, and complex interventions, objectives and questions must be clearly defined for the research at each phase. Reliable answers to those key questions arise from detailed protocols related to workflows for using the CPA, well-defined primary and secondary clinical outcome measures, and prespecified, precise statistical analysis plans. Those answers will determine whether a product, in this case the CPA, requires more research in the current phase, advances to the next phase, or is abandoned for lack of utility.

4. Barriers to Adoption

We often hear that 20% of human problems derive from technical matters and 80% derive from social considerations. The recommendations in the previous section have well-known and established technical solutions; changes in mindset, strategy, and behavior represent the biggest barriers to realizing these recommendations. We can learn from history in this regard.

These recommendations may seem burdensome, but what is happening today in the CPA arena is analogous to the era before the Kefauver-Harris Amendment to the Food, Drug, and Cosmetic Act (1962), which underpins the FDA’s process for the development, review, and approval of new drugs and devices. At that time, no clear standards existed for demonstrating efficacy of a new treatment, and the FDA had much weaker authority to regulate new products. The tipping point for increased regulation came from the thalidomide disaster (Vargesson, 2015), in which thalidomide was approved in some countries outside the United States for treating morning sickness in pregnant women. Its use caused a dramatic increase in the incidence of newborns with severely deformed limbs, sparking an international outcry for greater drug development regulation. When the Kefauver-Harris Amendment was enacted, it included a requirement that developers produce “substantial evidence” of the effectiveness of a drug. That amendment defines “substantial evidence” as evidence that consists of:

adequate and well-controlled investigations, including clinical investigations, by experts qualified by scientific training and experience to evaluate the effectiveness of the drug involved, on the basis of which it could fairly and responsibly be concluded by such experts that the drug will have the effect it purports or is represented to have under the conditions of use prescribed, recommended, or suggested in the labeling or proposed labeling thereof.

The law generated many hearings and vigorous debates in the halls of Congress (Harris, 1964) and at medical meetings about blinding investigators in clinical trials, the use of placebo control groups, and other design features that we now consider routine. The notion of complex, prolonged, randomized, blinded, placebo-controlled clinical trials was initially met with resistance, but notable medical leaders argued that such trials were an “indispensable ordeal,” necessary for sorting the wheat from the chaff (Fredrickson, 1968). The pharmaceutical industry, in conjunction with regulatory agencies and academia, has since developed a highly advanced system of adequate and well-controlled trials (AWCT) leading to remarkable advances in medicinal treatments and other interventions to improve patient health, well-being, and longevity.

Similarly, through the 1980s and 1990s, the scientific community began developing a deeper understanding of the molecular basis of disease. Researchers began identifying biomarkers of disease pathways that were used to diagnose disease, predict clinical outcomes, or select appropriate treatments. That era was similar to the pre–Kefauver-Harris Amendment era in that researchers were making discoveries rapidly but without the phased development process noted herein, especially the elements of AWCTs to generate substantial evidence of their utility. As described by Feng and Pepe (2020):

Researchers have a widely accepted guideline for moving a new drug target through the developmental pipeline, from phase 1 to 3 for FDA approval and phase 4 after market surveillance. A culture has been established and matured in conducting these trials. … Neither this culture nor these elements existed in 2000 for the cancer early-detection field: Isolated laboratory investigators received convenient specimens from their clinical collaborators and applied their favorite technology to search for biomarkers. Non-reproducible but highly acclaimed findings were rampant, but few successful translational products made it to clinical use. (p. 2575)

The EDRN has since established a phased process of development and a culture of rigor.

The acknowledgment that CPAs are medical interventions may come with some objections, and thus, the comprehensive phased approach described in the previous section may seem to be a waste of time and money. Encumbering the process of ‘digital medicine’ with substantial overhead may delay the implementation of tools that can improve patient care and avert deleterious outcomes. We offer some points to consider in this regard.

In our view, it is difficult not to see a CPA as directly analogous to a diagnostic test; the only difference is that a biological assay of some tissue or fluid is replaced by an algorithm applied to data. CPA developers use the language and tools of diagnostic testing (e.g., AUC) to guide their work, though as we have pointed out, such development needs to go further with considerations of PPV, NPV, and Total Value. CPAs are intended to intervene in the care of the patient by providing a diagnosis or prognosis, thereby alerting caretakers of the need to institute a new medical course of action. This is why it is important for clinical trials of CPAs to define outcomes that are directly related to patient morbidity or mortality or to systemwide reductions in the cost and resources needed to deliver quality care. Because of the critical nature of these considerations, a recent U.S. Food and Drug Administration Guidance declares that many CPAs come under their regulatory authority and, therefore, require more rigorous testing and validation prior to approval and implementation (U.S. Food and Drug Administration, 2022a).

As noted earlier, clinical trials and observational research for other medical interventions can be lengthy and costly endeavors. Recruiting patients into clinical trials takes many months and often years. Providing and paying for their care during the clinical trial is very expensive. Evaluation periods can last for years. Similar scenarios exist for biomarker and complex intervention research. For CPA development, feasibility and evaluation in a clinical setting could be much simpler. Since many CPAs are intended for hospital use and can be embedded in existing electronic medical record (EMR) systems, there is a ready supply of patients. Furthermore, whether a diagnosis or prognosis from a CPA is confirmed or not often can be determined in a matter of days or weeks, thus dramatically shortening the follow-up time needed to assess clinical safety and efficacy outcomes. While our proposal suggests the use of multicenter trials and the necessary overhead of aligning many centers on a common protocol (e.g., training of staff on the CPA, embedding the CPA in the EMR system), randomization is easier at the hospital level for a CRT than for individual patients in a medical intervention trial. Data collection is also facilitated by the EMR itself. Thus, we envision trials with considerably reduced cost and elapsed time compared with other interventions.

Science often requires a deliberate approach. The digital world is enamored with speed, but fast to the wrong answer is not helpful to anyone. We must keep the ‘science’ in data science.

5. Discussion and Recommendations

Various proposals have been made to improve the development, testing, validation, and reporting of diagnostic or prediction models such as TRIPOD, STROBE, CONSORT-AI, SPIRIT-AI, and various regulatory documents (Collins et al., 2015; Elm et al., 2007; European Medicines Agency, 2020; International Coalition of Medicines Regulatory Authorities, 2021; Liu et al., 2020; Rivera et al., 2020; U.S. Food and Drug Administration, 2018a, 2018b, 2021; Vollmer et al., 2020). These proposals generally focus on checklists for evaluating the quality of individual studies or results of statistical modeling approaches. None have gone as far as proposing the optimization of TV or a rigorous, sequential development process, as we propose here. While the TV is always of interest, we recognize that not all algorithms should be required to complete a full, phased development program. Comparisons can be made to over-the-counter remedies being more lightly regulated, generic product approvals requiring only bioequivalence studies, and updated and revised medical devices being approved based on minimal additional data.

The FDA final guidance on clinical decision support software (U.S. Food and Drug Administration, 2022a) contains many details and examples of what constitutes software as a medical device (SaMD) and what is non-device software. Within that guidance, FDA refers to the International Medical Device Regulators Forum’s (IMDRF) framework for a risk-based approach to regulating algorithms such as CPAs (IMDRF Software as a Medical Device (SaMD) Working Group, 2014). That framework, displayed in Table 3, categorizes the importance of the CPA according to the significance of the decision to be made and the seriousness of the health care context.

Table 3. SaMD categories established in the IMDRF framework. Rows give the state of the health care situation or condition; columns give the significance of the information provided by the SaMD to the health care decision.

| State of health care situation or condition | Treat or diagnose | Drive clinical management | Inform clinical management |
| --- | --- | --- | --- |
| Critical | IV | III | II |
| Serious | III | II | I |
| Non-serious | II | I | I |

Note. Taken from the International Medical Device Regulators Forum (IMDRF) (IMDRF Software as a Medical Device (SaMD) Working Group, 2014). SaMD = software as a medical device.

We believe that IMDRF categories III–IV in Table 3 require a full development program supported by confirmatory trials and substantial evidence resulting in clear instructions for use, analogous to a regulatory approved drug label. Categories I–II may require lesser levels of evidence but should involve at least some prospective clinical trials to substantiate their use (e.g., feasibility studies as defined herein) and an abbreviated Model Facts Sheet. This proposal is a starting point, and more mature drug development concepts such as accelerated approval, priority review, and conditional approval are concepts that can be explored, but first, a culture of testing CPAs in prospective clinical trials with clinically meaningful outcomes needs to be established.

It appears that many of today’s CPAs fail in clinical implementation and adoption because they are not fully evaluated beyond the very early stages of discovery (Panch et al., 2019) (Textbox 2). As noted, the failures can arise from overreliance on AUC as the terminus of CPA development, ignorance of data shift (which has many manifestations for introducing bias), or misunderstanding of the human factors related to workflow and patient care. The magnitude of this problem cannot be overstated: lack of credibility could discourage clinicians from using sophisticated data and analytics to improve patient outcomes (Ghassemi et al., 2021). Furthermore, electronic algorithms can be quickly copied, installed, and promulgated across computer systems and geographies, dramatically magnifying the dangers of their misunderstanding and misuse; this reality argues for even greater rigor in their development, testing, and clinical validation, and in reporting their benefits and risks under the proposed conditions of use.

Textbox 2. CPAs to predict onset of sepsis.

Overreliance on area under the receiver operating characteristic curve (AUC) has played out in the evaluation of a predictive sepsis algorithm that has been part of a prominent, commercial electronic health records system (Bennett et al., 2018). The developer has published its stated AUC range, and some institutions appear to be satisfied with its use (Wong et al., 2021). Other institutions or health care systems have published contrasting estimates of AUC in their clinical settings and have questioned the reliability of the algorithm and its utility (Lyons et al., 2021; Tarabichi et al., 2022). As previously noted, AUC is but a piece of the puzzle for understanding the whole picture of a diagnostic test—bioanalytical or digital. Though full details are not published, there appears to be a different prevalence of sepsis across these institutions. Perhaps the Vi for TP, TN, FP, and FN differ for the different patient populations served at these hospitals.

Health care providers and academics have paid considerable attention to sepsis algorithms and their clinical utility, to the extent that some controversies have spilled into the lay press (Ross, 2022; Shashikumar et al., 2021). High false positive rates have led health care personnel to ignore alerts, and some institutions have modified workflows to lessen the burden on personnel and patients arising from too many positive alerts. This highlights the lessons from complex intervention research in which clinical context and care processes need careful consideration. Others are developing novel algorithms within clinical trials to evaluate the impact of such CPAs, albeit on surrogate measures such as ‘time to antibiotic use’ rather than patient outcomes like survival or health care outcomes like length of hospital stay or overall costs (Adams et al., 2022).
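The contrast described in the textbox can be made concrete with a small calculation. In this minimal Python sketch, the sensitivity, specificity, and prevalence values are illustrative assumptions only, not the published operating characteristics of any particular sepsis model: the same sensitivity and specificity yield very different positive predictive values when sepsis prevalence differs across hospitals.

```python
# Minimal sketch with hypothetical numbers (not any published sepsis model's
# operating characteristics): a fixed sensitivity and specificity produce very
# different positive predictive values at different sepsis prevalences.

def ppv(sens: float, spec: float, prev: float) -> float:
    """Positive predictive value via Bayes' rule."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

sens, spec = 0.80, 0.90          # assumed operating point
for prev in (0.02, 0.10, 0.25):  # assumed prevalences at three hospitals
    print(f"prevalence {prev:.2f}: PPV = {ppv(sens, spec, prev):.2f}")
# prevalence 0.02: PPV = 0.14
# prevalence 0.10: PPV = 0.47
# prevalence 0.25: PPV = 0.73
```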

When evaluating a CPA for IMDRF categories III–IV, we believe it is reasonable to apply the “substantial evidence” standard described in the Kefauver-Harris Amendment to the Food, Drug, and Cosmetic Act. A CPA is a new product intended to improve patient care and, therefore, requires careful study to quantify its benefits (TPs and TNs) and unintended harms (FPs and FNs). Rigorous reviews and approvals could result in a regulatory-approved Model Facts Sheet that describes when, how, and in whom the new CPA product is useful; when it is not useful or is harmful (similar to warnings in a drug label); and the clinical study results that support its use, much like the package insert for a new pharmaceutical treatment.

The data science community has arrived at its reckoning point and needs to evolve to a more deliberate, rigorous process for evaluating CPAs and their clinical utility. Similar reckonings occurred for drugs in the 1960s and for biomarkers in the 1990s. There is no lack of effort or productivity in the discovery and early testing of CPAs; it is relatively easy for anyone with a computer and access to data to apply an AI/ML algorithm to discover patterns in the data and deliver purported answers. The difficulty lies in assessing the quality of those answers. No matter how sophisticated or elaborate the validation scheme for such algorithms, until they are tested and their results corroborated in a broader, general clinical setting, they remain nothing more than black boxes with potential, akin to new therapeutic molecules or predictive biomarkers that show promise in the laboratory.

Data scientists and those promoting stand-alone algorithms or systems with embedded algorithms need better incentives to adhere to rigorous processes for producing substantial evidence that benefits outweigh potential harms. Until the data science community developing CPAs and the regulatory bodies overseeing them embrace the “indispensable ordeal” of what can be a demanding, stepwise process involving AWCT, our collective efforts with CPAs to improve individual patient outcomes specifically and population health generally will founder, wasting many resources. It is time to insist that CPA developers prove the worth of their tools by using well-known statistical measures of diagnostic utility and value and by following a well-worn path of phased development like those used for other medical interventions, thereby creating a reliable, reproducible standard aligned with the substantial evidence definition for new medicinal products and devices.

The scientific community has been here before—with pharmaceutical treatments, biomarkers, and complex intervention research. Good science requires systematic, controlled experiments that build on each other ultimately to improve the human condition. Data science has emerged as a new way to explore our world using the power of vast data stores, high-speed computing, and sophisticated algorithms. But the path to improving the human condition is no different. We are reminded of the T. S. Eliot quote, “We shall not cease from exploration, and the end of all our exploring will be to arrive where we started and know the place for the first time.”


Acknowledgments

The authors thank Pirinka Georgiev Tuttle and Corinne Steinbrenner for their editorial review. The authors also are very thankful to two anonymous reviewers who made valuable suggestions to improve the manuscript.

Disclosure Statement

S. J. R. is the Founder and President of Analytix Thinking, LLC. C.D. and S.M. are employees and shareholders of Pfizer, Inc. The authors declare no competing interests related to the content of this manuscript.


Supplementary Files

Supplementary Table 1. CPA Performance as a Function of the Cut-off, C

| Cut-off | True Negative | False Positive | False Negative | True Positive | Sensitivity | Specificity | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 (CL) | 15 | 10 | 0 | 20 | 1 | 0.6 | 0.78 |
| 2 | 15 | 10 | 1 | 19 | 0.95 | 0.6 | 0.76 |
| 3 | 16 | 9 | 1 | 19 | 0.95 | 0.64 | 0.78 |
| 4 | 17 | 8 | 1 | 19 | 0.95 | 0.68 | 0.80 |
| 5 | 17 | 8 | 2 | 18 | 0.9 | 0.68 | 0.78 |
| 6 | 18 | 7 | 2 | 18 | 0.9 | 0.72 | 0.80 |
| 7 | 18 | 7 | 3 | 17 | 0.85 | 0.72 | 0.78 |
| 8 | 19 | 6 | 3 | 17 | 0.85 | 0.76 | 0.80 |
| 9 | 19 | 6 | 4 | 16 | 0.8 | 0.76 | 0.78 |
| 10 | 20 | 5 | 4 | 16 | 0.8 | 0.8 | 0.80 |
| 11 (C) | 20 | 5 | 5 | 15 | 0.75 | 0.8 | 0.78 |
| 12 | 21 | 4 | 5 | 15 | 0.75 | 0.84 | 0.80 |
| 13 | 21 | 4 | 6 | 14 | 0.7 | 0.84 | 0.78 |
| 14 | 21 | 4 | 7 | 13 | 0.65 | 0.84 | 0.76 |
| 15 | 22 | 3 | 7 | 13 | 0.65 | 0.88 | 0.78 |
| 16 | 23 | 2 | 7 | 13 | 0.65 | 0.92 | 0.80 |
| 17 | 23 | 2 | 8 | 12 | 0.6 | 0.92 | 0.78 |
| 18 | 23 | 2 | 9 | 11 | 0.55 | 0.92 | 0.76 |
| 19 | 24 | 1 | 9 | 11 | 0.55 | 0.96 | 0.78 |
| 20 | 24 | 1 | 10 | 10 | 0.5 | 0.96 | 0.76 |
| 21 (CU) | 25 | 0 | 10 | 10 | 0.5 | 1 | 0.78 |

Note. From Figure 1 in the text, the vertical cut-off probability or score is moved from left to right, and at each point where an observation changes from a negative prediction to a positive prediction, a new sensitivity and specificity can be calculated. The cut-offs are simply listed in numerical order, but in practice, there would be actual scores or probabilities that are outputs of the clinical predictive algorithm (CPA).
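For readers who want to reproduce the table, the following minimal Python sketch recomputes sensitivity, specificity, and accuracy from the confusion-matrix counts at a single cut-off; the counts shown are those for cut-off 11 (C) above.

```python
# Minimal sketch: given the confusion-matrix counts at one cut-off (here the
# row for cut-off 11 (C) in Supplementary Table 1), recompute sensitivity,
# specificity, and accuracy as used throughout the article.

def performance(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return sensitivity, specificity, and accuracy from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

print(performance(tp=15, fp=5, tn=20, fn=5))
# sensitivity 0.75, specificity 0.8, accuracy 0.777... (0.78 after rounding),
# matching the cut-off 11 (C) row of the table.
```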

Supplementary Table 2. Clinical Predictive Algorithm Total Value as a Function of the Cut-off and Prevalence

| Cut-off | PPV (P = 0.1) | NPV (P = 0.1) | TV (P = 0.1) | PPV (P = 0.25) | NPV (P = 0.25) | TV (P = 0.25) | PPV (P = 0.5) | NPV (P = 0.5) | TV (P = 0.5) | PPV (P = 0.75) | NPV (P = 0.75) | TV (P = 0.75) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 (CL) | 0.22 | 1.00 | −4.6 | 0.45 | 1.00 | 1.4 | 0.71 | 1.00 | 7.9 | 0.88 | 1.00 | 12.1 |
| 2 | 0.21 | 0.99 | −5.0 | 0.44 | 0.97 | 0.4 | 0.70 | 0.92 | 5.7 | 0.88 | 0.80 | 6.9 |
| 3 | 0.23 | 0.99 | −4.5 | 0.47 | 0.97 | 1.1 | 0.73 | 0.93 | 6.3 | 0.89 | 0.81 | 7.4 |
| 4 | 0.25 | 0.99 | −4.0 | 0.50 | 0.98 | 1.8 | 0.75 | 0.93 | 7.0 | 0.90 | 0.82 | 8.0 |
| 5 | 0.24 | 0.98 | −4.4 | 0.48 | 0.95 | 0.9 | 0.74 | 0.87 | 5.2 | 0.89 | 0.69 | 4.7 |
| 6 | 0.26 | 0.98 | −3.8 | 0.52 | 0.96 | 1.8 | 0.76 | 0.88 | 6.0 | 0.91 | 0.71 | 5.3 |
| 7 | 0.25 | 0.98 | −4.3 | 0.50 | 0.94 | 1.0 | 0.75 | 0.83 | 4.5 | 0.90 | 0.62 | 2.9 |
| 8 | 0.28 | 0.98 | −3.5 | 0.54 | 0.94 | 2.0 | 0.78 | 0.84 | 5.4 | 0.91 | 0.63 | 3.6 |
| 9 | 0.27 | 0.97 | −4.0 | 0.53 | 0.92 | 1.1 | 0.77 | 0.79 | 4.0 | 0.91 | 0.56 | 1.7 |
| 10 | 0.31 | 0.97 | −3.0 | 0.57 | 0.92 | 2.4 | 0.80 | 0.80 | 5.0 | 0.92 | 0.57 | 2.4 |
| 11 (C) | 0.29 | 0.97 | −3.5 | 0.56 | 0.91 | 1.5 | 0.79 | 0.76 | 3.8 | 0.92 | 0.52 | 0.9 |
| 12 | 0.34 | 0.97 | −2.2 | 0.61 | 0.91 | 3.0 | 0.82 | 0.77 | 4.9 | 0.93 | 0.53 | 1.5 |
| 13 | 0.33 | 0.96 | −2.8 | 0.59 | 0.89 | 2.2 | 0.81 | 0.74 | 3.8 | 0.93 | 0.48 | 0.3 |
| 14 | 0.31 | 0.96 | −3.3 | 0.58 | 0.88 | 1.3 | 0.80 | 0.71 | 2.7 | 0.92 | 0.44 | −0.8 |
| 15 | 0.38 | 0.96 | −1.7 | 0.64 | 0.88 | 3.2 | 0.84 | 0.72 | 4.0 | 0.94 | 0.46 | −0.1 |
| 16 | 0.47 | 0.96 | 0.8 | 0.73 | 0.89 | 5.4 | 0.89 | 0.72 | 5.4 | 0.96 | 0.47 | 0.7 |
| 17 | 0.45 | 0.95 | 0.2 | 0.71 | 0.87 | 4.7 | 0.88 | 0.70 | 4.5 | 0.96 | 0.43 | −0.2 |
| 18 | 0.43 | 0.95 | −0.5 | 0.70 | 0.86 | 3.9 | 0.87 | 0.67 | 3.6 | 0.95 | 0.41 | −1.0 |
| 19 | 0.60 | 0.95 | 3.9 | 0.82 | 0.86 | 7.1 | 0.93 | 0.68 | 5.3 |  | 0.42 | −0.2 |
| 20 | 0.58 | 0.95 | 3.2 | 0.81 | 0.85 | 6.5 | 0.93 | 0.66 | 4.6 | 0.97 | 0.39 | −0.9 |
| 21 (CU) | 1.00 | 0.95 | 13.7 | 1.00 | 0.86 | 11.4 | 1.00 | 0.67 | 6.7 | 1.00 | 0.40 | 0.0 |

Note. Total value (TV) was derived for all possible cut-off values from Figure 1 in the body of the article. All possible cut-off values are used here, as in Supplementary Table 1, to calculate positive predictive value (PPV) and negative predictive value (NPV) at each prevalence (P). TV for each cut-off value is calculated using the values VTP = 20, VFP = −15, VTN = 5, and VFN = −20 from the illustrative example in Textbox 1 of the body of the article.
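As a cross-check of the table, the following minimal Python sketch recomputes the PPV and NPV columns from each cut-off's sensitivity and specificity (Supplementary Table 1) and an assumed prevalence; the example uses cut-off 4. The TV column additionally requires combining these with the VTP, VFP, VTN, and VFN weights according to the formula given in the body of the article, which is not reproduced here.

```python
# Minimal sketch: reproduce the PPV and NPV entries of Supplementary Table 2
# from a cut-off's sensitivity and specificity plus an assumed prevalence.
# TV additionally needs the V_TP, V_FP, V_TN, V_FN weighting described in the
# article body, so it is not computed in this sketch.

def predictive_values(sens: float, spec: float, prev: float) -> tuple[float, float]:
    """Positive and negative predictive values via Bayes' rule."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# Cut-off 4 in Supplementary Table 1: sensitivity 0.95, specificity 0.68.
for prev in (0.1, 0.25, 0.5, 0.75):
    ppv, npv = predictive_values(0.95, 0.68, prev)
    print(f"prevalence {prev:.2f}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
# prevalence 0.10: PPV = 0.25, NPV = 0.99  (matches the cut-off 4 row above)
# prevalence 0.25: PPV = 0.50, NPV = 0.98
# prevalence 0.50: PPV = 0.75, NPV = 0.93
# prevalence 0.75: PPV = 0.90, NPV = 0.82
```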


References

Adams, R., Henry, K. E., Sridharan, A., Soleimani, H., Zhan, A., Rawat, N., Johnson, L., Hager, D. N., Cosgrove, S. E., Markowski, A., Klein, E. Y., Chen, E. S., Saheed, M. O., Henley, M., Miranda, S., Houston, K., Linton, R. C., Ahluwalia, A. R., Wu, A. W., & Saria, S. (2022). Prospective, multi-site study of patient outcomes after implementation of the TREWS machine learning-based early warning system for sepsis. Nature Medicine, 28(7), 1455–1460. https://doi.org/10.1038/s41591-022-01894-0

Artificial intelligence in health care: Within touching distance [Editorial]. (2017). The Lancet, 390(10114), 2739. https://doi.org/10.1016/s0140-6736(17)31540-4

Bennett, T., Russell, S., King, J., Schilling, L., Voong, C., Rogers, N., Adrian, B., Bruce, N., & Ghosh, D. (2018). Accuracy of the Epic Sepsis Prediction Model in a regional health system. ArXiv. https://doi.org/10.48550/arXiv.1902.07276

Campbell, M. J., & Walters, S. J. (2014). How to design, analyse and report cluster randomised trials in medicine and health related research. John Wiley & Sons. https://onlinelibrary-wiley-com.eu1.proxy.openathens.net/doi/10.1002/9781118763452.ch1

Choy, G., Khalilzadeh, O., Michalski, M., Do, S., Samir, A. E., Pianykh, O. S., Geis, J. R., Pandharipande, P. V., Brink, J. A., & Dreyer, K. J. (2018). Current applications and future impact of machine learning in radiology. Radiology, 288(2), 318–328. https://doi.org/10.1148/radiol.2018171820

Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., & Calster, B. V. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110, 12–22. https://doi.org/10.1016/j.jclinepi.2019.02.004

Heagerty, P. J., & DeLong, E. R., for the NIH Health Care Systems Research Collaboratory Biostatistics and Study Design Core. (2017, August 25). Experimental designs and randomization schemes: Cluster randomized trials – ARCHIVED. In Rethinking clinical trials: A living textbook of pragmatic clinical trials. NIH Pragmatic Trials Collaboratory. https://doi.org/10.28929/004

Collins, G. S., Reitsma, J. B., Altman, D. G., & Moons, K. G. (2015). Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD Statement. BMC Medicine, 13(1), Article 1. https://doi.org/10.1186/s12916-014-0241-z

Connell, A., Montgomery, H., Martin, P., Nightingale, C., Sadeghi-Alavijeh, O., King, D., Karthikesalingam, A., Hughes, C., Back, T., Ayoub, K., Suleyman, M., Jones, G., Cross, J., Stanley, S., Emerson, M., Merrick, C., Rees, G., Laing, C., & Raine, R. (2019). Evaluation of a digitally-enabled care pathway for acute kidney injury management in hospital emergency admissions. Npj Digital Medicine, 2(1), Article 67. https://doi.org/10.1038/s41746-019-0100-6

DeMasi, O., Kording, K., & Recht, B. (2017). Meaningless comparisons lead to false optimism in medical machine learning. PLoS ONE, 12(9), Article e0184604. https://doi.org/10.1371/journal.pone.0184604

Dockès, J., Varoquaux, G., & Poline, J.-B. (2021). Preventing dataset shift from breaking machine-learning biomarkers. GigaScience, 10(9), Article giab055. https://doi.org/10.1093/gigascience/giab055

Drug Amendments Act of 1962, Pub. L. No. 87-781, 76 Stat. 780. (1962). https://www.govinfo.gov/content/pkg/STATUTE-76/pdf/STATUTE-76-Pg780.pdf

Elm, E. von, Altman, D. G., Egger, M., Pocock, S. J., Gøtzsche, P. C., Vandenbroucke, J. P., & Initiative, S. (2007). Strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. BMJ, 335(7624), Article 806. https://doi.org/10.1136/bmj.39335.541782.ad

Engelhard, M. M., Navar, A. M., & Pencina, M. J. (2021). Incremental benefits of machine learning—When do we need a better mousetrap? JAMA Cardiology, 6(6), 621–623. https://doi.org/10.1001/jamacardio.2021.0139

European Medicines Agency. (2020). EMA regulatory science to 2025- Strategic reflection. https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/ema-regulatory-science-2025-strategic-reflection_en.pdf

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874. https://doi.org/10.1016/j.patrec.2005.10.010

Fawzy, A., Wu, T. D., Wang, K., Robinson, M. L., Farha, J., Bradke, A., Golden, S. H., Xu, Y., & Garibaldi, B. T. (2022). Racial and ethnic discrepancy in pulse oximetry and delayed identification of treatment eligibility among patients with COVID-19. JAMA Internal Medicine, 182(7), 730–738. https://doi.org/10.1001/jamainternmed.2022.1906

Feng, Z., & Pepe, M. S. (2020). Adding rigor to biomarker evaluations—EDRN experience. Cancer Epidemiology and Prevention Biomarkers, 29(12), 2575–2582. https://doi.org/10.1158/1055-9965.epi-20-0240

Fredrickson, D. S. (1968). The field trial: Some thoughts on the indispensable ordeal. Bulletin of the New York Academy of Medicine, 44(8), 985–993.

Ghaderzadeh, M., & Asadi, F. (2021). Deep learning in the detection and diagnosis of COVID-19 using radiology modalities: A systematic review. Journal of Healthcare Engineering, 2021, Article 6677314. https://doi.org/10.1155/2021/6677314

Ghassemi, M., Oakden-Rayner, L., & Beam, A. L. (2021). The false hope of current approaches to explainable artificial intelligence in health care. The Lancet Digital Health, 3(11), e745–e750. https://doi.org/10.1016/s2589-7500(21)00208-9

Harris, R. (1964). The real voice. Macmillan.

International Coalition of Medicines Regulatory Authorities. (2021, August 6). ICMRA Informal Innovation Network horizon scanning assessment report—Artificial intelligence. http://www.icmra.info/drupal/sites/default/files/2021-08/horizon_scanning_report_artificial_intelligence.pdf

International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use Steering Committee. (1998, September 1). Guidance for industry. E9 Statistical Principles for Clinical Trials. U.S. Food and Drug Administration. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/e9-statistical-principles-clinical-trials

IMDRF Software as a Medical Device (SaMD) Working Group. (2014, September 18). “Software as a medical device”: Possible framework for risk categorization and corresponding considerations. International Medical Device Regulators Forum. https://www.imdrf.org/sites/default/files/docs/imdrf/final/technical/imdrf-tech-140918-samd-framework-risk-categorization-141013.pdf

Is digital medicine different? [Editorial]. (2018). The Lancet, 392(10142), 95. https://doi.org/10.1016/s0140-6736(18)31562-9

Kappen, T. H., Klei, W. A. van, Wolfswinkel, L. van, Kalkman, C. J., Vergouwe, Y., & Moons, K. G. M. (2018). Evaluating the impact of prediction models: Lessons learned, challenges, and recommendations. Diagnostic and Prognostic Research, 2(1), Article 11. https://doi.org/10.1186/s41512-018-0033-6

Keane, P. A., & Topol, E. J. (2018). With an eye to AI and autonomous diagnosis. Npj Digital Medicine, 1(1), Article 40. https://doi.org/10.1038/s41746-018-0048-y

Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G., & King, D. (2019). Key challenges for delivering clinical impact with artificial intelligence. BMC Medicine, 17(1), Article 195. https://doi.org/10.1186/s12916-019-1426-2

Kim, D. W., Jang, H. Y., Kim, K. W., Shin, Y., & Park, S. H. (2019). Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: Results from recently published papers. Korean Journal of Radiology, 20(3), 405–410. https://doi.org/10.3348/kjr.2019.0025

Landers, M., Saria, S., & Espay, A. J. (2021). Will artificial intelligence replace the movement disorders specialist for diagnosing and managing Parkinson’s Disease? Journal of Parkinson’s Disease, 11(s1), S117–S122. https://doi.org/10.3233/jpd-212545

Liu, X., Rivera, S. C., Moher, D., Calvert, M. J., Denniston, A. K., Chan, A.-W., Darzi, A., Holmes, C., Yau, C., Ashrafian, H., Deeks, J. J., Ruffano, L. F. di, Faes, L., Keane, P. A., Vollmer, S. J., Lee, A. Y., Jonas, A., Esteva, A., Beam, A. L., … Rowley, S. (2020). Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Nature Medicine, 26(9), 1364–1374. https://doi.org/10.1038/s41591-020-1034-x

Lyons, P. G., Ramsey, B., Simkins, M., & Maddox, T. M. (2021, May 1). How useful is the Epic Sepsis Prediction Model for predicting sepsis? TP14. TP014 Diagnostic and Screening Insights in Pulmonary, Critical Care, and Sleep, A1580. https://doi.org/10.1164/ajrccm-conference.2021.203.1_meetingabstracts.a1580

McKinney, S. M., Sieniek, M., Godbole, V., Godwin, J., Antropova, N., Ashrafian, H., Back, T., Chesus, M., Corrado, G. S., Darzi, A., Etemadi, M., Garcia-Vicente, F., Gilbert, F. J., Halling-Brown, M., Hassabis, D., Jansen, S., Karthikesalingam, A., Kelly, C. J., King, D., … Shetty, S. (2020). International evaluation of an AI system for breast cancer screening. Nature, 577(7788), 89–94. https://doi.org/10.1038/s41586-019-1799-6

Michelessi, M., Lucenteforte, E., Miele, A., Oddone, F., Crescioli, G., Fameli, V., Korevaar, D. A., & Virgili, G. (2017). Diagnostic accuracy research in glaucoma is still incompletely reported: An application of Standards for Reporting of Diagnostic Accuracy Studies (STARD) 2015. PLOS ONE, 12(12), Article e0189716. https://doi.org/10.1371/journal.pone.0189716

Moons, K. G. M., Stijnen, T., Michel, B. C., Büller, H. R., Es, G.-A. V., Grobbee, D. E., & Habbema, J. D. F. (1997). Application of treatment thresholds to diagnostic-test evaluation: An alternative to the comparison of areas under receiver operating characteristic curves. Medical Decision Making, 17(4), 447–454. https://doi.org/10.1177/0272989x9701700410

Murray, D. M., Taljaard, M., Turner, E. L., & George, S. M. (2020). Essential ingredients and innovations in the design and analysis of group-randomized trials. Annual Review of Public Health, 41, 1–19.

Nagendran, M., Chen, Y., Lovejoy, C. A., Gordon, A. C., Komorowski, M., Harvey, H., Topol, E. J., Ioannidis, J. P. A., Collins, G. S., & Maruthappu, M. (2020). Artificial intelligence versus clinicians: Systematic review of design, reporting standards, and claims of deep learning studies. BMJ, 368, m689. https://doi.org/10.1136/bmj.m689

Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453. https://doi.org/10.1126/science.aax2342

Ouyang, D., He, B., Ghorbani, A., Yuan, N., Ebinger, J., Langlotz, C. P., Heidenreich, P. A., Harrington, R. A., Liang, D. H., Ashley, E. A., & Zou, J. Y. (2020). Video-based AI for beat-to-beat assessment of cardiac function. Nature, 580(7802), 252–256. https://doi.org/10.1038/s41586-020-2145-8

Panch, T., Mattie, H., & Celi, L. A. (2019). The “inconvenient truth” about AI in healthcare. Npj Digital Medicine, 2(1), Article 77. https://doi.org/10.1038/s41746-019-0155-4

Park, Y., Jackson, G. P., Foreman, M. A., Gruen, D., Hu, J., & Das, A. K. (2020). Evaluating artificial intelligence in medicine: phases of clinical research. JAMIA Open, 3(3), 326–331. https://doi.org/10.1093/jamiaopen/ooaa033

Pauker, S. G., & Kassirer, J. P. (1980). The threshold approach to clinical decision making. New England Journal of Medicine, 302, 1109–1117. https://doi.org/10.1056/nejm198005153022003

Pepe, M. S., Etzioni, R., Feng, Z., Potter, J. D., Thompson, M. L., Thornquist, M., Winget, M., & Yasui, Y. (2001). Phases of biomarker development for early detection of cancer. JNCI: Journal of the National Cancer Institute, 93(14), 1054–1061. https://doi.org/10.1093/jnci/93.14.1054

Pladevall, M., Simpkins, J., Donner, A., & Nerenz, D. R. (2008). Designing multi-center cluster randomized trials: An introductory toolkit. NIH Collaboratory. https://dcricollab.dcri.duke.edu/sites/NIHKR/KR/Designing%20CRTs-IntroductoryToolkit.pdf

Pouwels, K. B., Widyakusuma, N. N., Groenwold, R. H. H., & Hak, E. (2016). Quality of reporting of confounding remained suboptimal after the STROBE guideline. Journal of Clinical Epidemiology, 69, 217–224. https://doi.org/10.1016/j.jclinepi.2015.08.009

Rivera, S. C., Liu, X., Chan, A.-W., Denniston, A. K., Calvert, M. J., Darzi, A., Holmes, C., Yau, C., Moher, D., Ashrafian, H., Deeks, J. J., Ruffano, L. F. di, Faes, L., Keane, P. A., Vollmer, S. J., Lee, A. Y., Jonas, A., Esteva, A., Beam, A. L., … Rowley, S. (2020). Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI extension. Nature Medicine, 26(9), 1351–1363. https://doi.org/10.1038/s41591-020-1037-7

Roberts, M., Driggs, D., Thorpe, M., Gilbey, J., Yeung, M., Ursprung, S., Aviles-Rivero, A. I., Etmann, C., McCague, C., Beer, L., Weir-McCall, J. R., Teng, Z., Gkrania-Klotsas, E., Ruggiero, A., Korhonen, A., Jefferson, E., Ako, E., Langs, G., Gozaliasl, G., … Schönlieb, C.-B. (2021). Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence, 3(3), 199–217. https://doi.org/10.1038/s42256-021-00307-0

Ross, C. (2022, October 24). Epic’s overhaul of a flawed algorithm shows why AI oversight is a life-or-death issue. STAT. https://www.statnews.com/2022/10/24/epic-overhaul-of-a-flawed-algorithm/

Schwartz, R., Vassilev, A., Greene, K., Perine, L., Burt, A., & Hall, P. (2022). Towards a standard for identifying and managing bias in artificial intelligence. NIST. https://doi.org/10.6028/nist.sp.1270

Sendak, M. P., Gao, M., Brajer, N., & Balu, S. (2020). Presenting machine learning model information to clinical end users with model facts labels. Npj Digital Medicine, 3(1), Article 41. https://doi.org/10.1038/s41746-020-0253-3

Shashikumar, S. P., Wardi, G., Malhotra, A., & Nemati, S. (2021). Artificial intelligence sepsis prediction algorithm learns to say “I don’t know.” Npj Digital Medicine, 4(1), Article 134. https://doi.org/10.1038/s41746-021-00504-6

Shen, D., Wu, G., & Suk, H.-I. (2016). Deep learning in medical image analysis. Annual Review of Biomedical Engineering, 19, 221–248. https://doi.org/10.1146/annurev-bioeng-071516-044442

Siu, A. L. (2016). Screening for breast cancer: U.S. Preventive Services Task Force Recommendation statement. Annals of Internal Medicine, 164(4), 279–296. https://doi.org/10.7326/m15-2886

Sjoding, M. W., Dickson, R. P., Iwashyna, T. J., Gay, S. E., & Valley, T. S. (2020). Racial bias in pulse oximetry measurement. New England Journal of Medicine, 383(25), 2477–2478. https://doi.org/10.1056/nejmc2029240

Skivington, K., Matthews, L., Simpson, S. A., Craig, P., Baird, J., Blazeby, J. M., Boyd, K. A., Craig, N., French, D. P., McIntosh, E., Petticrew, M., Rycroft-Malone, J., White, M., & Moore, L. (2021). A new framework for developing and evaluating complex interventions: Update of Medical Research Council guidance. BMJ, 374, n2061. https://doi.org/10.1136/bmj.n2061

Sperrin, M., Grant, S. W., & Peek, N. (2020). Prediction models for diagnosis and prognosis in Covid-19. BMJ, 369, m1464. https://doi.org/10.1136/bmj.m1464

Tarabichi, Y., Cheng, A., Bar-Shain, D., McCrate, B. M., Reese, L. H., Emerman, C., Siff, J., Wang, C., Kaelber, D. C., Watts, B., & Hecker, M. T. (2022). Improving timeliness of antibiotic administration using a provider and pharmacist facing Sepsis Early Warning System in the emergency department setting: A randomized controlled quality improvement initiative. Critical Care Medicine, 50(3), 418–427. https://doi.org/10.1097/ccm.0000000000005267

Tomašev, N., Glorot, X., Rae, J. W., Zielinski, M., Askham, H., Saraiva, A., Mottram, A., Meyer, C., Ravuri, S., Protsyuk, I., Connell, A., Hughes, C. O., Karthikesalingam, A., Cornebise, J., Montgomery, H., Rees, G., Laing, C., Baker, C. R., Peterson, K., … Mohamed, S. (2019). A clinically applicable approach to continuous prediction of future acute kidney injury. Nature, 572(7767), 116–119. https://doi.org/10.1038/s41586-019-1390-1

U.S. Centers for Disease Control and Prevention. (2022, April 27). COVID Data Tracker. https://covid.cdc.gov/covid-data-tracker/#trends_dailycases

U.S. Food and Drug Administration. (2018a, April 11). FDA permits marketing of artificial intelligence-based device to detect certain diabetes-related eye problems. https://www.fda.gov/news-events/press-announcements/fda-permits-marketing-artificial-intelligence-based-device-detect-certain-diabetes-related-eye

U.S. Food and Drug Administration. (2018b, December 4). Software as a medical device (SaMD). https://www.fda.gov/medical-devices/digital-health-center-excellence/software-medical-device-samd

U.S. Food and Drug Administration. (2021, October 27). Good machine learning practice for medical device development: Guiding principles. https://www.fda.gov/medical-devices/software-medical-device-samd/good-machine-learning-practice-medical-device-development-guiding-principles

U.S. Food and Drug Administration. (2022a, September 28). Clinical decision support software guidance for industry and food and drug administration staff. https://www.fda.gov/media/109618/download

U.S. Food and Drug Administration. (2022b, November 1). Summary minutes Center for Devices and Radiological Health Medical Devices Advisory Committee Anesthesiology Devices Panel. https://www.fda.gov/media/164076/download

Vargesson, N. (2015). Thalidomide‐induced teratogenesis: History and mechanisms. Birth Defects Research Part C: Embryo Today: Reviews, 105(2), 140–156. https://doi.org/10.1002/bdrc.21096

Vickers, A. J., & Elkin, E. B. (2006). Decision curve analysis: A novel method for evaluating prediction models. Medical Decision Making, 26(6), 565–574. https://doi.org/10.1177/0272989x06295361

Vollmer, S., Mateen, B. A., Bohner, G., Király, F. J., Ghani, R., Jonsson, P., Cumbers, S., Jonas, A., McAllister, K. S. L., Myles, P., Granger, D., Birse, M., Branson, R., Moons, K. G. M., Collins, G. S., Ioannidis, J. P. A., Holmes, C., & Hemingway, H. (2020). Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ, 368, l6927. https://doi.org/10.1136/bmj.l6927

Waller, L., & Levi, T. (2021). Building intuition regarding the statistical behavior of mass medical testing programs. Harvard Data Science Review, Special Issue 1. https://doi.org/10.1162/99608f92.19de8159

Wong, A., Otles, E., Donnelly, J. P., Krumm, A., McCullough, J., DeTroyer-Cooley, O., Pestrue, J., Phillips, M., Konye, J., Penoza, C., Ghous, M., & Singh, K. (2021). External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Internal Medicine, 181(8), 1065–1070. https://doi.org/10.1001/jamainternmed.2021.2626

Wu, E., Wu, K., Daneshjou, R., Ouyang, D., Ho, D. E., & Zou, J. (2021). How medical AI devices are evaluated: Limitations and recommendations from an analysis of FDA approvals. Nature Medicine, 27(4), 582–584. https://doi.org/10.1038/s41591-021-01312-x

Wynants, L., Calster, B. V., Collins, G. S., Riley, R. D., Heinze, G., Schuit, E., Bonten, M. M. J., Dahly, D. L., Damen, J. A., Debray, T. P. A., Jong, V. M. T. de, Vos, M. D., Dhiman, P., Haller, M. C., Harhay, M. O., Henckaerts, L., Heus, P., Kammer, M., Kreuzberger, N., … Smeden, M. van. (2020). Prediction models for diagnosis and prognosis of covid-19: Systematic review and critical appraisal. The BMJ, 369, m1328. https://doi.org/10.1136/bmj.m1328

Zhang, A., Xing, L., Zou, J., & Wu, J. C. (2022). Shifting machine learning for healthcare from development to deployment and from models to data. Nature Biomedical Engineering, 6(12), 1330–1345. https://doi.org/10.1038/s41551-022-00898-y


©2023 Stephen Ruberg, Sandeep Menon, and Charmaine Demanuele. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
