
Improving Atrial Fibrillation Detection Using a Shared Latent Space for ECG and PPG Signals

Published on Jan 30, 2025

Abstract

Atrial fibrillation (AF) is the most common type of cardiac arrhythmia. It is associated with an increased risk of stroke, heart failure, and other cardiovascular complications, but can be clinically silent. The wide availability of consumer wearables has made continuous AF monitoring possible thanks to built-in electrocardiogram (ECG) and photoplethysmography (PPG) sensors. ECG is the gold standard for AF diagnosis, but it cannot be passively collected on consumer devices due to its requirement that the subject is stationary, making it impractical for continuous monitoring. On the other hand, PPG offers the possibility for passive monitoring, but it is susceptible to human movements and other environmental factors, posing a significant challenge for AF detection due to noise and artifacts. Ideally, we would get the best of both worlds: the accuracy of AF detection with ECG and the broad applicability of PPG. We propose a new approach, SiamAF, to bridge the gap between ECG and PPG signals that leverages ECG and PPG signal pairs available from wearables and hospital monitors. Our method helps elevate the predictive power of the (broadly applicable but noisy) PPG signals to a level comparable to that of (less applicable but more accurate) ECG. Specifically, we train a deep neural network to learn similar information from both ECG and PPG in a contrastive fashion, which encourages it to learn medically relevant features in both ECG and PPG signals. At inference time, the proposed model is able to accurately predict AF from either PPG or ECG and outperforms baseline methods on three external test sets.

Keywords: atrial fibrillation, PPG, ECG, Siamese network, contrastive learning, joint training


Media Summary

Atrial fibrillation (AF) is a common heart rhythm disorder, often without noticeable symptoms. Continuous and passive monitoring of AF is now possible due to the growing popularity of consumer wearable devices with built-in electrocardiogram (ECG) and photoplethysmography (PPG) sensors. Clinically, ECG is the gold standard for diagnosing AF; however, collecting ECG requires the patient to remain stationary, rendering it unsuitable for daily monitoring. On the other hand, PPG sensors can be used for passive monitoring during normal daily activities, but PPG is easily corrupted by movements and environmental factors such as ambient light in the room. The noise in the PPG signals poses significant challenges to AF detection. Ideally, we would get the best of both worlds: the accuracy of AF detection with ECG and the broad applicability of PPG. We propose a new approach, SiamAF, to bridge the gap between ECG and PPG signals that leverages ECG and PPG signal pairs available from wearables and hospital monitors. By encouraging our deep learning model to extract similar information from ECG and PPG signals, the model focuses more on medically relevant features in both ECG and PPG signals. Our method thus helps elevate the predictive power of the (broadly applicable but noisy) PPG signals to a level comparable to that of (less applicable but more accurate) ECG. Through external validations on data across several hospitals and wearable devices, we demonstrate that SiamAF outperforms other methods, offering robust and accurate AF detection.


1. Introduction

Atrial fibrillation (AF) is the most prevalent form of cardiac arrhythmia, affecting approximately 1–2% of the general population and up to 9% of individuals aged 65 years or older (Colilla et al., 2013; Salih et al., 2021). AF contributed to over 183,000 deaths in the United States alone in 2019 (Centers for Disease Control and Prevention, National Center for Health Statistics, n.d.) and its prevalence is projected to continuously increase in the next 30 years (Lippi et al., 2021). Paroxysmal AF episodes often present few (or no) symptoms that patients notice immediately, yet they are crucial precursors to more serious health conditions, including ischemic stroke and congestive heart failure. Therefore, the early detection of AF episodes has become imperative in the treatment and prevention of cardiovascular disease.

In recent years, AF detection has been transformed by the growing popularity of wearable devices with photoplethysmography (PPG) sensors, such as smartwatches. Unlike ECG signal collection, which requires electrode contact, PPG signals are effortlessly gathered through photonic sensors, enabling uninterrupted monitoring during daily activities. These PPG-capable devices are widely available, providing a noninvasive means for continuous heart rhythm monitoring.

While commercially available smartwatches have expanded the scope of passive monitoring for potential AF, the electrocardiogram (ECG) remains the clinical gold standard for diagnosing atrial fibrillation due to its superior accuracy and the wealth of diagnostic information compared to PPG. Specifically, PPG signals collected from smartwatches are not reliable enough to form AF diagnoses. This is primarily due to their susceptibility to environmental noise and human movements, which poses challenges for automated downstream AF detectors (Guo et al., 2021; Sañudo et al., 2019). Patients often still require hospital visits and undergo stationary ECG monitoring to be diagnosed with AF. This reliance on traditional methods creates a bottleneck in the AF diagnostic process due to the human effort involved. As a result, it is imperative to enhance PPG AF detection performance by leveraging the medically relevant information in PPG to match what we see in ECG signals.

While there has been much work using machine learning to improve AF detection from PPG (Corino et al., 2017; Eerikainen et al., 2020; Kiranyaz et al., 2016; Krivoshei et al., 2017; Kumar et al., 2018; Martis et al., 2014; Poh et al., 2018; Shan et al., 2016; Shashikumar et al., 2017; Tang et al., 2017; Torres-Soto & Ashley, 2020; Tutuko et al., 2021; Voisin et al., 2019; J. Wang, 2020), there is much room for improvement. It is possible, for instance, that subtle information in PPG becomes more apparent in a simultaneous ECG; as discussed above, ECG is much easier to learn from (but cannot be collected nearly as easily). Past work used signals from PPG and labels from ECG, since only PPG is available at test time. However, this setup may be inefficient: it may require an enormous amount of PPG data to learn the more subtle signals in PPG that are more visible in simultaneous ECG.

In this work, we hypothesize that important information from PPG can be learned more efficiently if we leverage ECG during training—even if it is not available at test time. That is, by learning a shared latent space between PPG and ECG signals and encouraging the model to learn similar features between ECG and PPG, relevant information that would be much harder to learn from PPG alone would become easier to identify if it were ‘mapped’ to ECG in the latent space. By this logic, it is possible that test performance from PPG signals could improve, even so far as if we had trained the model on (clean) ECG rather than (noisy) PPG. This is precisely the goal of the present work.

In addition, from a model-learning perspective, training with both ECG and PPG lets each modality act as a regularizer for the other against potential mislabeling and poor signal quality in the data set.

Analogously, this same idea could be used to improve ECG analysis by using information in PPG more effectively, so that even ECG test performance could improve using the shared latent space. In fact, for both PPG-only test sets and ECG-only test sets, our experiments show improved performance over other approaches.

This study has the following contributions:

  • To our knowledge, this is the first study leveraging shared information between ECG and PPG signals for AF detection. Our novel deep learning architecture and loss function design encourage the model to learn shared information and improve prediction robustness.

  • The proposed method outperforms baseline methods, including deep mutual learning (Zhang et al., 2018) and previous CNN-based ECG or PPG single modality AF detection networks (Torres-Soto & Ashley, 2020) on three external test sets containing diverse patient conditions and recording hardware.

  • We investigate the information learned by the model and verify the effectiveness of our proposed method through dimension reduction and visualizations of latent features.

  • Rather than learning two separate models for the ECG and PPG modalities, the proposed method learns a single model that can be used for AF detection on either ECG or PPG signals with no performance sacrifice for either modality.

2. Related Works

We consider related work in AF detection with hand-crafted features, deep learning for AF detection, and Siamese networks.

Hand-Crafted Features for AF Detection. Multiple past works have developed hand-crafted features of the ECG or PPG input signals for training machine learning–based AF classifiers. These include the root mean square of the successive difference (RMSSD) of peak-to-peak intervals, Shannon entropy (ShE), spectral analysis, dynamic time warping for waveform shape analysis, and template matching, as well as other statistical features such as mean, variance, standard deviation, skewness, and kurtosis of input signals (Corino et al., 2017; Eerikainen et al., 2020; Krivoshei et al., 2017; Kumar et al., 2018; Martis et al., 2014; Shan et al., 2016; Tang et al., 2017). These features are fed into standard machine learning classification algorithms.

Most hand-crafted features rely on accurate peak detection in the raw signals, which is often unreliable given poor signal quality. These challenges make the hand-crafted feature approach difficult to apply across different populations and demographics.

Deep Learning for AF Detection. Deep learning networks automatically learn and extract features from raw inputs. Deep convolutional neural networks (DCNN/CNN) have been popular as feature extractors for both images (Anwar et al., 2018) and time series due to their unique ability to preserve useful information. There have been several CNN-based AF detection algorithms (Kiranyaz et al., 2016; Poh et al., 2018; Shashikumar et al., 2017; Torres-Soto & Ashley, 2020; Tutuko et al., 2021; Voisin et al., 2019; J. Wang, 2020). Some researchers convert the 1-D signals into 2-D images using the short-term Fourier transform (STFT) and stationary wavelet transform (SWT) (Qayyum et al., 2018; Xia et al., 2018) for 2-D CNNs. Due to the time series nature of ECG and PPG signals, many researchers leverage recurrent neural networks (RNNs) for AF detection because of their ability to capture temporal relations and their flexibility with variable-length inputs. RNN variants, including the long short-term memory network (LSTM) and the recurrent-convolutional neural network, have also been used for AF detection (Andersen et al., 2019; C. Chen et al., 2020; Fan et al., 2018; Gotlibovych et al., 2018; Oh et al., 2018). While deep neural networks are more adaptable than machine learning methods based on hand-crafted features, they require significantly more labeled data to achieve performance equivalent to the best AF detectors. Deep learning models are prone to overfitting, especially on trivial, task-irrelevant features. This is often demonstrated through the sensitivity of deep neural nets to signals that are low quality and contain artifacts; we can expect worse performance of the trained models on these low-quality signals (Ding et al., 2024; Ying, 2019). Deep learning methods also suffer from a lack of interpretability. It is difficult, if not impossible, for humans to understand the discovered features and decision processes of deep neural networks; combined with the susceptibility to overfitting, this leads to an increased risk of undetected failure in real-world applications when operating conditions change or the model fails to generalize (Beede et al., 2020; Zech et al., 2018).

Two Modalities. Previous studies only trained their models using either ECG or PPG signals (Andersen et al., 2019; C. Chen et al., 2020; Corino et al., 2017; Eerikainen et al., 2020; Fan et al., 2018; Gotlibovych et al., 2018; Hannun et al., 2019; Kim & Pan, 2019; Kiranyaz et al., 2016; Krivoshei et al., 2017; Kumar et al., 2018; Martis et al., 2014; Oh et al., 2018; Poh et al., 2018; Pourbabaee et al., 2018; Qayyum et al., 2018; Sannino & De Pietro, 2018; Shan et al., 2016; Shashikumar et al., 2017; Tang et al., 2017; Tison et al., 2018; Torres-Soto & Ashley, 2020; Tutuko et al., 2021; Voisin et al., 2019; J. Wang, 2020; Xia et al., 2018), despite the availability of both modalities measured simultaneously on the same individuals. As we will show, both ECG and PPG signals carry crucial information for detecting AF; neither modality should be wasted during training.

Siamese Networks. The Siamese network architecture (models with two branches or subnetworks) is useful for dual input scenarios. In recent years, Siamese networks have seen a rise in popularity due to their usefulness for self-supervised learning (T. Chen et al., 2020; X. Chen & He, 2021; Grill et al., 2020). However, Siamese self-supervised networks are difficult to train and rely heavily on image augmentations.

3. Methods

Figure 1. The architecture of our proposed framework for learning from photoplethysmography (PPG) and electrocardiogram (ECG) and predicting (i.e., testing) on only either ECG or PPG. At training time, the model takes both ECG and PPG signals as inputs. In each training iteration, the two signal modalities take turns flowing through the network following each of the colored paths; that is, the PPG signal flows through the red path and the ECG signal flows through the purple path, then the ECG signal flows through the red path and the PPG signal flows through the purple path. In the configuration shown in the figure, the PPG and ECG signal inputs will go through the encoder $f_{\theta}$ and the projector $g_{\phi}$. The learned features of the PPG signal will pass through the predictor $q_{\psi}$ to map to the latent space of ECG features. We optimize an agreement loss between the predicted latent feature vector of the PPG signal and the projected latent feature vector of the ECG signal; we also optimize a supervised cross-entropy loss for the output of the classifier $h_{\omega}$ (which takes PPG latent features as inputs). For a detailed description of the training process, please see Section 3 and Algorithm 1. After the training is complete, we save only the encoder and the classifier for future predictions.

The training data is denoted as $D=\{(x^{ECG}_{i}, x^{PPG}_{i}, y_i)\}^{|D|}_{i=1}$. Our joint training framework takes a pair of time-synchronized ECG and PPG signals as input ($x^{ECG}_{i}$ and $x^{PPG}_{i}$), and a binary label of AF/non-AF ($y_i$) generated automatically from the hospital alarm. The auto-generated labels contain a certain amount of noise. Our proposed framework learns by simultaneously maximizing the agreement between the latent projections of the ECG-PPG signal pairs, and minimizing the misclassification error of the classifications from the ECG and PPG signals. Our framework is a Siamese network with five major components as shown in Figure 1; here we use terminology from the Bootstrap Your Own Latent (Grill et al., 2020) model:

  • A neural network encoder $f_{\theta}(\cdot)$, which takes both the ECG $x^{ECG}_{i}$ and PPG $x^{PPG}_{i}$ signals as inputs. The encoder extracts feature representations from the raw inputs. In theory, the architecture of the encoder can take any form, so future applications can choose a different architecture than the specific one used here. Here we use a 1-dimensional version of the ResNet-34 (He et al., 2016) architecture for the encoder due to its effectiveness for physiological signals.

  • A multilayer perceptron projector $g_{\phi}(\cdot)$ with one hidden layer. The projector takes the feature outputs from the encoder as input. The projector maps the extracted features from the encoder to a latent space where we maximize the agreement between the features of the ECG and PPG signals. The projected views are denoted as $z^{ECG}_i = g_{\phi}(f_{\theta}(x^{ECG}_{i}))$ for an ECG input and $z^{PPG}_i = g_{\phi}(f_{\theta}(x^{PPG}_{i}))$ for a PPG input. (The $g_{\phi}(\cdot)$ component is the same regardless of whether we translate ECG to PPG or vice versa.)

  • A linear layer predictor $q_{\psi}(\cdot)$. The predictor takes one of the projector’s outputs from the ECG-PPG pair, and further maps the projected views. The predictor attempts to predict the latent feature from the other branch using a different view as the input. We denote the predictor’s predictions as $q_{\psi}(z^{ECG}_i)$ for an ECG input and $q_{\psi}(z^{PPG}_i)$ for a PPG input. (Here $f_{\theta}(\cdot)$, $g_{\phi}(\cdot)$, and $q_{\psi}(\cdot)$ are the same regardless of whether we translate ECG to PPG or vice versa and regardless of whether we predict for ECG or PPG.)

  • A supervised-branch classification head $h_{\omega}(\cdot)$, which consists of one linear layer. It takes the output from the encoder $f_{\theta}(\cdot)$ as input, and produces logits $p^{ECG}_i = h_{\omega}(f_{\theta}(x^{ECG}_{i}))$ for an ECG input and $p^{PPG}_i = h_{\omega}(f_{\theta}(x^{PPG}_{i}))$ for a PPG input.

  • A loss function combining both the agreement and classification objectives. The agreement objective is to maximize the cosine similarity between the $(q_{\psi}(z^{ECG}_i), z^{PPG}_i)$ pair and between the $(q_{\psi}(z^{PPG}_i), z^{ECG}_i)$ pair. We formulate the agreement objective loss function as follows:

    $$\mathcal{L}_{\textit{agree}}(x^{PPG}_i, x^{ECG}_i) = - \frac{\left\langle q_{\psi}(z^{ECG}_i), z^{PPG}_i\right\rangle}{\left\|q_{\psi}(z^{ECG}_i)\right\|_{2} \cdot \left\|z^{PPG}_i\right\|_{2}} - \frac{\left\langle q_{\psi}(z^{PPG}_i), z^{ECG}_i\right\rangle}{\left\|q_{\psi}(z^{PPG}_i)\right\|_{2} \cdot \left\|z^{ECG}_i\right\|_{2}}. \quad (3.1)$$

    Here $\left\|\cdot\right\|_{2}$ is the $\ell_2$-norm and $\left\langle\cdot,\cdot\right\rangle$ is the inner product. To minimize the misclassification error, we employ the cross-entropy loss. Using a hyperparameter $\lambda$, the final loss function is formulated as a weighted combination of the agreement loss and the classification loss; it is defined as follows:

    $$\mathcal{L}_{\textit{joint}}(x^{PPG}_i, x^{ECG}_i, y_i) = \mathcal{L}_{\textit{agree}}(x^{PPG}_i, x^{ECG}_i) + \lambda \cdot \left(\mathcal{L}_{\textit{CE}}(p^{PPG}_i, y_i)+\mathcal{L}_{\textit{CE}}(p^{ECG}_i, y_i)\right), \quad (3.2)$$

    where $\mathcal{L}_{\textit{CE}}(p_i, y_i) = - \left( y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right)$ denotes the binary cross-entropy for either modality’s predicted probability $p_i$.
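To make the loss concrete, the snippet below is a minimal PyTorch sketch of Equations 3.1 and 3.2; the tensor names and the use of a cross-entropy-with-logits formulation are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def agreement_loss(q_ecg, z_ppg, q_ppg, z_ecg):
    # Symmetric negative cosine similarity between predicted and
    # projected latent features (Eq. 3.1), averaged over the batch.
    loss = (-F.cosine_similarity(q_ecg, z_ppg, dim=-1)
            - F.cosine_similarity(q_ppg, z_ecg, dim=-1))
    return loss.mean()

def joint_loss(q_ecg, z_ppg, q_ppg, z_ecg, p_ecg, p_ppg, y, lam=1.0):
    # Weighted sum of the agreement loss and the two cross-entropy terms
    # (Eq. 3.2). Here p_ecg / p_ppg are single-logit classifier outputs
    # and y is the binary AF label; BCE-with-logits is an assumption.
    ce = (F.binary_cross_entropy_with_logits(p_ppg, y.float())
          + F.binary_cross_entropy_with_logits(p_ecg, y.float()))
    return agreement_loss(q_ecg, z_ppg, q_ppg, z_ecg) + lam * ce
```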

In cases of incomplete ECG-PPG pairs (samples with only an ECG or only a PPG signal) in the training data, our framework can accommodate them by adding two additional loss terms, one for the additional PPG data and one for the additional ECG data:

$$\begin{aligned} \mathcal{L}_{\textit{joint}}(x^{PPG}_i, x^{ECG}_i, y_i) = {} & \mathcal{L}_{\textit{agree}}(x^{PPG}_i, x^{ECG}_i) + \lambda \cdot \left(\mathcal{L}_{\textit{CE}}(p^{PPG}_i, y_i)+\mathcal{L}_{\textit{CE}}(p^{ECG}_i, y_i)\right) \\ & + \lambda_{1} \cdot \left(\mathcal{L}_{\textit{CE}}(p^{PPG_{\textit{unpaired}}}_i, y_i)+\mathcal{L}_{\textit{CE}}(p^{ECG_{\textit{unpaired}}}_i, y_i)\right). \end{aligned} \quad (3.3)$$

In addition to the previously introduced model information, Algorithm 1 illustrates the training algorithm of the proposed framework. After the training is complete, we discard the projector $g_{\phi}$ and predictor $q_{\psi}$ and keep the learned encoder $f_{\theta}$ and the classification head $h_{\omega}$ for future predictions. At test time, the learned network can work with either ECG or PPG signals without requiring an ECG-PPG pair.

Algorithm 1. Joint training algorithm.

input: training data $D=\{(x^{ECG}_{i}, x^{PPG}_{i}, y_i)\}^{|D|}_{i=1}$; networks $f_{\theta}$, $g_{\phi}$, $q_{\psi}$, $h_{\omega}$; hyperparameter $\lambda$

  1. for each minibatch sampled by index from $D$ do

  2. // pass ECG signal

  3. $z^{ECG}_i = g_{\phi}(f_{\theta}(x^{ECG}_{i}))$ // ECG latent projection

  4. $q_{\psi}(z^{ECG}_i)$ // ECG latent prediction

  5. $p^{ECG}_i = h_{\omega}(f_{\theta}(x^{ECG}_{i}))$ // ECG classification logits

  6. // pass PPG signal

  7. $z^{PPG}_i = g_{\phi}(f_{\theta}(x^{PPG}_{i}))$ // PPG latent projection

  8. $q_{\psi}(z^{PPG}_i)$ // PPG latent prediction

  9. $p^{PPG}_i = h_{\omega}(f_{\theta}(x^{PPG}_{i}))$ // PPG classification logits

  10. Update $f_{\theta}, g_{\phi}, q_{\psi}, h_{\omega}$ to minimize $\mathcal{L}_{\textit{joint}}(x^{PPG}_i, x^{ECG}_i, y_i)$

  11. end

  12. return $f_{\theta}, h_{\omega}$
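For readers who prefer code, a condensed PyTorch-style rendering of Algorithm 1 is sketched below; the data loader, module names, and the `joint_loss` helper (from the sketch in this section) are assumptions for illustration, not the released implementation.

```python
def train_one_epoch(loader, encoder, projector, predictor, classifier,
                    optimizer, lam=1.0):
    # One pass over the paired training data, following Algorithm 1.
    for x_ecg, x_ppg, y in loader:
        feats_ecg = encoder(x_ecg)             # shared encoder, ECG branch
        z_ecg = projector(feats_ecg)           # ECG latent projection
        q_ecg = predictor(z_ecg)               # ECG latent prediction
        p_ecg = classifier(feats_ecg)          # ECG classification logits

        feats_ppg = encoder(x_ppg)             # shared encoder, PPG branch
        z_ppg = projector(feats_ppg)           # PPG latent projection
        q_ppg = predictor(z_ppg)               # PPG latent prediction
        p_ppg = classifier(feats_ppg)          # PPG classification logits

        loss = joint_loss(q_ecg, z_ppg, q_ppg, z_ecg, p_ecg, p_ppg, y, lam)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Only the encoder and classifier are kept for inference.
    return encoder, classifier
```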

4. Data Sets

We used a large-scale data set from Institution A (University of California San Francisco medical center) for training and three additional data sets for evaluation, including a data set from Institution B (University of California Los Angeles medical center), as well as the Stanford data set test split (Torres-Soto & Ashley, 2020), and the Simband data set (Shashikumar et al., 2017).

4.1. Institution A Data Set

The Institution A (University of California San Francisco Medical Center) data set contains 28,539 patients in hospital settings; the patients’ continuous ECG and PPG signals were recorded from bedside monitors. The bedside monitor produced alarms for events including atrial fibrillation (AF), premature ventricular contraction (PVC), tachycardia, couplet (consecutive abnormal heartbeats), and so on. This study focuses on AF, PVC, and normal sinus rhythm (NSR). The samples with PVC and NSR labels were combined into the Non-AF samples group, thus forming the AF vs. Non-AF binary classification task. The continuous ECG and PPG signals were sliced into 30-second nonoverlapping segments sampled at 240 Hz (each signal strip contains 7,200 timesteps). The 30-second segments were then downsampled to 2,400 timesteps (80-Hz sampling rate). During the preprocessing step, invalid samples (e.g., empty signal files, missing ECG or PPG signals) were also filtered out. There are four ECG channels in this data set; we used the first ECG channel for our study due to its resemblance to wearable device outputs. The data set was split into the train and validation splits by patient IDs. The train split of the Institution A data set contains 13,432 patients, 2,757,888 AF signal segments, and 3,014,334 Non-AF signal segments; the validation split contains 6,616 patients, 1,280,775 AF segments, and 1,505,119 Non-AF segments. Due to the automatic nature of bedside monitor–generated labels, the data set likely contains label noise. For a detailed description of the alarm labeling process and the additional preprocessing steps, please see our parallel work (Ding et al., 2024). For detailed demographic information of patients who participated in this study, please refer to Table A6. The data collection process followed strict guidelines and the study was approved by the Institutional Review Board (IRB number: 14-13262). A waiver of informed patient consent was granted for this study, as the investigation solely involved the analysis of de-identified patient data.
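As an illustration of this slicing and downsampling step, a minimal sketch is shown below; the original pipeline’s filtering and validity checks follow Ding et al. (2024), so the SciPy-based resampling and the simple validity test here are assumptions.

```python
import numpy as np
from scipy.signal import resample_poly

FS_IN, FS_OUT, SEG_SEC = 240, 80, 30           # original rate, target rate, segment length

def slice_and_downsample(signal):
    # Cut a continuous 240-Hz waveform into nonoverlapping 30-s segments
    # (7,200 samples each) and downsample each to 2,400 timesteps (80 Hz).
    seg_len = FS_IN * SEG_SEC
    n_segments = len(signal) // seg_len
    segments = []
    for k in range(n_segments):
        seg = signal[k * seg_len:(k + 1) * seg_len]
        if np.all(np.isfinite(seg)) and seg.std() > 0:     # drop empty/invalid strips
            segments.append(resample_poly(seg, up=FS_OUT, down=FS_IN))
    return np.stack(segments) if segments else np.empty((0, FS_OUT * SEG_SEC))
```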

4.2. Institution B Data Set (Test Set 1)

The Institution B (University of California Los Angeles Medical Center) data set contains 126 patients in hospital settings; simultaneous continuous ECG and PPG signals were collected at Institution B. The patients ranged in age from 18 to 95 years and were admitted from April 2010 to March 2013. The continuous signals were sliced into 30-second nonoverlapping segments and again downsampled to an 80-Hz sampling rate with 2,400 timesteps in each signal. The data set contains 38,910 AF and 220,740 Non-AF segments. Board-certified cardiac electrophysiologists annotated all AF episodes in the Institution B data set. Here the PPG signals are collected from the fingertips. The use of this data set for the study was approved by the IRB (IRB approval number: 10-000545). A waiver of informed patient consent was obtained for this study.

4.3. Simband Data Set (Test Set 2)

The Simband data set (Shashikumar et al., 2017) contains 98 patients in ambulatory settings from Emory University Hospital (EUH), Emory University Hospital Midtown (EUHM), and Grady Memorial Hospital (GMH). The patients ranged in age from 18 to 89 years and were admitted from October 2015 to March 2016. The ECG signals were collected using Holter monitors, and the PPG signals were collected from a wrist-worn Samsung Simband. The signals used for testing were 30-second segments with 2,400 timesteps after preprocessing. This data set contains 348 AF segments and 506 Non-AF segments. The signals in this data set were reviewed and annotated by an ECG technician, a physician study coordinator, and a cardiologist. The data were collected with patient consent and IRB approval (IRB approval number: 00084629).

4.4. Stanford Data Set (Test Set 3)

The Stanford data set (Torres-Soto & Ashley, 2020) contains 107 AF patients, 15 paroxysmal AF patients, and 41 healthy patients. The 41 healthy patients also underwent an exercise stress test. All signals in this data set were recorded in ambulatory settings. The ECG signals were collected from an ECG reference device, and the PPG signals were collected from a wrist-worn device. The signals were sliced into 25-second segments by the original authors. In this study, we extended each signal to 30 seconds by appending its first 5 seconds to its end and resampled the signals to 2,400 timesteps. The data set contains 52,911 AF segments and 80,620 Non-AF segments. In the evaluations, we use the test split generated by the authors of the Stanford data set. The PPG signals in this data set were manually annotated and reviewed by several cardiologists following computerized reference ECG signals.
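A minimal sketch of this extension-and-resampling step is shown below, under the assumption that each 25-second strip is a 1-D NumPy array with a known integer sampling rate.

```python
import numpy as np
from scipy.signal import resample

def extend_to_30s(seg_25s, fs):
    # Append the first 5 seconds of the segment to its end (25 s -> 30 s),
    # then resample the result to the 2,400 timesteps used elsewhere.
    seg_30s = np.concatenate([seg_25s, seg_25s[: 5 * fs]])
    return resample(seg_30s, 2400)
```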

5. Experiments and Evaluation Results

In this section, we examine how SiamAF performs compared to baseline methods (Section 5.1). We also investigate the learned latent space and empirically verify that the model learns the medically relevant information shared between ECG and PPG signals (Section 5.2).

5.1. Performance Comparison With Baseline Methods

We verify our framework’s performance advantage by comparing it to three baseline methods: (1a) a ResNet-34 (He et al., 2016) classifier trained on only PPG data; (1b) a ResNet-34 (He et al., 2016) classifier trained on only ECG data; (2) a deep mutual learning (Zhang et al., 2018) (DML for short) network using ResNet-34 as encoders; (3) the public Stanford DeepBeat model (Torres-Soto & Ashley, 2020), which was trained on the Stanford data set. The ResNet-34 baselines represent the performance of previous deep learning AF detection studies, where convolutional neural networks appear as a popular choice (Kiranyaz et al., 2016; Poh et al., 2018; Shashikumar et al., 2017; Tutuko et al., 2021; Voisin et al., 2019; J. Wang, 2020). The DML baseline is known for its ability to extract shared information from dual inputs through ‘mutual learning,’ which could be suitable for our intended use. The performance of the models is evaluated on the three external data sets we introduced in Section 4. In this work, we do not compare against hand-crafted feature-based AF classifiers due to their lackluster performance compared to deep learning methods, demonstrated in previous works (Torres-Soto & Ashley, 2020). We trained two ResNet-34 baselines for the ECG and PPG signals. The DML baseline contains two separate models for the two signal modalities. The Stanford DeepBeat baseline only operates on PPG signals. We use both AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) as evaluation metrics. We also conducted bootstrapping tests with 1,000 bootstrap samples for statistical significance comparisons.
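For reference, both metrics can be computed from per-segment labels and model scores with scikit-learn, as in the short sketch below (a usage illustration, not the exact evaluation code).

```python
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(y_true, y_score):
    # y_true: binary AF / Non-AF labels; y_score: predicted AF probabilities.
    return {
        "AUROC": roc_auc_score(y_true, y_score),
        "AUPRC": average_precision_score(y_true, y_score),
    }
```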

The evaluation results are shown in Figure 2. In the PPG test sets, the proposed method SiamAF performs comparably or substantially better than the baseline methods; in the ECG test sets, our method outperforms the baseline methods by a significant margin. For detailed values and the description of the bootstrapping calculation, please refer to Section A.1, Tables A7 and A8 in the Appendix.

Figure 2. The performance comparisons of all models on the test sets. The Stanford data set does not contain electrocardiogram (ECG) signals and thus cannot be used for testing ECG models. All models evaluated in this experiment were trained on the Institution A data set, except for the DeepBeat model, which was trained on the Stanford data set. In both plots, the horizontal axis is the AUPRC (area under the precision-recall curve) score, and the vertical axis is the AUROC (area under the receiver operating characteristic curve) score. The closer to the top right corner, the better. Here, confidence intervals are generally smaller than the dot sizes, so we provide a zoom-in of one of the dots to demonstrate the 95% CI. For exact CIs and statistical significance test results, please refer to Section A.4 and Section A.2.

5.2. Exploration on Learned Representations and Interpretability

ECG and PPG signals contain similar information, including RR (heartbeat) intervals and peaks, which human experts use as important diagnostic indicators for AF. We designed our model based on these facts, encouraging it to learn and exploit the shared information between the two modalities for AF detection. Ideally, our model should behave like human experts, utilizing this shared information (e.g., RR intervals) for the AF detection task. To verify our design approach, we explore the learned representation of the encoder in our proposed framework through dimension reduction and visualization of the activations from layers in the network. For visualization purposes, we randomly subsampled 1% of the validation set and passed the samples through the learned ResNet-34 encoder in our proposed method.

Medically, signal peaks and peak intervals from the PPG and ECG signals are crucial information for AF detection. Naturally, in a robust model, peak information should likewise contribute substantially toward AF classification. For nonnoisy data, the peaks are time-synchronized between PPG and ECG signals. By using only the peaks from the PPG signal, we can capture essential information for AF detection that is comparable to the information obtained from using the entire PPG signal. Likewise, the model should behave similarly whether we retain only the peaks of the ECG signal or the full ECG signal. Finally, the PPG and ECG signals contain similar information to each other, which means that, for AF detection, they likely will map to the same neighborhood in the latent space.

We can investigate the expected behaviors through visualization of the latent space. To do so, we extract peak information from the PPG signals using the HeartPy toolkit (van Gent et al., 2019a, 2019b) and the R peaks in the ECG signals with NeuroKit2 (Makowski et al., 2021). We create new peak-only PPG signals that are set to 0 everywhere except at the peaks, and we do the same for ECG, preserving only the R peaks. Figure 3 shows an example of the peak detection and the peak-only signals. These peak-only signals allow us to feed the model only peak information and remove other waveform information. Thus we aim to explore two hypotheses: (1) whether the peak-only PPG signals map to the same location in latent space as full PPG signals (and similarly for ECG), and (2) whether the PPG and ECG signals from the same time period of the same patient map to nearby locations in latent space.
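The construction of these peak-only signals can be sketched as below using the HeartPy and NeuroKit2 calls named above; the specific peak-detection settings, and the choice of keeping the original amplitude at each peak rather than a binary marker, are assumptions.

```python
import numpy as np
import heartpy as hp
import neurokit2 as nk

FS = 80  # sampling rate of the preprocessed 30-second segments

def peak_only_ppg(ppg):
    # Detect PPG peaks with HeartPy and zero out everything else.
    working_data, _ = hp.process(ppg, sample_rate=FS)
    out = np.zeros_like(ppg)
    idx = np.asarray(working_data["peaklist"], dtype=int)
    out[idx] = ppg[idx]
    return out

def peak_only_ecg(ecg):
    # Detect R peaks with NeuroKit2 and zero out everything else.
    _, info = nk.ecg_peaks(ecg, sampling_rate=FS)
    out = np.zeros_like(ecg)
    idx = np.asarray(info["ECG_R_Peaks"], dtype=int)
    out[idx] = ecg[idx]
    return out
```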

Figure 3. An example of the peak detection results on a pair of time-synchronized (simultaneous) 30-second electrocardiogram (ECG) and photoplethysmography (PPG) signals and the corresponding encoded peak sequences. Here, the red dots indicate the detected peaks.

We use PaCMAP (Y. Wang et al., 2021) to visualize the feature outputs of each pooling layer in SiamAF, shown in Figure 4. We also visualize the feature outputs of the ECG-only, PPG-only, and DML baseline models in Figure 5. At the later stages of the learned encoder in our proposed framework, both the PPG and ECG peak-only-signal features (labeled as “PPG Peaks” and “ECG R Peaks”) overlap almost perfectly with the original PPG and ECG-signal features. This lends support to hypotheses (1) and (2) described in the last paragraph. In contrast, Figure 5 shows that for the ResNet-34 and DML approaches, the encoded peak-only features never overlap with the PPG and ECG features, even at the final stages of the models.
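Producing these two-dimensional embeddings from the collected layer activations takes only a few lines with the pacmap package; the sketch below assumes default PaCMAP hyperparameters and activations already gathered into a NumPy array.

```python
import numpy as np
import pacmap

def embed_2d(activations):
    # Reduce high-dimensional pooling-layer activations to 2-D for plotting.
    reducer = pacmap.PaCMAP(n_components=2)
    return reducer.fit_transform(np.asarray(activations, dtype=np.float32))
```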

Figure 4. Visualizations of latent features after each pooling layer of the ResNet-34 encoder used in our proposed framework; panels (a), (b), (c), (d), and (e) each represent one of the stages. In panel (e), we also visualize the separation between AF (atrial fibrillation) and Non-AF classes in the same latent space. At the later stages of the learned encoder in our proposed framework, both the photoplethysmography (PPG) and electrocardiogram (ECG) peak-only-signal features overlap almost perfectly with the original PPG and ECG-signal features.

Figure 5. Visualizations of latent features at the final average pooling layer of the encoders in the baseline models. Here, the encoded peak-only features never overlap with the photoplethysmography (PPG) and electrocardiogram (ECG) features.

5.3. Hyperparameter Selection

During training for all experiments, we used 30 training epochs; the ResNet-34 baseline models used Adam as the optimizer with a learning rate of 0.0001; the proposed method was trained using the stochastic gradient descent optimizer, with the learning rate set to 0.1 and momentum set to 0.9. For the DML baseline, the weight of the Kullback–Leibler-divergence loss was set to 0.9. For our proposed method, the hyperparameter $\lambda$ was set to 1.0 by default; we also tested the effect of $\lambda$’s value and found test performance insensitive to it within the range of 0.2 to 1.2.

6. Discussion

This article provides a new framework to learn from simultaneous ECG and PPG signals for AF detection. The framework has several important benefits over other approaches: (1) it is more accurate for detecting AF; (2) its latent space reflects that it uses medically relevant information for AF detection (namely, the signal peaks); (3) it can detect AF from either ECG or PPG modalities using a single model rather than using separate models for ECG and PPG.

We make some observations beyond our direct results. First, it seems that learning from simultaneous PPG and ECG is better than learning from them asynchronously (or from either one alone). This can be seen because our method and the only baseline that uses simultaneous PPG/ECG (namely, DML) perform better than the other methods. The cleaner ECG signals could provide some regularization to the model in cases where PPG signals are noisy or conflicting with the auto-generated labels due to signal noise.

Second, there is a large gap in performance between DML and our proposed approach. Since they both use simultaneous PPG/ECG, this gap is likely due to our proposed agreement loss, which directly encourages the encoder to learn coherent latent features between the two signal modalities. On the other hand, the DML setup relies on minimizing the KL divergence between the two prediction probability distributions, which is a weaker target that does not incentivize the encoders to focus on shared characteristics that exist in both the ECG and PPG signals. In fact, the two encoders in the two branches of DML could be vastly different and focus on less important features.

Third, the DeepBeat baseline performed worse on both Simband and Institution B data sets compared to other baseline methods; this indicates the poor generalizability of the DeepBeat model to fingertip-recorded PPG signals in the Institution B data set and out-of-distribution data in the Simband data set. That is, DeepBeat does not seem to generalize beyond the type of data it was trained on, suggesting that it is not focusing on the correct aspects of the signal for generalization. It was trained on the Stanford training set, and performs well on the Stanford test set, but not on data sets collected using different hardware and under different conditions. In contrast, our proposed method is trained on data collected in a hospital, yet attained comparable performance to DeepBeat on the Stanford test set, which demonstrates our strong generalizability to ambulatory data.

Fourth, we consider the question of how much nonpeak information might be valuable for AF detection. As we saw, other methods seem to be using information other than peaks to detect AF. And, as we also saw, that information does not seem to generalize across modalities (PPG/ECG) and across data collection modes (ambulatory vs. hospital setting). By experimenting with peak-only signals, it appears that, at least in our approach, it is largely the peak information that is used; the model determined this was the important shared information between ECG and PPG signals to retain. By distilling and leveraging the shared information between the usually clean ECG signals and the PPG signals, the model reduces the effect of other morphological artifacts that appear in PPG signals.

Lastly, we discuss the value of and need for labels compared to the value of the agreement loss. Large labeled data sets are expensive to obtain and often noisy, particularly for medical applications. Our approach performed as well as or better than some of the baselines even when using only 1% of the training labels. We show the AUROC and AUPRC scores of SiamAF trained with 1% of the labels in Appendix Tables A7 and A8. Compared to a model relying fully on supervised learning with noisy training labels, our proposed framework performed significantly better. These results suggest that our learning regime greatly reduces the reliance on unreliable and noisy labels for learning supervision; the unsupervised portion of the training loss function (the agreement loss) provides a large portion of the meaningful gradients for parameter updates and steers the learning algorithm toward the correct information.

7. Limitations and Future Works

There are several limitations to the data set used in this study. First, while our framework and model are designed for broader application scenarios, including both healthy individuals and hospitalized patients, the training set included only patients in hospital settings. This homogeneous population may limit model performance and generalizability. Although the model has demonstrated robustness and generalizability across various hardware platforms and patient demographics from multiple institutions, it may still retain unknown biases derived from hospital-specific data.

Second, the data collection and labeling process relied on bedside monitors, which can produce false alarms and noisy labels. Such label noise could introduce the inherent biases of the monitor algorithm into the model. Additionally, label noise may disproportionately affect ECG data compared to PPG. Future work should explore robust learning algorithms to address these challenges and improve model performance.

Lastly, we recognize the potential biases introduced during preprocessing. The root causes of missing data remain unclear, but the missingness may disproportionately impact certain patients or subpopulations due to their health conditions or behavioral patterns. This could further introduce bias into the model. Future studies should examine the patterns of missing data and their effects on different subpopulations to mitigate this limitation.

In the experiment in Section 5.2, we showed that the projection of the peak-only signals highly overlaps with ECG and PPG signal projections in the latent space. However, it is not a complete overlap, thus we hypothesize that the model still leverages other potential morphological features for AF classification. Future work should also focus on developing more interpretable networks, which can provide explicit explanations for information used for classification. We also recognize that the visualization approach is not a formal verification for the proposed hypothesis on our model’s learned knowledge, and any dimension reduction method could produce inaccurate mappings of the high-dimensional space. Future work should quantitatively evaluate the learned knowledge of the proposed model.

In future work, we may consider examining heart conditions other than AF. AF is a prominent heart condition, but many other tasks are worth investigating, and it is possible that our framework will assist with those tasks as well. Future studies should also expand the signal modalities beyond ECG and PPG. In this study, ECG and PPG could be framed as augmented measurements of the original heartbeat, augmented through the human body and the collection process. Another direction for future work is data augmentation. As usual, if we knew the transformations of the data to which AF detection is invariant, we could build those into our data, yielding more robust AF detection.

In this work, we mainly focus on improving the performance of AF detection from ECG and PPG signals and validate the model’s performance on several diverse data sets. However, to implement this model in a real-world setting, more factors must be considered, including inference time, energy efficiency of the model, privacy and hardware compatibility, and so on. Most of the aforementioned issues are being studied, and many solutions exist to address these problems, such as mixed precision computing, model quantization, and privacy-focused federated learning. Though beyond the scope of this study, future work could incorporate these considerations and focus on deploying our model in real-world clinical settings.

8. Conclusion

In this study, we proposed a new approach for AF detection: SiamAF. The proposed approach leverages a Siamese network architecture and a novel loss function, simultaneously learning from both ECG and PPG signal modalities. By learning common features between the two modalities (namely, peaks), our proposed model achieves superior AF detection performance compared to previous studies. SiamAF elevates the predictive power of the PPG signals to a level comparable to that of ECG without compromising model usability and improves PPG AF detection robustness even on noisy wearable data. Our findings offer a promising avenue for improving AF detection and potentially benefit both diagnosed AF patients and those unaware of their condition.


Disclosure Statement

The present work is partially supported by NIH grant R01HL166233 and NSF grant IIS-2130250.


References

Andersen, R. S., Peimankar, A., & Puthusserypady, S. (2019). A deep learning approach for real-time detection of atrial fibrillation. Expert Systems with Applications, 115, 465–473. https://doi.org/10.1016/j.eswa.2018.08.011

Anwar, S. M., Majid, M., Qayyum, A., Awais, M., Alnowami, M., & Khan, M. K. (2018). Medical image analysis using convolutional neural networks: A review. Journal of Medical Systems, 42(11), Article 226. https://doi.org/10.1007/s10916-018-1088-1

Beede, E., Baylor, E., Hersch, F., Iurchenko, A., Wilcox, L., Ruamviboonsuk, P., & Vardoulakis, L. M. (2020). A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Paper 589). Association for Computing Machinery. https://doi.org/10.1145/3313831.3376718

Centers for Disease Control and Prevention, National Center for Health Statistics. (n.d.). About multiple cause of death, 1999–2019. CDC WONDER Online Database website. Retrieved February 1, 2021, from https://wonder.cdc.gov/mcd-icd10.html

Chen, C., Hua, Z., Zhang, R., Liu, G., & Wen, W. (2020). Automated arrhythmia classification based on a combination network of CNN and LSTM. Biomedical Signal Processing and Control, 57, Article 101819. https://doi.org/10.1016/j.bspc.2019.101819

Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In H. Daumé III, & A. Singh (Eds.), Proceedings of the 37th International Conference on Machine Learning (pp. 1597–1607). Proceedings of Machine Learning Research. https://doi.org/10.5555/3524938.3525087

Chen, X., & He, K. (2021). Exploring simple Siamese representation learning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15745–15753. IEEE. https://doi.org/10.1109/CVPR46437.2021.01549

Colilla, S., Crow, A., Petkun, W., Singer, D. E., Simon, T., & Liu, X. (2013). Estimates of current and future incidence and prevalence of atrial fibrillation in the U.S. adult population. American Journal of Cardiology, 112(8), 1142–1147. https://doi.org/10.1016/j.amjcard.2013.05.063

Corino, V. D. A., Laureanti, R., Ferranti, L., Scarpini, G., Lombardi, F., & Mainardi, L. T. (2017). Detection of atrial fibrillation episodes using a wristband device. Physiological Measurement, 38(5), Article 787. https://doi.org/10.1088/1361-6579/aa5dd7

DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics, 44(3), 837–845. https://doi.org/10.2307/2531595

Ding, C., Guo, Z., Rudin, C., Xiao, R., Shah, A., Do, D. H., Lee, R. J., Clifford, G., Nahab, F. B., & Hu, X. (2024). Learning from alarms: A robust learning approach for accurate photoplethysmography-based atrial fibrillation detection using eight million samples labeled with imprecise arrhythmia alarms. IEEE Journal of Biomedical and Health Informatics, 28(5), 2650–2661. https://doi.org/10.1109/JBHI.2024.3360952

Eerikainen, L. M., Bonomi, A. G., Schipper, F., Dekker, L. R. C., de Morree, H. M., Vullings, R., & Aarts, R. M. (2020). Detecting atrial fibrillation and atrial flutter in daily life using photoplethysmography data. IEEE Journal of Biomedical and Health Informatics, 24(6), 1610–1618. https://doi.org/10.1109/JBHI.2019.2950574

Fan, X., Yao, Q., Cai, Y., Miao, F., Sun, F., & Li, Y. (2018). Multiscaled fusion of deep convolutional neural networks for screening atrial fibrillation from single lead short ECG recordings. IEEE Journal of Biomedical and Health Informatics, 22(6), 1744–1753. https://doi.org/10.1109/jbhi.2018.2858789

Gotlibovych, I., Crawford, S., Goyal, D., Liu, J., Kerem, Y., Benaron, D., Yilmaz, D., Marcus, G., & Li, Y. (2018). End-to-end deep learning from raw sensor data: atrial fibrillation detection using wearables. KDD 2018 Deep Learning Day. https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_21.pdf

Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., Piot, B., Kavukcuoglu, K., Munos, R., & Valko, M. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 21271–21284. https://proceedings.neurips.cc/paper_files/paper/2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf

Guo, Z., Ding, C., Hu, X., & Rudin, C. (2021). A supervised machine learning semantic segmentation approach for detecting artifacts in plethysmography signals from wearables. Physiological Measurement, 42(12), Article 125003. https://doi.org/10.1088/1361-6579/ac3b3d

Hannun, A. Y., Rajpurkar, P., Haghpanahi, M., Tison, G. H., Bourn, C., Turakhia, M. P., & Ng, A. Y. (2019). Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine, 25(1), 65–69. https://doi.org/10.1038/s41591-018-0268-3

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90

Kim, M.-G., & Pan, S. B. (2019). Deep learning based on 1-D ensemble networks using ECG for real-time user recognition. IEEE Transactions on Industrial Informatics, 15(10), 5656–5663. https://doi.org/10.1109/TII.2019.2909730

Kiranyaz, S., Ince, T., & Gabbouj, M. (2016). Real-time patient-specific ECG classification by 1-D convolutional neural networks. IEEE Transactions on Biomedical Engineering, 63(3), 664–675. https://doi.org/10.1109/TBME.2015.2468589

Krivoshei, L., Weber, S., Burkard, T., Maseli, A., Brasier, N., Kühne, M., Conen, D., Huebner, T., Seeck, A., & Eckstein, J. (2017). Smart detection of atrial fibrillation. EP Europace, 19(5), 753–757. https://doi.org/10.1093/europace/euw125

Kumar, M., Pachori, R. B., & Rajendra Acharya, U. (2018). Automated diagnosis of atrial fibrillation ECG signals using entropy features extracted from flexible analytic wavelet transform. Biocybernetics and Biomedical Engineering, 38(3), 564–573. https://doi.org/10.1016/j.bbe.2018.04.004

Lippi, G., Sanchis-Gomar, F., & Cervellin, G. (2021). Global epidemiology of atrial fibrillation: An increasing epidemic and public health challenge. International Journal of Stroke, 16(2), 217–221. https://doi.org/10.1177/1747493019897870

Makowski, D., Pham, T., Lau, Z. J., Brammer, J. C., Lespinasse, F., Pham, H., Schölzel, C., & Chen, S. H. A. (2021). NeuroKit2: A Python toolbox for neurophysiological signal processing. Behavior Research Methods, 53(4), 1689–1696. https://doi.org/10.3758/s13428-020-01516-y

Martis, R. J., Acharya, U. R., Adeli, H., Prasad, H., Tan, J. H., Chua, K. C., Too, C. L., Yeo, S. W. J., & Tong, L. (2014). Computer aided diagnosis of atrial arrhythmia using dimensionality reduction methods on transform domain representation. Biomedical Signal Processing and Control, 13, 295–305. https://doi.org/10.1016/j.bspc.2014.04.001

Oh, S. L., Ng, E. Y., Tan, R. S., & Acharya, U. R. (2018). Automated diagnosis of arrhythmia using combination of CNN and LSTM techniques with variable length heart beats. Computers in Biology and Medicine, 102, 278–287. https://doi.org/10.1016/j.compbiomed.2018.06.002

Poh, M.-Z., Poh, Y. C., Chan, P.-H., Wong, C.-K., Pun, L., Leung, W. W.-C., Wong, Y.-F., Wong, M. M.-Y., Chu, D. W.-S., & Siu, C.-W. (2018). Diagnostic assessment of a deep learning system for detecting atrial fibrillation in pulse waveforms. Heart, 104(23), 1921–1928. https://doi.org/10.1136/heartjnl-2018-313147

Pourbabaee, B., Roshtkhari, M. J., & Khorasani, K. (2018). Deep convolutional neural networks and learning ECG features for screening paroxysmal atrial fibrillation patients. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 48(12), 2095–2104. https://doi.org/10.1109/TSMC.2017.2705582

Qayyum, A., Meriaudeau, F., & Chan, G. C. Y. (2018). Classification of atrial fibrillation with pre-trained convolutional neural network models. In Proceedings of the 2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (pp. 594–599). IEEE. https://doi.org/10.1109/IECBES.2018.8626624

Salih, M., Abdel-Hafez, O., Ibrahim, R., & Nair, R. (2021). Atrial fibrillation in the elderly population: Challenges and management considerations. Journal of Arrhythmia, 37(4), 912–921. https://doi.org/10.1002/joa3.12580

Sannino, G., & De Pietro, G. (2018). A deep learning approach for ECG-based heartbeat classification for arrhythmia detection. Future Generation Computer Systems, 86, 446–455. https://doi.org/10.1016/j.future.2018.03.057

Sañudo, B., De Hoyo, M., Muñoz-López, A., Perry, J., & Abt, G. (2019). Pilot study assessing the influence of skin type on the heart rate measurements obtained by photoplethysmography with the apple watch. Journal of Medical Systems, 43(7), Article 195. https://doi.org/10.1007/s10916-019-1325-2

Shan, S.-M., Tang, S.-C., Huang, P.-W., Lin, Y.-M., Huang, W.-H., Lai, D.-M., & Wu, A.-Y. A. (2016). Reliable PPG-based algorithm in atrial fibrillation detection. In Proceedings of the 2016 IEEE Biomedical Circuits and Systems Conference (pp. 340–343). IEEE. https://doi.org/10.1109/BioCAS.2016.7833801

Shashikumar, S. P., Shah, A. J., Li, Q., Clifford, G. D., & Nemati, S. (2017). A deep learning approach to monitoring and detecting atrial fibrillation using wearable technology. In Proceedings of the 2017 IEEE EMBS International Conference on Biomedical & Health Informatics (pp. 141–144). IEEE. https://doi.org/10.1109/BHI.2017.7897225

Sun, X., & Xu, W. (2014). Fast implementation of DeLong’s algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Processing Letters, 21(11), 1389–1393. https://doi.org/10.1109/LSP.2014.2337313

Tang, S.-C., Huang, P.-W., Hung, C.-S., Shan, S.-M., Lin, Y.-H., Shieh, J.-S., Lai, D.-M., Wu, A.-Y., & Jeng, J.-S. (2017). Identification of atrial fibrillation by quantitative analyses of fingertip photoplethysmogram. Scientific Reports, 7, Article 45644. https://doi.org/10.1038/srep45644

Tison, G. H., Sanchez, J. M., Ballinger, B., Singh, A., Olgin, J. E., Pletcher, M. J., Vittinghoff, E., Lee, E. S., Fan, S. M., Gladstone, R. A., Mikell, C., Sohoni, N., Hsieh, J., & Marcus, G. M. (2018). Passive detection of atrial fibrillation using a commercially available smartwatch. JAMA Cardiology, 3(5), 409–416. https://doi.org/10.1001/jamacardio.2018.0136

Torres-Soto, J., & Ashley, E. A. (2020). Multi-task deep learning for cardiac rhythm detection in wearable devices. NPJ Digital Medicine, 3, Article 116. https://doi.org/10.1038/s41746-020-00320-4

Tutuko, B., Nurmaini, S., Tondas, A. E., Rachmatullah, M. N., Darmawahyuni, A., Esafri, R., Firdaus, F., & Sapitri, A. I. (2021). AFibNet: An implementation of atrial fibrillation detection with convolutional neural network. BMC Medical Informatics and Decision Making, 21(1), Article 216. https://doi.org/10.1186/s12911-021-01571-1

van Gent, P., Farah, H., van Nes, N., & van Arem, B. (2019a). Analysing noisy driver physiology real-time using off-the-shelf sensors: Heart rate analysis software from the Taking the Fast Lane project. Journal of Open Research Software, 7(1), Article 32. https://doi.org/10.5334/jors.241

van Gent, P., Farah, H., van Nes, N., & van Arem, B. (2019b). HeartPy: A novel heart rate algorithm for the analysis of noisy signals. Transportation Research Part F: Traffic Psychology and Behaviour, 66, 368–378. https://doi.org/10.1016/j.trf.2019.09.015

Voisin, M., Shen, Y., Aliamiri, A., Avati, A., Hannun, A., & Ng, A. (2019). Ambulatory atrial fibrillation monitoring using wearable photoplethysmography with deep learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1909–1916). Association for Computing Machinery. https://doi.org/10.1145/3292500.3330657

Wang, J. (2020). A deep learning approach for atrial fibrillation signals classification based on convolutional and modified Elman neural network. Future Generation Computer Systems, 102, 670–679. https://doi.org/10.1016/j.future.2019.09.012

Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021). Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMap, and PaCMAP for data visualization. Journal of Machine Learning Research, 22, Article 201. http://jmlr.org/papers/v22/20-1061.html

Xia, Y., Wulan, N., Wang, K., & Zhang, H. (2018). Detecting atrial fibrillation by deep convolutional neural networks. Computers in Biology and Medicine, 93, 84–92. https://doi.org/10.1016/j.compbiomed.2017.12.007

Ying, X. (2019). An overview of overfitting and its solutions. Journal of Physics: Conference Series, 1168(2), Article 022022. https://doi.org/10.1088/1742-6596/1168/2/022022

Zech, J. R., Badgeley, M. A., Liu, M., Costa, A. B., Titano, J. J., & Oermann, E. K. (2018). Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine, 15(11), Article e1002683. https://doi.org/10.1371/journal.pmed.1002683

Zhang, Y., Xiang, T., Hospedales, T. M., & Lu, H. (2018). Deep mutual learning. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (pp. 4320–4328). IEEE. https://doi.org/10.1109/CVPR.2018.00454


Appendix

A.1. Bootstrap by Patient Calculations

We conducted bootstrapping tests with N = 1,000 samples to calculate the confidence intervals for statistical significance comparisons. Each bootstrap sample is constructed by randomly sampling patients with replacement from each test set (i.e., P random patients from a test set containing P unique patients). For each bootstrap sample, AUROC (area under the receiver-operator-characteristic curve) and AUPRC (area under the precision-recall curve) scores were calculated and collected. We calculated the 95% confidence interval (CI) as μ ± 1.96 σ/√N, where μ is the mean of the collected bootstrap samples’ AUROC or AUPRC scores, σ is their standard deviation, and N = 1,000 is the number of bootstrap samples.
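For readers who wish to reproduce this procedure, the following is a minimal Python sketch of the patient-level bootstrap described above. It is not the released SiamAF code; the function name, array arguments, and the use of average precision as the AUPRC estimate are our own illustrative choices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def bootstrap_by_patient(y_true, y_score, patient_ids, n_boot=1000, seed=0):
    """Patient-level bootstrap of AUROC and AUPRC with 95% CIs.

    y_true, y_score, and patient_ids are aligned 1-D arrays; each element
    corresponds to one signal segment. Patients (not segments) are resampled
    with replacement, as described in Appendix A.1.
    """
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    patient_ids = np.asarray(patient_ids)
    unique_patients = np.unique(patient_ids)
    # Pre-compute the segment indices belonging to each patient.
    idx_by_patient = {p: np.flatnonzero(patient_ids == p) for p in unique_patients}

    aurocs, auprcs = [], []
    for _ in range(n_boot):
        sampled = rng.choice(unique_patients, size=len(unique_patients), replace=True)
        idx = np.concatenate([idx_by_patient[p] for p in sampled])
        yt, ys = y_true[idx], y_score[idx]
        if yt.min() == yt.max():  # skip degenerate samples containing only one class
            continue
        aurocs.append(roc_auc_score(yt, ys))
        auprcs.append(average_precision_score(yt, ys))  # AUPRC estimated via average precision

    def ci(scores):
        scores = np.asarray(scores)
        half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
        return scores.mean(), scores.mean() - half_width, scores.mean() + half_width

    return {"AUROC": ci(aurocs), "AUPRC": ci(auprcs)}
```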

A.2. Statistical Significance Test Results

This section contains the p values of the pairwise statistical significance tests between the tested models. To compare the AUROC scores of any two models, we used the pairwise DeLong test (DeLong et al., 1988), computed with a Python implementation of the fast DeLong algorithm (Sun & Xu, 2014). To compare the AUPRC scores of any two models, we used a pairwise t test. After adjusting the threshold p value with Bonferroni correction (p_thresh = 0.05 / number of tests), all pairwise comparisons are statistically significant except the comparison of the proposed method’s and DeepBeat’s AUPRC scores on the Stanford data set; in this case, our proposed model performed comparably to the DeepBeat baseline on DeepBeat’s in-distribution test set. The tables below show the p value results. Both ← and ↓ indicate the model with the better performance: ← points toward the row label on the left, and ↓ points toward the column label at the bottom of each table.
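As an illustration of the AUPRC comparison and the Bonferroni adjustment (not the exact code used in this study), the sketch below applies a paired t test to two models’ per-bootstrap AUPRC scores and compares the resulting p value against the adjusted threshold; the AUROC comparison would instead use a DeLong test implementation, which is omitted here. The function name and arguments are hypothetical.

```python
from scipy import stats

def compare_auprc(auprc_model_a, auprc_model_b, n_tests, alpha=0.05):
    """Paired t test on per-bootstrap AUPRC scores with Bonferroni correction.

    auprc_model_a / auprc_model_b: AUPRC scores of two models computed on the
    same bootstrap samples (e.g., the lists collected in Appendix A.1).
    n_tests: total number of pairwise comparisons, used to adjust the threshold.
    """
    t_stat, p_value = stats.ttest_rel(auprc_model_a, auprc_model_b)
    p_thresh = alpha / n_tests  # Bonferroni-adjusted significance threshold
    return {
        "p_value": p_value,
        "p_thresh": p_thresh,
        "significant": p_value < p_thresh,
        "better_model": "A" if t_stat > 0 else "B",  # higher mean AUPRC
    }
```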

Table A1. Pairwise p_AUROC and p_AUPRC values between all tested models on the Simband photoplethysmography (PPG) data set.

| ResNet-34 (PPG) | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ |                                    |                                    |
| DML             | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ | p_AUROC = 0.00 ←, p_AUPRC = 0.00 ← |                                    |
| DeepBeat        | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ |
|                 | SiamAF                             | ResNet-34 (PPG)                    | DML                                |


Table A2. Pairwise p_AUROC and p_AUPRC values between all tested models on the Institution B photoplethysmography (PPG) data set.

| ResNet-34 (PPG) | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ |                                    |                                    |
| DML             | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ← | p_AUROC = 0.00 ←, p_AUPRC = 0.00 ← |                                    |
| DeepBeat        | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ |
|                 | SiamAF                             | ResNet-34 (PPG)                    | DML                                |


Table A3. Pairwise p_AUROC and p_AUPRC values between all tested models on the Stanford data set. PPG = photoplethysmography.

| ResNet-34 (PPG) | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ |                                    |                                    |
| DML             | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ | p_AUROC = 0.00 ←, p_AUPRC = 0.00 ← |                                    |
| DeepBeat        | p_AUROC = 0.00 ←, p_AUPRC = 0.0046 | p_AUROC = 0.00 ←, p_AUPRC = 0.00 ← | p_AUROC = 0.00 ←, p_AUPRC = 0.00 ← |
|                 | SiamAF                             | ResNet-34 (PPG)                    | DML                                |


Table A4. Pairwise p_AUROC and p_AUPRC values between all tested models on the Simband electrocardiogram (ECG) data set.

| ResNet-34 (ECG) | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ |                                    |
| DML             | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ | p_AUROC = 0.00 ←, p_AUPRC = 0.00 ← |
|                 | SiamAF                             | ResNet-34 (ECG)                    |


Table A5. Pairwise p_AUROC and p_AUPRC values between all tested models on the Institution B electrocardiogram (ECG) data set.

| ResNet-34 (ECG) | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ |                                    |
| DML             | p_AUROC = 0.00 ↓, p_AUPRC = 0.00 ↓ | p_AUROC = 0.00 ←, p_AUPRC = 0.00 ← |
|                 | SiamAF                             | ResNet-34 (ECG)                    |

A.3. Institution A Participant Demographic Information

Table A6 presents detailed participant demographic information for the Institution A data set.

Table A6. Patient demographic information in the Institution A data set.

| Demographic Category                          | Patient Counts | Percentage |
| Gender                                        |                |            |
|   Male                                        | 15,330         | 53.7%      |
|   Female                                      | 13,203         | 46.2%      |
|   Others                                      | 6              | 0%         |
| Age                                           |                |            |
|   <22 years                                   | 3,925          | 13.8%      |
|   22–39 years                                 | 2,715          | 9.5%       |
|   40–54 years                                 | 4,372          | 25.3%      |
|   55–64 years                                 | 5,370          | 18.8%      |
|   ≥65 years                                   | 12,157         | 42.6%      |
| Race                                          |                |            |
|   White or Caucasian                          | 15,890         | 55.7%      |
|   Black or African American                   | 2,159          | 7.4%       |
|   Asian                                       | 4,364          | 15.0%      |
|   Native Hawaiian or Other Pacific Islander   | 426            | 1.46%      |
|   American Indian or Alaska Native            | 212            | 0.7%       |
|   Unknown/Declined                            | 1,149          | 4%         |
|   Others                                      | 4,913          | 16.9%      |

A.4. Performance Comparisons and Values

This section reports the detailed evaluation results (Tables A7 and A8).

Table A7. AUROC (area under the receiver-operator-characteristic curve) comparisons of models on different test sets. N/A means that the method only works for the other modality, that is, either electrocardiogram (ECG) or photoplethysmography (PPG).

| Data set / AUROC    | ResNet-34 (PPG)       | ResNet-34 (ECG)       | DML                   | DeepBeat              | SiamAF                | SiamAF (1% training labels) |
| Simband (ECG)       | N/A                   | 0.724 [0.722, 0.727]  | 0.721 [0.718, 0.723]  | N/A                   | 0.747 [0.744, 0.750]  | 0.729 [0.726, 0.732]        |
| Simband (PPG)       | 0.879 [0.878, 0.881]  | N/A                   | 0.891 [0.889, 0.892]  | 0.870 [0.868, 0.871]  | 0.914 [0.913, 0.916]  | 0.900 [0.898, 0.901]        |
| Institution B (ECG) | N/A                   | 0.890 [0.887, 0.893]  | 0.905 [0.902, 0.908]  | N/A                   | 0.927 [0.925, 0.929]  | 0.899 [0.897, 0.902]        |
| Institution B (PPG) | 0.918 [0.916, 0.920]  | N/A                   | 0.920 [0.918, 0.922]  | 0.872 [0.870, 0.875]  | 0.924 [0.922, 0.925]  | 0.907 [0.905, 0.909]        |
| Stanford test set   | 0.763 [0.761, 0.764]  | N/A                   | 0.764 [0.763, 0.766]  | 0.883 [0.882, 0.884]  | 0.877 [0.876, 0.878]  | 0.800 [0.799, 0.801]        |


Table A8. AUPRC (area under the precision-recall curve) comparisons of models on different test sets. ECG = electrocardiogram; PPG = photoplethysmography.

| Data set / AUPRC    | ResNet-34 (PPG)       | ResNet-34 (ECG)       | DML                   | DeepBeat              | SiamAF                | SiamAF (1% training labels) |
| Simband (ECG)       | N/A                   | 0.621 [0.617, 0.625]  | 0.675 [0.671, 0.679]  | N/A                   | 0.732 [0.729, 0.736]  | 0.661 [0.658, 0.665]        |
| Simband (PPG)       | 0.841 [0.838, 0.843]  | N/A                   | 0.847 [0.844, 0.845]  | 0.799 [0.796, 0.803]  | 0.865 [0.863, 0.868]  | 0.841 [0.838, 0.844]        |
| Institution B (ECG) | N/A                   | 0.726 [0.720, 0.732]  | 0.749 [0.743, 0.755]  | N/A                   | 0.765 [0.759, 0.772]  | 0.736 [0.731, 0.742]        |
| Institution B (PPG) | 0.768 [0.762, 0.774]  | N/A                   | 0.778 [0.772, 0.784]  | 0.607 [0.601, 0.614]  | 0.773 [0.768, 0.779]  | 0.744 [0.739, 0.749]        |
| Stanford test set   | 0.582 [0.577, 0.586]  | N/A                   | 0.613 [0.609, 0.617]  | 0.726 [0.723, 0.730]  | 0.726 [0.722, 0.729]  | 0.565 [0.560, 0.569]        |

A.5. λ Value Selection and Evaluated Performances

Figure A6. The test set performance of our proposed model with different λ values ranging from 0.2 to 1.2 in increments of 0.2. AUROC = area under the receiver-operator-characteristic curve; AUPRC = area under the precision-recall curve; ECG = electrocardiogram; PPG = photoplethysmography.


Data Repository/Code

All development code for this study is publicly available at https://github.com/chengstark/SiamAF. The retrospective use of the Institution A data set in this study was conducted according to the terms of a data use agreement between UCSF and Emory University. As a result, the data set is not available for public use without additional institutional approvals. Similarly, the Institution B data set is not available for public use. To obtain access to the private data sets used in this study, please contact the corresponding author. Requests will be evaluated in accordance with the data use agreement established between the institutions and may require escalation to the relevant institutional data governance committee(s). The remaining external data sets used for evaluation in this study are publicly available.


©2025 Zhicheng Guo, Cheng Ding, Duc Do, Amit Shah, Randall J. Lee, Xiao Hu, and Cynthia Rudin. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
