Atrial fibrillation (AF) is the most common type of cardiac arrhythmia. It is associated with an increased risk of stroke, heart failure, and other cardiovascular complications, but can be clinically silent. The wide availability of consumer wearables has made continuous AF monitoring possible thanks to built-in electrocardiogram (ECG) and photoplethysmography (PPG) sensors. ECG is the gold standard for AF diagnosis, but because it requires the subject to remain stationary, it cannot be passively collected on consumer devices, making it impractical for continuous monitoring. PPG, on the other hand, offers the possibility of passive monitoring, but it is susceptible to human movements and other environmental factors, and the resulting noise and artifacts pose a significant challenge for AF detection. Ideally, we would get the best of both worlds: the accuracy of AF detection with ECG and the broad applicability of PPG. We propose a new approach, SiamAF, to bridge the gap between ECG and PPG signals that leverages ECG and PPG signal pairs available from wearables and hospital monitors. Our method helps elevate the predictive power of the (broadly applicable but noisy) PPG signals to a level comparable to that of (less applicable but more accurate) ECG. Specifically, we train a deep neural network to learn similar information from both ECG and PPG in a contrastive fashion, which encourages it to learn medically relevant features in both signal types. At inference time, the proposed model is able to accurately predict AF from either PPG or ECG and outperforms baseline methods on three external test sets.
Keywords: atrial fibrillation, PPG, ECG, Siamese network, contrastive learning, joint training
Atrial fibrillation (AF) is a common heart rhythm disorder, often without noticeable symptoms. Continuous and passive monitoring of AF is now possible due to the growing popularity of consumer wearable devices with built-in electrocardiogram (ECG) and photoplethysmography (PPG) sensors. Clinically, ECG is the gold standard for diagnosing AF; however, collecting ECG requires the patient to remain stationary, rendering it unsuitable for daily monitoring. On the other hand, PPG sensors can be used for passive monitoring during normal daily activities, but PPG is easily corrupted by movements and environmental factors such as ambient light in the room. The noise in the PPG signals poses significant challenges to AF detection. Ideally, we would get the best of both worlds: the accuracy of AF detection with ECG and the broad applicability of PPG. We propose a new approach, SiamAF, to bridge the gap between ECG and PPG signals that leverages ECG and PPG signal pairs available from wearables and hospital monitors. By encouraging our deep learning model to extract similar information from ECG and PPG signals, the model focuses more on medically relevant features in both ECG and PPG signals. Our method thus helps elevate the predictive power of the (broadly applicable but noisy) PPG signals to a level comparable to that of (less applicable but more accurate) ECG. Through external validations on data across several hospitals and wearable devices, we demonstrate that SiamAF outperforms other methods, offering robust and accurate AF detection.
Atrial fibrillation (AF) is the most prevalent form of cardiac arrhythmia, affecting approximately 1–2% of the general population and up to 9% of individuals aged 65 years or older (Colilla et al., 2013; Salih et al., 2021). AF contributed to over 183,000 deaths in the United States alone in 2019 (Centers for Disease Control and Prevention, National Center for Health Statistics, n.d.), and its prevalence is projected to continue increasing over the next 30 years (Lippi et al., 2021). Paroxysmal AF episodes often present few (or no) symptoms that patients notice immediately, yet they are crucial precursors to more serious health conditions, including ischemic stroke and congestive heart failure. Therefore, the early detection of AF episodes has become imperative in the treatment and prevention of cardiovascular disease.
In recent years, AF detection has been transformed by the growing popularity of wearable devices with photoplethysmography (PPG) sensors, such as smartwatches. Unlike ECG collection, which requires electrode contact, PPG signals are gathered effortlessly through photonic sensors, enabling uninterrupted monitoring during daily activities. These PPG-capable devices are widely available, providing a noninvasive means of continuous heart rhythm monitoring.
While commercially available smartwatches have expanded the scope of passive monitoring for potential AF, the electrocardiogram (ECG) remains the clinical gold standard for diagnosing atrial fibrillation due to its superior accuracy and the richer diagnostic information it provides compared to PPG. Specifically, PPG signals collected from smartwatches are not reliable enough to form AF diagnoses, primarily because of their susceptibility to environmental noise and human movements, which poses challenges for automated downstream AF detectors (Guo et al., 2021; Sañudo et al., 2019). Patients often still require hospital visits and stationary ECG monitoring to be diagnosed with AF. This reliance on traditional methods creates a bottleneck in the AF diagnostic process due to the human effort involved. As a result, it is imperative to enhance PPG-based AF detection by leveraging the medically relevant information in PPG to match what we see in ECG signals.
While there has been much work using machine learning to improve AF detection from PPG (Corino et al., 2017; Eerikainen et al., 2020; Kiranyaz et al., 2016; Krivoshei et al., 2017; Kumar et al., 2018; Martis et al., 2014; Poh et al., 2018; Shan et al., 2016; Shashikumar et al., 2017; Tang et al., 2017; Torres-Soto & Ashley, 2020; Tutuko et al., 2021; Voisin et al., 2019; J. Wang, 2020), there is much room for improvement. It is possible, for instance, that subtle information in PPG becomes more apparent in a simultaneous ECG; as discussed above, ECG is much easier to learn from (but cannot be collected nearly as easily). Past work used signals from PPG and labels from ECG, since only PPG is available at test time. However, this setup may be inefficient: it may require an enormous amount of PPG data to learn the more subtle signals in PPG that are more visible in simultaneous ECG.
In this work, we hypothesize that important information from PPG can be learned more efficiently if we leverage ECG during training—even if it is not available at test time. That is, by learning a shared latent space between PPG and ECG signals and encouraging the model to learn similar features between ECG and PPG, relevant information that would be much harder to learn from PPG alone would become easier to identify if it were ‘mapped’ to ECG in the latent space. By this logic, it is possible that test performance from PPG signals could improve, even so far as if we had trained the model on (clean) ECG rather than (noisy) PPG. This is precisely the goal of the present work.
In addition, from a model-learning perspective, training with both ECG and PPG lets the two modalities serve as regularizers for each other, protecting against potential mislabeling and poor signal quality in the data set.
Analogously, this same idea could be used to improve ECG analysis by using information in PPG more effectively, so that even ECG test performance could improve using the shared latent space. In fact, for both PPG-only test sets and ECG-only test sets, our experiments show improved performance over other approaches.
This study has the following contributions:
To our knowledge, this is the first study leveraging shared information between ECG and PPG signals for AF detection. Our novel deep learning architecture and loss function design encourage the model to learn shared information and improve prediction robustness.
The proposed method outperforms baseline methods, including deep mutual learning (Zhang et al., 2018) and previous CNN-based ECG or PPG single modality AF detection networks (Torres-Soto & Ashley, 2020) on three external test sets containing diverse patient conditions and recording hardware.
We investigate the information learned by the model and verify the effectiveness of our proposed method through dimension reduction and visualizations of latent features.
Rather than learning two separate models for the ECG and PPG modalities, the proposed method learns a single model that can be used for AF detection on either ECG or PPG signals, with no performance sacrifice for either modality.
We consider related work in AF detection with hand-crafted features, deep learning for AF detection, and Siamese networks.
Hand-Crafted Features for AF Detection. Multiple past works have developed hand-crafted features of the ECG or PPG input signals for training machine learning–based AF classifiers. These include the root mean square of the successive difference (RMSSD) of peak-to-peak intervals, Shannon entropy (ShE), spectral analysis, dynamic time warping for waveform shape analysis, and template matching, as well as other statistical features such as mean, variance, standard deviation, skewness, and kurtosis of input signals (Corino et al., 2017; Eerikainen et al., 2020; Krivoshei et al., 2017; Kumar et al., 2018; Martis et al., 2014; Shan et al., 2016; Tang et al., 2017). These features are fed into standard machine learning classification algorithms.
Most hand-crafted features rely on accurate peak detection in the raw signals, which is often unreliable when signal quality is poor. These challenges have made hand-crafted feature approaches difficult to apply across different populations and demographics.
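To make this feature family concrete, the sketch below computes two representative hand-crafted features, RMSSD and the Shannon entropy of the peak-to-peak interval distribution, from precomputed peak indices. It is a minimal sketch; the function names, bin count, and example values are illustrative assumptions, not taken from any of the cited works.

```python
import numpy as np

def rmssd(peak_indices, fs):
    """Root mean square of successive differences of peak-to-peak intervals (in seconds)."""
    intervals = np.diff(peak_indices) / fs            # peak-to-peak intervals in seconds
    return np.sqrt(np.mean(np.diff(intervals) ** 2))  # RMS of successive interval differences

def shannon_entropy(peak_indices, fs, n_bins=16):
    """Shannon entropy (bits) of the peak-to-peak interval distribution."""
    intervals = np.diff(peak_indices) / fs
    hist, _ = np.histogram(intervals, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]                                      # drop empty bins to avoid log(0)
    return -np.sum(p * np.log2(p))

# Example: hypothetical peak indices from a 30-s segment sampled at 80 Hz.
peaks = np.array([40, 120, 190, 275, 350, 440, 510, 600])
print(rmssd(peaks, fs=80), shannon_entropy(peaks, fs=80))
```

Both features grow with the irregularity of the heartbeat intervals, which is why they are informative for AF; they also fail silently when the upstream peak detector misses or hallucinates peaks, illustrating the fragility discussed above.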
Deep Learning for AF Detection. Deep learning networks automatically learn and extract features from raw inputs. Deep convolutional neural networks (DCNN/CNN) have been popular as feature extractors for both images (Anwar et al., 2018) and time series due to their ability to preserve useful information, and several CNN-based AF detection algorithms have been proposed (Kiranyaz et al., 2016; Poh et al., 2018; Shashikumar et al., 2017; Torres-Soto & Ashley, 2020; Tutuko et al., 2021; Voisin et al., 2019; J. Wang, 2020). Some researchers convert the 1-D signals into 2-D images using the short-term Fourier transform (STFT) and stationary wavelet transform (SWT) (Qayyum et al., 2018; Xia et al., 2018) for 2-D CNNs. Because ECG and PPG signals are time series, many researchers leverage recurrent neural networks (RNNs) for AF detection, given their ability to capture temporal relations and their flexibility with variable-length inputs; RNN variants, including the long short-term memory (LSTM) network and recurrent-convolutional neural networks, have also been used for AF detection (Andersen et al., 2019; C. Chen et al., 2020; Fan et al., 2018; Gotlibovych et al., 2018; Oh et al., 2018). While deep neural networks are more adaptable than machine learning methods based on hand-crafted features, they require significantly more labeled data to achieve performance equivalent to the best AF detectors. Deep learning models are also prone to overfitting, especially on trivial, task-irrelevant features. This is often demonstrated through the sensitivity of deep neural nets to low-quality signals containing artifacts; we can expect worse performance from trained models on such signals (Ding et al., 2024; Ying, 2019). Deep learning methods also suffer from a lack of interpretability. It is difficult, if not impossible, for humans to understand the discovered features and decision processes of deep neural networks; combined with the susceptibility to overfitting, this leads to an increased risk of undetected failure in real-world applications when operating conditions change or the model fails to generalize (Beede et al., 2020; Zech et al., 2018).
Two Modalities. Previous studies trained their models using only ECG or only PPG signals (Andersen et al., 2019; C. Chen et al., 2020; Corino et al., 2017; Eerikainen et al., 2020; Fan et al., 2018; Gotlibovych et al., 2018; Hannun et al., 2019; Kim & Pan, 2019; Kiranyaz et al., 2016; Krivoshei et al., 2017; Kumar et al., 2018; Martis et al., 2014; Oh et al., 2018; Poh et al., 2018; Pourbabaee et al., 2018; Qayyum et al., 2018; Sannino & De Pietro, 2018; Shan et al., 2016; Shashikumar et al., 2017; Tang et al., 2017; Tison et al., 2018; Torres-Soto & Ashley, 2020; Tutuko et al., 2021; Voisin et al., 2019; J. Wang, 2020; Xia et al., 2018), despite the availability of both modalities measured simultaneously on the same individuals. As we will show, both ECG and PPG signals carry crucial information for detecting AF; neither modality should be wasted during training.
Siamese Networks. The Siamese network architecture (models with two branches or subnetworks) is useful for dual input scenarios. In recent years, Siamese networks have seen a rise in popularity due to their usefulness for self-supervised learning (T. Chen et al., 2020; X. Chen & He, 2021; Grill et al., 2020). However, Siamese self-supervised networks are difficult to train and rely heavily on image augmentations.
Figure 1. The architecture of our proposed framework for learning from photoplethysmography (PPG) and electrocardiogram (ECG) and predicting (i.e., testing) on only either ECG or PPG. At training time, the model takes both ECG and PPG signals as inputs. In each training iteration, the two signal modalities take turns flowing through the network following each of the colored paths; that is, the PPG signal flows through the red path and the ECG signal flows through the purple path, then the ECG signal flows through the red path and the PPG signal flows through the purple path. In the configuration shown in the figure, the PPG and ECG signal inputs go through the shared encoder and onward through the projector, predictor, and supervised classification head.
The training data is denoted as $\{(x_i^{\mathrm{ECG}}, x_i^{\mathrm{PPG}}, y_i)\}_{i=1}^{N}$, a set of time-synchronized ECG and PPG signal pairs, where $y_i$ is the binary AF vs. Non-AF label shared by the $i$-th pair. The proposed framework consists of the following components:
A neural network encoder $f(\cdot)$ (a ResNet-34 in our experiments) that maps a raw input signal to a latent feature vector.
A multilayer-perceptron projector $g(\cdot)$ that maps the encoder output into the space in which agreement between the two modalities is enforced; we write $z = g(f(x))$ for the projection of an input $x$.
A linear layer predictor $q(\cdot)$ that transforms the projection of one branch to align it with the projection of the other branch; we write $p = q(z)$.
A supervised branch classification head that produces the AF vs. Non-AF prediction from the encoder features.
A loss function combining both the agreement and classification objectives. The agreement objective is to maximize the cosine similarity (equivalently, to minimize the cosine distance) between the latent representations of a time-synchronized ECG–PPG pair, which encourages the encoder to extract the information shared by the two modalities. Following the SimSiam formulation (X. Chen & He, 2021), the symmetrized agreement loss can be written as

$$\mathcal{L}_{\mathrm{agree}} = \tfrac{1}{2}\,\mathcal{D}\big(p^{\mathrm{PPG}}, z^{\mathrm{ECG}}\big) + \tfrac{1}{2}\,\mathcal{D}\big(p^{\mathrm{ECG}}, z^{\mathrm{PPG}}\big), \qquad \mathcal{D}(p, z) = -\,\frac{p}{\lVert p \rVert_2} \cdot \frac{z}{\lVert z \rVert_2}.$$

Here $\mathcal{D}(p, z)$ is the negative cosine similarity between a predictor output $p$ from one branch and a projector output $z$ from the other. The classification objective $\mathcal{L}_{\mathrm{cls}}$ is the cross-entropy between the classification head's outputs and the reference labels, and the total training loss is

$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda\, \mathcal{L}_{\mathrm{agree}},$$

where $\lambda$ is a hyperparameter balancing the two objectives (see Figure A6 for a sensitivity analysis).

In cases of incomplete ECG and PPG pairs (either ECG or PPG signals alone) in the training data, we can accommodate the unpaired signals in our framework by adding two additional loss terms: one supervised cross-entropy term for the additional PPG data and one for the additional ECG data.
In addition to the previously introduced model components, Algorithm 1 illustrates the training algorithm of the proposed framework. After training is complete, we discard the projector and the predictor; only the encoder and the supervised classification head are retained, so a single model can predict AF from either an ECG or a PPG input at inference time.
Algorithm 1. The training procedure of the proposed framework: in each iteration, the paired ECG and PPG signals take turns flowing through the projector and predictor paths, and the combined agreement and classification loss is used to update the network.
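To make the training procedure concrete, below is a minimal PyTorch sketch of one training iteration under the loss defined above. The module names mirror the components listed in this section, but the sketch is illustrative rather than the released implementation; in particular, the stop-gradient on the projector branch is an assumption carried over from the SimSiam formulation (X. Chen & He, 2021) rather than a detail stated here.

```python
import torch
import torch.nn.functional as F

def neg_cosine(p, z):
    # Negative cosine similarity; z is detached (SimSiam-style stop-gradient, an assumption here).
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def training_step(encoder, projector, predictor, classifier,
                  ecg, ppg, labels, lam=1.0):
    """One SiamAF-style iteration. ecg, ppg: (batch, 1, 2400) paired 30-s segments;
    labels: (batch,) binary AF vs. Non-AF; lam: the agreement-loss weight lambda."""
    f_ecg, f_ppg = encoder(ecg), encoder(ppg)           # latent features
    z_ecg, z_ppg = projector(f_ecg), projector(f_ppg)   # projections
    p_ecg, p_ppg = predictor(z_ecg), predictor(z_ppg)   # predictions

    # Symmetrized agreement loss: each modality takes a turn through the predictor path,
    # mirroring the two colored paths in Figure 1.
    l_agree = 0.5 * neg_cosine(p_ppg, z_ecg) + 0.5 * neg_cosine(p_ecg, z_ppg)

    # Supervised classification loss on both modalities (binary cross-entropy assumed).
    logits = torch.cat([classifier(f_ecg), classifier(f_ppg)]).squeeze(-1)
    targets = labels.float().repeat(2)
    l_cls = F.binary_cross_entropy_with_logits(logits, targets)

    return l_cls + lam * l_agree
```

At inference time, only `encoder` and `classifier` from this sketch would be kept, matching the single-model deployment described above.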
We used a large-scale data set from Institution A (University of California San Francisco Medical Center) for training and three additional data sets for evaluation: a data set from Institution B (University of California Los Angeles Medical Center), the test split of the Stanford data set (Torres-Soto & Ashley, 2020), and the Simband data set (Shashikumar et al., 2017).
The Institution A (University of California San Francisco Medical Center) data set contains 28,539 patients in hospital settings; the patients’ continuous ECG and PPG signals were recorded from bedside monitors. The bedside monitors produced alarms for events including atrial fibrillation (AF), premature ventricular contraction (PVC), tachycardia, couplets (consecutive abnormal heartbeats), and so on. This study focuses on AF, PVC, and normal sinus rhythm (NSR). The samples with PVC and NSR labels were combined into the Non-AF group, forming an AF vs. Non-AF binary classification task. The continuous ECG and PPG signals were sliced into 30-second nonoverlapping segments sampled at 240 Hz (each signal strip contains 7,200 timesteps). The 30-second segments were then downsampled to 2,400 timesteps (an 80-Hz sampling rate). During the preprocessing step, invalid samples (e.g., empty signal files, missing ECG or PPG signals) were filtered out. There are four ECG channels in this data set; we used the first ECG channel for our study due to its resemblance to wearable device outputs. The data set was split into train and validation splits by patient ID. The train split of the Institution A data set contains 13,432 patients, 2,757,888 AF signal segments, and 3,014,334 Non-AF signal segments; the validation split contains 6,616 patients, 1,280,775 AF segments, and 1,505,119 Non-AF segments. Due to the automatic nature of bedside monitor–generated labels, the data set likely contains label noise. For a detailed description of the alarm labeling process and the additional preprocessing steps, please see our parallel work (Ding et al., 2024). For detailed demographic information on patients who participated in this study, please refer to Table A6. The data collection process followed strict guidelines, and the study was approved by the Institutional Review Board (IRB number: 14-13262). A waiver of informed patient consent was granted for this study, as the investigation solely involved the analysis of de-identified patient data.
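For concreteness, the following is a minimal sketch of the segmentation and downsampling described above (30-second nonoverlapping windows at 240 Hz, downsampled to 80 Hz); the invalid-sample check shown is a simplified stand-in for the full filtering criteria described in our parallel work.

```python
import numpy as np
from scipy.signal import resample

FS_RAW, FS_OUT, WIN_SEC = 240, 80, 30                   # sampling rates (Hz), window length (s)
RAW_LEN, OUT_LEN = FS_RAW * WIN_SEC, FS_OUT * WIN_SEC   # 7,200 and 2,400 timesteps

def segment_and_downsample(signal):
    """Slice a continuous recording into 30-s nonoverlapping windows, downsample to 80 Hz."""
    n_windows = len(signal) // RAW_LEN
    segments = []
    for i in range(n_windows):
        window = signal[i * RAW_LEN:(i + 1) * RAW_LEN]
        # Simplified invalid-sample filter: skip empty or NaN-containing windows.
        if np.all(window == 0) or np.any(np.isnan(window)):
            continue
        segments.append(resample(window, OUT_LEN))      # FFT-based resampling to 2,400 steps
    return np.stack(segments) if segments else np.empty((0, OUT_LEN))
```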
The Institution B (University of California Los Angeles Medical Center) data set contains 126 patients in hospital settings whose simultaneous continuous ECG and PPG signals were collected at Institution B. Patients ranged in age from 18 to 95 years and were admitted from April 2010 to March 2013. The continuous signals were sliced into 30-second nonoverlapping segments and again downsampled to an 80-Hz sampling rate, with 2,400 timesteps in each signal. The data set contains 38,910 AF and 220,740 Non-AF segments. Board-certified cardiac electrophysiologists annotated all AF episodes in the Institution B data set. Here, the PPG signals were collected from the fingertips. The use of this data set for the study was approved by the IRB (IRB approval number: 10-000545). A waiver of informed patient consent was obtained for this study.
The Simband data set (Shashikumar et al., 2017) contains 98 patients in ambulatory settings from Emory University Hospital (EUH), Emory University Hospital Midtown (EUHM), and Grady Memorial Hospital (GMH). Patients ranged in age from 18 to 89 years and were admitted from October 2015 to March 2016. The ECG signals were collected using Holter monitors, and the PPG signals were collected from a wrist-worn Samsung Simband. The signals used for testing were 30-second segments with 2,400 timesteps after preprocessing. This data set contains 348 AF segments and 506 Non-AF segments. The signals in this data set were reviewed and annotated by an ECG technician, a physician study coordinator, and a cardiologist. The data were collected with patient consent and IRB approval (IRB approval number: 00084629).
The Stanford data set (Torres-Soto & Ashley, 2020) contains 107 AF patients, 15 paroxysmal AF patients, and 41 healthy patients; the 41 healthy patients also underwent an exercise stress test. All signals in this data set were recorded in ambulatory settings. The ECG signals were collected from an ECG reference device, and the PPG signals were collected from a wrist-worn device. The signals were sliced into 25-second segments by the original authors. In this study, we extended each signal to 30 seconds by pasting its first 5 seconds to the end and resampled the signals to 2,400 timesteps. The data set contains 52,911 AF segments and 80,620 Non-AF segments. In the evaluations, we use the test split generated by the authors of the Stanford data set. The PPG signals in this data set were manually annotated and reviewed by several cardiologists against computerized reference ECG signals.
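The sketch below shows this 25-second-to-30-second extension: the first 5 seconds of each segment are wrapped around to its end, and the result is resampled to 2,400 timesteps. It is a minimal sketch of the stated transformation; the sampling-rate argument is an assumption for illustration.

```python
import numpy as np
from scipy.signal import resample

def extend_to_30s(segment, fs):
    """Extend a 25-s segment to 30 s by wrapping its first 5 s to the end, then
    resample to 2,400 timesteps (30 s at 80 Hz)."""
    padded = np.concatenate([segment, segment[: 5 * fs]])  # paste first 5 s at the end
    return resample(padded, 2400)
```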
In this section, we examine how SiamAF performs compared to other baseline methods (Section 5.1), and we investigate the learned latent space to empirically verify that the model learns the medically relevant, shared information between ECG and PPG signals (Section 5.2).
We verify our framework’s performance advantage by comparing it to three baseline methods: (1a) a ResNet-34 (He et al., 2016) classifier trained on only PPG data; (1b) a ResNet-34 classifier trained on only ECG data; (2) a deep mutual learning (DML; Zhang et al., 2018) network using ResNet-34 encoders; and (3) the public Stanford DeepBeat model (Torres-Soto & Ashley, 2020), which was trained on the Stanford data set. The ResNet-34 baselines represent the performance of previous deep learning AF detection studies, in which convolutional neural networks are a popular choice (Kiranyaz et al., 2016; Poh et al., 2018; Shashikumar et al., 2017; Tutuko et al., 2021; Voisin et al., 2019; J. Wang, 2020). The DML baseline is known for its ability to extract shared information from dual inputs through ‘mutual learning,’ which could be suitable for our intended use. The performance of the models is evaluated on the three external data sets introduced in Section 4. We do not compare against hand-crafted-feature-based AF classifiers because of their lackluster performance relative to deep learning methods, demonstrated in previous work (Torres-Soto & Ashley, 2020). We trained two ResNet-34 baselines, one each for the ECG and PPG signals. The DML baseline contains two separate models for the two signal modalities, and the Stanford DeepBeat baseline operates only on PPG signals. We use both AUROC (area under the receiver-operator-characteristic curve) and AUPRC (area under the precision-recall curve) as evaluation metrics, and we conducted bootstrapping tests with 1,000 bootstrap samples for statistical significance comparisons.
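As a sketch of this evaluation protocol, the snippet below computes 95% bootstrap confidence intervals for AUROC and AUPRC from 1,000 resamples, using average precision as the AUPRC estimate. The input arrays are assumed to hold reference labels and predicted AF probabilities; this is illustrative rather than the exact evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def bootstrap_metrics(y_true, y_score, n_boot=1000, seed=0):
    """Return 95% bootstrap CIs for AUROC and AUPRC as ((lo, hi), (lo, hi))."""
    rng = np.random.default_rng(seed)
    aurocs, auprcs = [], []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # resample indices with replacement
        if len(np.unique(y_true[idx])) < 2:      # both classes must be present
            continue
        aurocs.append(roc_auc_score(y_true[idx], y_score[idx]))
        auprcs.append(average_precision_score(y_true[idx], y_score[idx]))
    ci = lambda v: (np.percentile(v, 2.5), np.percentile(v, 97.5))
    return ci(aurocs), ci(auprcs)
```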
The evaluation results are shown in Figure 2. On the PPG test sets, the proposed method SiamAF performs comparably to or substantially better than the baseline methods; on the ECG test sets, our method outperforms the baseline methods by a significant margin. For detailed values and a description of the bootstrapping calculation, please refer to Section A.1 and Tables A7 and A8 in the Appendix.
Figure 2. The performance comparisons of all models on the test sets. The Stanford data set does not contain electrocardiogram (ECG) signals and thus cannot be used for testing ECG models. All models evaluated in this experiment were trained on the Institution A data set, except for the DeepBeat model, which was trained on the Stanford data set. In both plots, the horizontal axis is the AUPRC (area under the precision-recall curve) score, and the vertical axis is the AUROC (area under the receiver-operator-characteristic curve) score. The closer to the top right corner, the better. Here, confidence intervals are generally smaller than the dot sizes, so we provide a zoom-in of one of the dots to demonstrate the 95% CI. For exact CI and statistical significance test results, please refer to Section A.4 and Section A.2.
ECG and PPG signals contain similar information, including RR (heartbeat) intervals and peaks, which human experts use as important diagnostic indicators of AF. We designed our model based on these facts, encouraging it to learn and exploit the shared information between the two modalities for AF detection. Ideally, our model should behave like human experts, utilizing this shared information (e.g., RR intervals) for the AF detection task. To verify our design approach, we explore the learned representation of the encoder in our proposed framework through dimension reduction and visualization of the activations from layers in the network. For visualization purposes, we randomly subsampled 1% of the validation set and passed the samples through the learned ResNet-34 encoder in our proposed method.
Medically, signal peaks and peak intervals from the PPG and ECG signals are crucial information for AF detection. Naturally, in a robust model, peak information should likewise contribute substantially toward AF classification. For nonnoisy data, the peaks are time-synchronized between PPG and ECG signals. By using only the peaks from the PPG signal, we should capture essential information for AF detection that is comparable to the information obtained from the entire PPG signal; likewise, the model should behave similarly whether we retain only the R peaks of an ECG signal or the full ECG signal. Finally, because the PPG and ECG signals contain similar information, they will likely map to the same neighborhood in the latent space for AF detection.
We can investigate the expected behaviors through visualization of the latent space. To do so, we extract peak information from the PPG signals using the HeartPy toolkit (van Gent et al., 2019a, 2019b) and the R peaks in the ECG signals with NeuroKit2 (Makowski et al., 2021). We create new peak-only PPG signals that are set to 0 everywhere except at the peaks, and we do the same for ECG, preserving only the R peaks. Figure 3 shows an example of the peak detection and the peak-only signals. These peak-only signals allow us to feed the model only peak information while removing all other waveform information. We thus aim to explore two hypotheses: (1) whether the peak-only PPG signals map to the same location in latent space as full PPG signals (and similarly for ECG), and (2) whether the PPG and ECG signals from the same time period of the same patient map to nearby locations in latent space.
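The following is a minimal sketch of the peak-only signal construction using the HeartPy and NeuroKit2 calls named above; the 80-Hz sampling rate follows the data description, and the helper names are illustrative.

```python
import numpy as np
import heartpy as hp
import neurokit2 as nk

FS = 80  # 30-s segments with 2,400 timesteps

def ppg_peak_only(ppg):
    """Zero out everything except the detected PPG peaks."""
    working_data, _ = hp.process(ppg, sample_rate=FS)   # HeartPy peak detection
    peaks = np.asarray(working_data['peaklist'], dtype=int)
    out = np.zeros_like(ppg)
    out[peaks] = ppg[peaks]                              # keep signal values at peaks only
    return out

def ecg_peak_only(ecg):
    """Zero out everything except the detected ECG R peaks."""
    _, info = nk.ecg_peaks(ecg, sampling_rate=FS)        # NeuroKit2 R-peak detection
    peaks = np.asarray(info['ECG_R_Peaks'], dtype=int)
    out = np.zeros_like(ecg)
    out[peaks] = ecg[peaks]
    return out
```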
Figure 3. An example of the peak detection results on a pair of time-synchronized (simultaneous) 30-second electrocardiogram (ECG) and photoplethysmography (PPG) signals and the corresponding encoded peak sequences. Here, the red dots indicate the detected peaks.
We use PaCMAP (Y. Wang et al., 2021) to visualize the feature outputs of each pooling layer in SiamAF, shown in Figure 4; the corresponding feature outputs for the ECG-only and PPG-only ResNet-34 baselines and the DML baseline are visualized in Figure 5. At the later stages of the learned encoder in our proposed framework, both the PPG and ECG peak-only-signal features (labeled as “PPG Peaks” and “ECG R Peaks”) overlap almost perfectly with the original PPG and ECG signal features. This lends support to hypotheses (1) and (2) described in the last paragraph. In contrast, Figure 5 shows that for the ResNet-34 and DML approaches, the encoded peak-only features never overlap with the PPG and ECG features, even at the final stages of the models.
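A minimal sketch of this visualization step is shown below, assuming a matrix of pooled encoder activations and per-sample group labels; PaCMAP’s default two-dimensional embedding is used, and the plotting details are illustrative.

```python
import numpy as np
import pacmap
import matplotlib.pyplot as plt

def plot_latents(features, groups):
    """features: (n_samples, n_dims) activations from one pooling layer of the encoder;
    groups: per-sample labels such as 'PPG', 'ECG', 'PPG Peaks', 'ECG R Peaks'."""
    groups = np.asarray(groups)
    embedding = pacmap.PaCMAP(n_components=2).fit_transform(features)  # 2-D embedding
    for g in np.unique(groups):
        mask = groups == g
        plt.scatter(embedding[mask, 0], embedding[mask, 1], s=2, label=g)
    plt.legend(markerscale=4)
    plt.show()
```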
Figure 4. The five plots visualize latent features after each pooling layer of the ResNet-34 encoder used in our proposed framework; panels (a) through (e) each represent one stage. In panel (e), we also visualize the separation between AF (atrial fibrillation) and Non-AF classes in the same latent space. At the later stages of the learned encoder in our proposed framework, both the photoplethysmography (PPG) and electrocardiogram (ECG) peak-only-signal features overlap almost perfectly with the original PPG and ECG signal features.
Figure 5. Visualizations of latent features at the final average pooling layer of the encoders in the baseline models. Here, the encoded peak-only features never overlap with the photoplethysmography (PPG) and electrocardiogram (ECG) features.
We trained all models for 30 epochs. The ResNet-34 baseline models used the Adam optimizer with a learning rate of 0.0001; the proposed method was trained using the stochastic gradient descent (SGD) optimizer, with the learning rate set to 0.1 and momentum set to 0.9. For the DML baseline, the weight of the Kullback–Leibler-divergence loss was set to 0.9. For our proposed method, the weighting coefficient λ of the agreement loss was chosen based on validation performance; Figure A6 reports a sensitivity analysis over λ values from 0.2 to 1.2.
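For concreteness, the optimizer settings above correspond to the following PyTorch configuration (a minimal sketch; the model objects are placeholders, not the actual networks):

```python
import torch
import torch.nn as nn

baseline_model = nn.Linear(2400, 1)  # placeholder for a ResNet-34 baseline
siamaf_model = nn.Linear(2400, 1)    # placeholder for the proposed SiamAF network

# ResNet-34 baselines: Adam with a learning rate of 0.0001.
baseline_optimizer = torch.optim.Adam(baseline_model.parameters(), lr=1e-4)

# Proposed method: SGD with a learning rate of 0.1 and momentum of 0.9.
siamaf_optimizer = torch.optim.SGD(siamaf_model.parameters(), lr=0.1, momentum=0.9)
```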
This article provides a new framework to learn from simultaneous ECG and PPG signals for AF detection. The framework has several important benefits over other approaches: (1) it is more accurate for detecting AF; (2) its latent space reflects that it uses medically relevant information for AF detection (namely, the signal peaks); (3) it can detect AF from either ECG or PPG modalities using a single model rather than using separate models for ECG and PPG.
We make some observations beyond our direct results. First, it seems that learning from simultaneous PPG and ECG is better than learning from them asynchronously (or from either one alone). This can be seen because our method and the only baseline that uses simultaneous PPG/ECG (namely, DML) perform better than the other methods. The cleaner ECG signals could provide some regularization to the model in cases where PPG signals are noisy or conflict with the auto-generated labels due to signal noise.
Second, there is a large gap in performance between DML and our proposed approach. Since they both use simultaneous PPG/ECG, this gap is likely due to our proposed agreement loss, which directly encourages the encoder to learn coherent latent features between the two signal modalities. On the other hand, the DML setup relies on minimizing the KL divergence between the two prediction probability distributions, which is a weaker target that does not incentivize the encoders to focus on shared characteristics that exist in both the ECG and PPG signals. In fact, the two encoders in the two branches of DML could be vastly different and focus on less important features.
Third, the DeepBeat baseline performed worse on both the Simband and Institution B data sets compared to the other baseline methods; this indicates poor generalizability of the DeepBeat model to the fingertip-recorded PPG signals in the Institution B data set and to the out-of-distribution data in the Simband data set. That is, DeepBeat does not seem to generalize beyond the type of data it was trained on, suggesting that it is not focusing on the aspects of the signal that support generalization. It was trained on the Stanford training set and performs well on the Stanford test set, but not on data sets collected using different hardware and under different conditions. In contrast, our proposed method was trained on data collected in a hospital, yet attained comparable performance to DeepBeat on the Stanford test set, demonstrating strong generalizability to ambulatory data.
Fourth, we consider the question of how much nonpeak information might be valuable for AF detection. As we saw, other methods seem to be using information other than peaks to detect AF. And, as we also saw, that information does not seem to generalize across modalities (PPG/ECG) or across data collection modes (ambulatory vs. bedside hospital monitoring). Thus, while nonpeak waveform information may carry some additional signal, it does not appear to provide a reliable basis for AF detectors that must generalize across hardware and settings.
Lastly, we discuss the value of and need for labels compared to the value of the agreement loss. Large labeled data sets are expensive to obtain and often noisy, particularly for medical applications. Our approach performed as well as or better than some of the baselines even when using only 1% of the training labels; the AUROC and AUPRC scores of SiamAF trained with 1% of the labels are shown in Appendix Tables A7 and A8. Compared to fully supervised learning with noisy training labels, our proposed framework performed significantly better. These results suggest that our learning regime greatly reduces the reliance on unreliable and noisy labels for supervision; the unsupervised portion of the training loss function (the agreement loss) provides a large portion of the meaningful gradients for parameter updates and steers the learning algorithm toward the correct information.
There are several limitations to the data set used in this study. First, while our framework and model are designed for broader application scenarios, including both healthy individuals and hospitalized patients, the training set included only patients in hospital settings. This homogeneous population may limit model performance and generalizability. Although the model has demonstrated robustness and generalizability across various hardware platforms and patient demographics from multiple institutions, it may still retain unknown biases derived from hospital-specific data.
Second, the data collection and labeling process relied on bedside monitors, which can produce false alarms and noisy labels. Such label noise could introduce the inherent biases of the monitor algorithm into the model. Additionally, label noise may disproportionately affect ECG data compared to PPG. Future work should explore robust learning algorithms to address these challenges and improve model performance.
Lastly, we recognize the potential biases introduced during preprocessing. The root causes of missing data remain unclear, but the missingness may disproportionately impact certain patients or subpopulations due to their health conditions or behavioral patterns. This could further introduce bias into the model. Future studies should examine the patterns of missing data and their effects on different subpopulations to mitigate this limitation.
In the experiment in Section 5.2, we showed that the projections of the peak-only signals highly overlap with the ECG and PPG signal projections in the latent space. However, the overlap is not complete, so we hypothesize that the model still leverages other morphological features for AF classification. Future work should focus on developing more interpretable networks that can provide explicit explanations of the information used for classification. We also recognize that the visualization approach is not a formal verification of our hypothesis about the model’s learned knowledge, and any dimension reduction method could produce inaccurate mappings of the high-dimensional space. Future work should quantitatively evaluate the learned knowledge of the proposed model.
In future work, we may consider examining other heart conditions than AF. AF is a prominent heart condition, but many other tasks are worth investigating. It is possible that our framework will assist with those tasks as well. Future studies should also expand the signal modalities beyond ECG and PPG. In this study, ECG and PPG could be framed as augmented measurements of the original heartbeat, augmented through the human body and the collection process. Another direction for future work is data augmentation. As usual, if we knew the transformations of the data that are invariant to AF detection, we could build those into our data, yielding more robust AF detection.
In this work, we mainly focus on improving the performance of AF detection from ECG and PPG signals and validate the model’s performance on several diverse data sets. However, to implement this model in a real-world setting, more factors must be considered, including inference time, energy efficiency of the model, privacy and hardware compatibility, and so on. Most of the aforementioned issues are being studied, and many solutions exist to address these problems, such as mixed precision computing, model quantization, and privacy-focused federated learning. Though beyond the scope of this study, future work could incorporate these considerations and focus on deploying our model in real-world clinical settings.
In this study, we proposed a new approach for AF detection: SiamAF. The proposed approach leverages a Siamese network architecture and a novel loss function, simultaneously learning from both ECG and PPG signal modalities. By learning common features between the two modalities (namely, peaks), our proposed model achieves superior AF detection performance compared to previous studies. SiamAF elevates the predictive power of the PPG signals to a level comparable to that of ECG without compromising model usability and improves PPG AF detection robustness even on noisy wearable data. Our findings offer a promising avenue for improving AF detection and potentially benefit both diagnosed AF patients and those unaware of their condition.
The present work is partially supported by NIH grant R01HL166233 and NSF grant IIS-2130250.
Andersen, R. S., Peimankar, A., & Puthusserypady, S. (2019). A deep learning approach for real-time detection of atrial fibrillation. Expert Systems with Applications, 115, 465–473. https://doi.org/10.1016/j.eswa.2018.08.011
Anwar, S. M., Majid, M., Qayyum, A., Awais, M., Alnowami, M., & Khan, M. K. (2018). Medical image analysis using convolutional neural networks: A review. Journal of Medical Systems, 42(11), Article 226. https://doi.org/10.1007/s10916-018-1088-1
Beede, E., Baylor, E., Hersch, F., Iurchenko, A., Wilcox, L., Ruamviboonsuk, P., & Vardoulakis, L. M. (2020). A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Paper 589). Association for Computing Machinery. https://doi.org/10.1145/3313831.3376718
Centers for Disease Control and Prevention, National Center for Health Statistics. (n.d.). About multiple cause of death, 1999–2019. CDC WONDER Online Database website. Retrieved February 1, 2021, from https://wonder.cdc.gov/mcd-icd10.html
Chen, C., Hua, Z., Zhang, R., Liu, G., & Wen, W. (2020). Automated arrhythmia classification based on a combination network of CNN and LSTM. Biomedical Signal Processing and Control, 57, Article 101819. https://doi.org/10.1016/j.bspc.2019.101819
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In H. Daumé III, & A. Singh (Eds.), Proceedings of the 37th International Conference on Machine Learning (pp. 1597–1607). Proceedings of Machine Learning Research. https://doi.org/10.5555/3524938.3525087
Chen, X., & He, K. (2021). Exploring simple Siamese representation learning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15745–15753. IEEE. https://doi.org/10.1109/CVPR46437.2021.01549
Colilla, S., Crow, A., Petkun, W., Singer, D. E., Simon, T., & Liu, X. (2013). Estimates of current and future incidence and prevalence of atrial fibrillation in the U.S. adult population. American Journal of Cardiology, 112(8), 1142–1147. https://doi.org/10.1016/j.amjcard.2013.05.063
Corino, V. D. A., Laureanti, R., Ferranti, L., Scarpini, G., Lombardi, F., & Mainardi, L. T. (2017). Detection of atrial fibrillation episodes using a wristband device. Physiological Measurement, 38(5), Article 787. https://doi.org/10.1088/1361-6579/aa5dd7
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics, 44(3), 837–845. https://doi.org/10.2307/2531595
Ding, C., Guo, Z. Rudin, C., Xiao, R., Shah, A., Do, D. H., Lee, R. J., Clifford, G., Nahab, F. B., & Hu, X. (2024). Learning from alarms: A robust learning approach for accurate photoplethysmography-based atrial fibrillation detection using eight million samples labeled with imprecise arrhythmia alarms. IEEE Journal of Biomedical and Health Informatics, 28(5), 2650–2661. https://doi.org/10.1109/JBHI.2024.3360952
Eerikainen, L. M., Bonomi, A. G., Schipper, F., Dekker, L. R. C., de Morree, H. M., Vullings, R., & Aarts, R. M. (2020). Detecting atrial fibrillation and atrial flutter in daily life using photoplethysmography data. IEEE Journal of Biomedical and Health Informatics, 24(6), 1610–1618. https://doi.org/10.1109/JBHI.2019.2950574
Fan, X., Yao, Q., Cai, Y., Miao, F., Sun, F., & Li, Y. (2018). Multiscaled fusion of deep convolutional neural networks for screening atrial fibrillation from single lead short ECG recordings. IEEE Journal of Biomedical and Health Informatics, 22(6), 1744–1753. https://doi.org/10.1109/jbhi.2018.2858789
Gotlibovych, I., Crawford, S., Goyal, D., Liu, J., Kerem, Y., Benaron, D., Yilmaz, D., Marcus, G., & Li, Y. (2018). End-to-end deep learning from raw sensor data: atrial fibrillation detection using wearables. KDD 2018 Deep Learning Day. https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_21.pdf
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., Piot, B., Kavukcuoglu, K., Munos, R., & Valko, M. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 21271–21284. https://proceedings.neurips.cc/paper_files/paper/2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf
Guo, Z., Ding, C., Hu, X., & Rudin, C. (2021). A supervised machine learning semantic segmentation approach for detecting artifacts in plethysmography signals from wearables. Physiological Measurement, 42(12), Article 125003. https://doi.org/10.1088/1361-6579/ac3b3d
Hannun, A. Y., Rajpurkar, P., Haghpanahi, M., Tison, G. H., Bourn, C., Turakhia, M. P., & Ng, A. Y. (2019). Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine, 25(1), 65–69. https://doi.org/10.1038/s41591-018-0268-3
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90
Kim, M.-G., & Pan, S. B. (2019). Deep learning based on 1-D ensemble networks using ECG for real-time user recognition. IEEE Transactions on Industrial Informatics, 15(10), 5656–5663. https://doi.org/10.1109/TII.2019.2909730
Kiranyaz, S., Ince, T., & Gabbouj, M. (2016). Real-time patient-specific ECG classification by 1-D convolutional neural networks. IEEE Transactions on Biomedical Engineering, 63(3), 664–675. https://doi.org/10.1109/TBME.2015.2468589
Krivoshei, L., Weber, S., Burkard, T., Maseli, A., Brasier, N., Kühne, M., Conen, D., Huebner, T., Seeck, A., & Eckstein, J. (2017). Smart detection of atrial fibrillation. EP Europace, 19(5), 753–757. https://doi.org/10.1093/europace/euw125
Kumar, M., Pachori, R. B., & Rajendra Acharya, U. (2018). Automated diagnosis of atrial fibrillation ECG signals using entropy features extracted from flexible analytic wavelet transform. Biocybernetics and Biomedical Engineering, 38(3), 564–573. https://doi.org/10.1016/j.bbe.2018.04.004
Lippi, G., Sanchis-Gomar, F., & Cervellin, G. (2021). Global epidemiology of atrial fibrillation: An increasing epidemic and public health challenge. International Journal of Stroke, 16(2), 217–221. https://doi.org/10.1177/1747493019897870
Makowski, D., Pham, T., Lau, Z. J., Brammer, J. C., Lespinasse, F., Pham, H., Schölzel, C., & Chen, S. H. A. (2021). NeuroKit2: A Python toolbox for neurophysiological signal processing. Behavior Research Methods, 53(4), 1689–1696. https://doi.org/10.3758/s13428-020-01516-y
Martis, R. J., Acharya, U. R., Adeli, H., Prasad, H., Tan, J. H., Chua, K. C., Too, C. L., Yeo, S. W. J., & Tong, L. (2014). Computer aided diagnosis of atrial arrhythmia using dimensionality reduction methods on transform domain representation. Biomedical Signal Processing and Control, 13, 295–305. https://doi.org/10.1016/j.bspc.2014.04.001
Oh, S. L., Ng, E. Y., Tan, R. S., & Acharya, U. R. (2018). Automated diagnosis of arrhythmia using combination of CNN and LSTM techniques with variable length heart beats. Computers in Biology and Medicine, 102, 278–287. https://doi.org/10.1016/j.compbiomed.2018.06.002
Poh, M.-Z., Poh, Y. C., Chan, P.-H., Wong, C.-K., Pun, L., Leung, W. W.-C., Wong, Y.-F., Wong, M. M.-Y., Chu, D. W.-S., & Siu, C.-W. (2018). Diagnostic assessment of a deep learning system for detecting atrial fibrillation in pulse waveforms. Heart, 104(23), 1921–1928. https://doi.org/10.1136/heartjnl-2018-313147
Pourbabaee, B., Roshtkhari, M. J., & Khorasani, K. (2018). Deep convolutional neural networks and learning ECG features for screening paroxysmal atrial fibrillation patients. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 48(12), 2095–2104. https://doi.org/10.1109/TSMC.2017.2705582
Qayyum, A., Meriaudeau, F., & Chan, G. C. Y. (2018). Classification of atrial fibrillation with pre-trained convolutional neural network models. In Proceedings of the 2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (pp. 594–599). IEEE. https://doi.org/10.1109/IECBES.2018.8626624
Salih, M., Abdel-Hafez, O., Ibrahim, R., & Nair, R. (2021). Atrial fibrillation in the elderly population: Challenges and management considerations. Journal of Arrhythmia, 37(4), 912–921. https://doi.org/10.1002/joa3.12580
Sannino, G., & De Pietro, G. (2018). A deep learning approach for ECG-based heartbeat classification for arrhythmia detection. Future Generation Computer Systems, 86, 446–455. https://doi.org/10.1016/j.future.2018.03.057
Sañudo, B., De Hoyo, M., Muñoz-López, A., Perry, J., & Abt, G. (2019). Pilot study assessing the influence of skin type on the heart rate measurements obtained by photoplethysmography with the apple watch. Journal of Medical Systems, 43(7), Article 195. https://doi.org/10.1007/s10916-019-1325-2
Shan, S.-M., Tang, S.-C., Huang, P.-W., Lin, Y.-M., Huang, W.-H., Lai, D.-M., & Wu, A.-Y. A. (2016). Reliable PPG-based algorithm in atrial fibrillation detection. In Proceedings of the 2016 IEEE Biomedical Circuits and Systems Conference (pp. 340–343). IEEE. https://doi.org/10.1109/BioCAS.2016.7833801
Shashikumar, S. P., Shah, A. J., Li, Q., Clifford, G. D., & Nemati, S. (2017). A deep learning approach to monitoring and detecting atrial fibrillation using wearable technology. In Proceedings of the 2017 IEEE EMBS International Conference on Biomedical & Health Informatics (pp. 141–144). IEEE. https://doi.org/10.1109/BHI.2017.7897225
Sun, X., & Xu, W. (2014). Fast implementation of DeLong’s algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Processing Letters, 21(11), 1389–1393. https://doi.org/10.1109/LSP.2014.2337313
Tang, S.-C., Huang, P.-W., Hung, C.-S., Shan, S.-M., Lin, Y.-H., Shieh, J.-S., Lai, D.-M., Wu, A.-Y., & Jeng, J.-S. (2017). Identification of atrial fibrillation by quantitative analyses of fingertip photoplethysmogram. Scientific Reports, 7, Article 45644. https://doi.org/10.1038/srep45644
Tison, G. H., Sanchez, J. M., Ballinger, B., Singh, A., Olgin, J. E., Pletcher, M. J., Vittinghoff, E., Lee, E. S., Fan, S. M., Gladstone, R. A., Mikell, C., Sohoni, N., Hsieh, J., & Marcus, G. M. (2018). Passive detection of atrial fibrillation using a commercially available smartwatch. JAMA Cardiology, 3(5), 409–416. https://doi.org/10.1001/jamacardio.2018.0136
Torres-Soto, J., & Ashley, E. A. (2020). Multi-task deep learning for cardiac rhythm detection in wearable devices. NPJ Digital Medicine, 3, Article 116. https://doi.org/10.1038/s41746-020-00320-4
Tutuko, B., Nurmaini, S., Tondas, A. E., Rachmatullah, M. N., Darmawahyuni, A., Esafri, R., Firdaus, F., & Sapitri, A. I. (2021). AFibNet: An implementation of atrial fibrillation detection with convolutional neural network. BMC Medical Informatics and Decision Making, 21(1), Article 216. https://doi.org/10.1186/s12911-021-01571-1
van Gent, P., Farah, H., van Nes, N., & van Arem, B. (2019a). Analysing noisy driver physiology real-time using off-the-shelf sensors: Heart rate analysis software from the Taking the Fast Lane project. Journal of Open Research Software, 7(1), Article 32. https://doi.org/10.5334/jors.241
van Gent, P., Farah, H., van Nes, N., & van Arem, B. (2019b). HeartPy: A novel heart rate algorithm for the analysis of noisy signals. Transportation Research Part F: Traffic Psychology and Behaviour, 66, 368–378. https://doi.org/10.1016/j.trf.2019.09.015
Voisin, M., Shen, Y., Aliamiri, A., Avati, A., Hannun, A., & Ng, A. (2019). Ambulatory atrial fibrillation monitoring using wearable photoplethysmography with deep learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1909–1916). Association for Computing Machinery. https://doi.org/10.1145/3292500.3330657
Wang, J. (2020). A deep learning approach for atrial fibrillation signals classification based on convolutional and modified Elman neural network. Future Generation Computer Systems, 102, 670–679. https://doi.org/10.1016/j.future.2019.09.012
Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021). Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMap, and PaCMAP for data visualization. Journal of Machine Learning Research, 22, Article 201. http://jmlr.org/papers/v22/20-1061.html
Xia, Y., Wulan, N., Wang, K., & Zhang, H. (2018). Detecting atrial fibrillation by deep convolutional neural networks. Computers in Biology and Medicine, 93, 84–92. https://doi.org/10.1016/j.compbiomed.2017.12.007
Ying, X. (2019). An overview of overfitting and its solutions. Journal of Physics: Conference Series, 1168(2), Article 022022. https://doi.org/10.1088/1742-6596/1168/2/022022
Zech, J. R., Badgeley, M. A., Liu, M., Costa, A. B., Titano, J. J., & Oermann, E. K. (2018). Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine, 15(11), Article e1002683. https://doi.org/10.1371/journal.pmed.1002683
Zhang, Y., Xiang, T., Hospedales, T. M., & Lu, H. (2018). Deep mutual learning. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (pp. 4320–4328). IEEE. https://doi.org/10.1109/CVPR.2018.00454
We conducted bootstrapping tests with 1,000 bootstrap samples to compute the 95% confidence intervals of the AUROC and AUPRC scores reported in Tables A7 and A8.
This section contains the p values of the pairwise statistical significance tests between the tested models. For comparing the AUROC scores of any two models, we used the pairwise DeLong test (DeLong et al., 1988) and the Python FastDelong package (Sun & Xu, 2014). For comparing the AUPRC scores of any two models, we used a pairwise t test. The significance threshold was adjusted for multiple comparisons; the results are summarized in Tables A1 through A5.
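As an illustration of the AUPRC comparison step, the sketch below applies a paired t test to per-resample bootstrap AUPRC scores for two models and checks significance against a Bonferroni-adjusted threshold; the inputs and the specific adjustment shown are assumptions for illustration, not a restatement of the exact analysis.

```python
from scipy.stats import ttest_rel

def compare_auprc(boot_scores_a, boot_scores_b, n_comparisons, alpha=0.05):
    """Paired t test on bootstrap AUPRC samples from two models,
    using a Bonferroni-adjusted significance threshold."""
    t_stat, p_value = ttest_rel(boot_scores_a, boot_scores_b)
    return p_value, p_value < alpha / n_comparisons
```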
Tables A1–A3. Pairwise statistical significance test results (p values) comparing SiamAF with ResNet-34 (PPG), DML, and DeepBeat on the three PPG test sets (Simband, Institution B, and the Stanford test set).
Tables A4–A5. Pairwise statistical significance test results (p values) comparing SiamAF with ResNet-34 (ECG) and DML on the two ECG test sets (Simband and Institution B).
We display the Institution A participant demographic information in detail in Table A6.
Table A6. Institution A participant demographic information.

| Demographic category | Patient counts | Percentage |
|---|---|---|
| Gender | | |
| Male | 15,330 | 53.7% |
| Female | 13,203 | 46.2% |
| Others | 6 | 0% |
| Age | | |
| <22 years | 3,925 | 13.8% |
| 22–39 years | 2,715 | 9.5% |
| 40–54 years | 4,372 | 15.3% |
| 55–64 years | 5,370 | 18.8% |
| ≥65 years | 12,157 | 42.6% |
| Race | | |
| White or Caucasian | 15,890 | 55.7% |
| Black or African American | 2,159 | 7.4% |
| Asian | 4,364 | 15.0% |
| Native Hawaiian or Other Pacific Islander | 426 | 1.46% |
| American Indian or Alaska Native | 212 | 0.7% |
| Unknown/Declined | 1,149 | 4% |
| Others | 4,913 | 16.9% |
Section A.4 includes the detailed evaluation results.
Table A7. AUROC scores [95% CI] of all models on the test sets.

| Data set | ResNet-34 (PPG) | ResNet-34 (ECG) | DML | DeepBeat | SiamAF | SiamAF (1% training labels) |
|---|---|---|---|---|---|---|
| Simband (ECG) | N/A | 0.724 [0.722, 0.727] | 0.721 [0.718, 0.723] | N/A | 0.747 [0.744, 0.750] | 0.729 [0.726, 0.732] |
| Simband (PPG) | 0.879 [0.878, 0.881] | N/A | 0.891 [0.889, 0.892] | 0.870 [0.868, 0.871] | 0.914 [0.913, 0.916] | 0.900 [0.898, 0.901] |
| Institution B (ECG) | N/A | 0.890 [0.887, 0.893] | 0.905 [0.902, 0.908] | N/A | 0.927 [0.925, 0.929] | 0.899 [0.897, 0.902] |
| Institution B (PPG) | 0.918 [0.916, 0.920] | N/A | 0.920 [0.918, 0.922] | 0.872 [0.870, 0.875] | 0.924 [0.922, 0.925] | 0.907 [0.905, 0.909] |
| Stanford test set | 0.763 [0.761, 0.764] | N/A | 0.764 [0.763, 0.766] | 0.883 [0.882, 0.884] | 0.877 [0.876, 0.878] | 0.800 [0.799, 0.801] |
Table A8. AUPRC scores [95% CI] of all models on the test sets.

| Data set | ResNet-34 (PPG) | ResNet-34 (ECG) | DML | DeepBeat | SiamAF | SiamAF (1% training labels) |
|---|---|---|---|---|---|---|
| Simband (ECG) | N/A | 0.621 [0.617, 0.625] | 0.675 [0.671, 0.679] | N/A | 0.732 [0.729, 0.736] | 0.661 [0.658, 0.665] |
| Simband (PPG) | 0.841 [0.838, 0.843] | N/A | 0.847 [0.844, 0.845] | 0.799 [0.796, 0.803] | 0.865 [0.863, 0.868] | 0.841 [0.838, 0.844] |
| Institution B (ECG) | N/A | 0.726 [0.720, 0.732] | 0.749 [0.743, 0.755] | N/A | 0.765 [0.759, 0.772] | 0.736 [0.731, 0.742] |
| Institution B (PPG) | 0.768 [0.762, 0.774] | N/A | 0.778 [0.772, 0.784] | 0.607 [0.601, 0.614] | 0.773 [0.768, 0.779] | 0.744 [0.739, 0.749] |
| Stanford test set | 0.582 [0.577, 0.586] | N/A | 0.613 [0.609, 0.617] | 0.726 [0.723, 0.730] | 0.726 [0.722, 0.729] | 0.565 [0.560, 0.569] |
Figure A6. The test set performance of our proposed model with different λ value settings ranging from 0.2 to 1.2 with 0.2 increments. AUROC = area under the receiver-operator-characteristic curve; AUPRC = area under the precision-recall curve; ECG = electrocardiogram; PPG = photoplethysmography.
All development code for this study is available publicly at https://github.com/chengstark/SiamAF. The retrospective use of the Institution A data set in this study is conducted according to the terms of a data use agreement between UCSF and Emory University. As a result, the data set is not available for public use without additional institutional approvals. Similarly, the Institution B data set is not available for public use. To obtain access to the private data sets utilized in this study, please contact the corresponding author. Requests will undergo evaluation in accordance with the data use agreement established between institutions and may necessitate escalation to the relevant institutional data governance committee(s). The remaining external data sets used for evaluation in this study are publicly available.
©2025 Zhicheng Guo, Cheng Ding, Duc Do, Amit Shah, Randall J. Lee, Xiao Hu, and Cynthia Rudin. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.