Smartwatches and other wearable devices are equipped with photoplethysmography (PPG) sensors for monitoring heart rate and other aspects of cardiovascular health. However, PPG signals collected from such devices are susceptible to corruption from noise and motion artifacts, resulting in inaccuracies during heart rate estimation. Conventional denoising methods filter or reconstruct signals in ways that eliminate morphological information, even from the clean segments of the signal that should ideally be preserved. In this work, we develop an algorithm for denoising PPG signals that reconstructs the corrupted parts of the signal, while preserving the clean parts of the PPG signal. Our novel framework relies on self-supervised training, where we leverage a large database of clean PPG signals to train a denoising autoencoder. As we show, our reconstructed signals provide better estimates of heart rate from PPG signals than the leading heart rate estimation methods. Further experiments show improvement in heart rate variability (HRV) estimation from PPG signals using our algorithm. We conclude that our algorithm denoises PPG signals in a way that can improve downstream analysis of health metrics from wearable devices.
Keywords: unsupervised learning, denoising algorithm, heart rate detection, wearable medical devices, physiological signals
Accurate cardiovascular health metrics from wearables can facilitate early diagnosis of cardiac conditions, guide exercise recovery, and enable personalized health monitoring. Smartwatches and fitness trackers use a technique called photoplethysmography, or PPG, to measure heart rate and other health signals. These devices emit light onto skin and measure changes in light absorption in the bloodstream. However, PPG signals are often distorted by motion-related noise, giving inaccurate heart rate readings. It is essential to improve the reliability of photoplethysmography to make health tracking more accurate.
In a new study from Duke and Emory Universities, the authors develop a method called SPEAR (Self-supervised PPG Erase Artifacts and Reconstruct) that denoises PPG signals recorded from wrist-worn devices, generating clean signals that can be used for reliable estimation of heart rate. Their approach uses a large public database of clean PPG recordings to train a generative machine learning algorithm. The algorithm identifies noisy regions in a signal, removes them, and reconstructs the corrupted regions to produce a clean signal.
Experiments show that SPEAR improves heart rate accuracy over existing denoising techniques when tested on PPG data from different subjects performing a variety of daily activities. Beyond heart rate, the denoised signals also provided better measurements of heart rate variability, a metric that tracks health and fitness.
SPEAR’s denoising algorithm thus could make smartwatch-based health tracking more reliable.
Photoplethysmography (PPG) is a noninvasive optical measurement technique that provides vital information about the cardiovascular system. A PPG-enabled device consists of an optical sensor that measures volumetric variations of blood circulation as a PPG signal. Modern PPG-enabled devices include a variety of technologies such as fingertip-based pulse oximeters, forehead- and earlobe-based PPG sensors, and most commonly, wrist-worn smartwatches (Castaneda et al., 2018). PPG monitoring can enable early detection of serious heart conditions that otherwise might go undetected (Allen et al., 2006; Pereira et al., 2020). A key application of PPG in wearable devices is the estimation of heart rate (HR) (Almarshad et al., 2022).
PPG is limited by its susceptibility to noise artifacts, including motion artifacts (MA) caused by body movements and artifacts arising from environmental factors like ambient light, sweat, and pressure (Y. Zhang et al., 2020). In order to ensure accuracy of HR estimates and robust diagnosis of medical conditions, it is essential to mitigate such artifacts. Methods that address this limitation for prediction of HR from PPG signals can be broadly categorized into two types. The first type estimates HR directly from the signals despite the presence of artifacts (Biswas et al., 2019; Panwar et al., 2020; Reiss et al., 2019; Shyam et al., 2019; Temko, 2017). The second type of method attempts to extract, denoise, or reconstruct a clean signal from the noise-corrupted signal (Chang et al., 2021; Galli et al., 2018; Kasambe & Rathod, 2015; Wu et al., 2017). These methods output a clean signal that can potentially be used for multiple downstream tasks, including HR estimation, which makes them more generally useful. These approaches reconstruct the entire PPG signal, even if most of the signal may already be artifact-free. This may potentially distort the original signal and cause loss of morphological information even in the useful parts of the signal. Ideally, we would like to have a method that denoises only the noisy part of the signal—preserving the valuable information in the uncorrupted part—and provides a clean signal that can be used for accurate HR estimation and for other downstream tasks. That is the focus of the present work.
In this article, we present a novel method for reconstructing clean PPG signals from noisy signals. It preserves the useful segments of the PPG signals that are uncorrupted, and only reconstructs the corrupted sections. This is achieved by decoupling the tasks of artifact detection and removal. We apply an artifact-detection algorithm to remove artifacts from the signal, and then use a denoising autoencoder to reconstruct the signal only in the regions where artifacts were removed. The denoised signal is then used for HR estimation using band-pass filtering and peak detection. This way, our reconstructed signals are more faithful to the truth and more useful for downstream tasks.
An interesting aspect of the proposed approach is the way it leverages publicly available data. This study relies on two sufficiently large and complex public PPG data sets, PPG-DaLiA (Reiss et al., 2019) and the Stanford data set (Torres-Soto & Ashley, 2020). PPG-DaLiA records subjects in a wide variety of settings (e.g., walking, driving, eating, cycling). It is a high-quality data set that contains an external means of extracting ground-truth HR using electrocardiogram (ECG) signals that are simultaneously recorded and are relatively free from noise. On the other hand, the Stanford data set is much larger, but has only PPG signals and no external way to assess ground-truth HR. Deciding how to best leverage these data sets was a challenge. We chose to use PPG-DaLiA only for out-of-sample testing purposes, since it has ECG for ground truth and thus can provide an honest assessment of HR. We extract and leverage the clean signals from within the Stanford data set to devise a self-supervised training methodology that is able to reconstruct realistic clean signals.
The proposed method is called SPEAR—Self-supervised PPG Erase Artifacts and Reconstruct—and it is a novel algorithm for denoising PPG signals. SPEAR learns to denoise PPG signals using the following training and evaluation paradigm, outlined in Figure 1: (1) Removal of segments with artifacts using an artifact-detection algorithm, leaving only the clean signals, (2) Erasing random parts of the clean signals, and (3) Training a denoising autoencoder to reconstruct these erased parts of the clean signals. The signal is reconstructed in such a way that only the locations that have been erased are reconstructed, and the rest of the signal is unchanged. In this way, given a new noisy signal for testing, our method would (1) apply the artifact-detector, (2) erase the artifacts, and (3) reconstruct the missing pieces using the trained denoising autoencoder to form a clean signal that can be used for downstream tasks. Since it has learned to reconstruct from clean PPG signals in training, it will reconstruct clean signals during testing. We estimate HR using band-pass filtering and peak detection; this type of basic method works precisely because the PPG signal is now clean. The effectiveness of step (2) of the training process depends on how correlated the distortion mechanism is with the mechanism that generates the real signal. As an extreme example, if artifacts happen whenever atrial fibrillation (aFib) happens, distorting the entire aFib signal, the detection of aFib would be rendered infeasible. The precise formulation of the weakest assumption for this correlation to render mathematically provable efficiency of our method is a topic for future research. Our empirical evidence suggests that our methods work well since the signals are not excessively corrupted by external factors, and since artifacts are typically shorter in time-scale than the signals we want to study.
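The erase-and-reconstruct idea above can be illustrated with a minimal sketch in plain Python. The function names, the number and length of erased spans, and the fixed seed are illustrative assumptions, not the authors' implementation; the only detail taken from the text is that erased samples are set to 0 and that samples outside the erased regions are left unchanged.

```python
import random

def erase_random_spans(signal, n_spans=2, max_span=5, seed=0):
    """Return (erased_signal, mask); mask[i] == 1 where a sample was erased.

    n_spans, max_span, and seed are hypothetical parameters chosen for
    illustration. max_span must not exceed len(signal).
    """
    rng = random.Random(seed)
    mask = [0] * len(signal)
    for _ in range(n_spans):
        length = rng.randint(1, max_span)
        start = rng.randint(0, len(signal) - length)
        for i in range(start, start + length):
            mask[i] = 1
    # Erased samples are set to 0, as described for the artifact regions.
    erased = [0.0 if m else x for x, m in zip(signal, mask)]
    return erased, mask

def composite(original, reconstructed, mask):
    """Use the reconstruction only where mask == 1; keep the original elsewhere."""
    return [r if m else x for x, r, m in zip(original, reconstructed, mask)]
```

During training, the pair (erased signal, clean signal) would serve as input and target; at test time the mask would come from the artifact detector rather than from random sampling.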
The experimental results reveal that traditional signal processing techniques generally achieve limited efficacy in heart rate estimation, and that supervised deep learning methods show better estimation accuracy on data sets they are trained on, with diminished generalizability to other data sets. SPEAR does not exhibit these limitations. Its performance on the PPG-DaLiA test set is comparable to that of deep learning methods trained on the same distribution, despite SPEAR being trained on the Stanford data set. On a hold-out test set from the Stanford data set, SPEAR outperforms all other methods. Most importantly, the fact that SPEAR produces clean, continuous PPG signals allows the results to be used for downstream tasks beyond heart rate estimation. This study also investigates heart rate variability (HRV) estimation as a downstream task, revealing that the accuracy of HRV estimates benefits from the utilization of the denoised PPG signals produced by SPEAR.
Signal Quality and Artifact Detection Techniques. Various studies focus on assessing the quality of the PPG signal. Lin et al. (2019) propose a statistical approach that computes five key characteristics of the signal to determine signal quality and reject noisy outliers. Another approach (Goh et al., 2020) divides the signal into sliding windows and uses Convolutional Neural Networks (CNNs) to classify whether each window contains an artifact. A study by Guo et al. (2021) approaches this task as a 1D segmentation problem and uses a convolutional network to classify noise-corrupted regions within a signal. This allows for detection of noise artifacts on a higher resolution. These approaches only detect the presence of noise and do not provide further steps on mitigating the artifacts for HR analysis. We utilize the Segade model (Guo et al., 2021) as a preprocessing step in SPEAR.
Artifact Reduction Techniques. Signal processing methods have been used to reduce artifacts in PPG signals. Specifically, discrete wavelet transforms (Kasambe & Rathod, 2015), adaptive filtering (Comtois et al., 2007; Pan et al., 2016; Wu et al., 2017) and independent component analysis (ICA) (Peng et al., 2014) have been used to perform signal denoising. Salehizadeh et al. (2016) perform sliding window-based signal denoising using spectral filtering. A recent study (Bradley & Kyriacou, 2024) proposed a nonfiltering signal processing approach that uses anomaly detection to find segmentation points in the signal and remove noise artifacts. A limitation of signal processing approaches is that their performance is dependent on heuristic thresholds and parameters. Reiss et al. (2019) demonstrate that state-of-the-art signal processing techniques perform poorly on a larger and more comprehensive data set (PPG DaLiA, Reiss et al., 2019) compared to the smaller TROIKA data set (Z. Zhang et al., 2015) that they have been tested on.
Recent works have also introduced deep learning–based approaches for this problem. Lee et al. (2019) use a bidirectional recurrent autoencoder for PPG denoising trained on hand-picked clean PPG signals. DeepHeart (Chang et al., 2021) uses a denoising convolutional network followed by spectrum analysis–based calibration to perform HR estimation. In this approach, signal reconstruction is performed for small, overlapping time windows; as a result, the reconstructed clean signals cannot easily be joined together to get a continuous long signal. The survey paper by Mishra and Nirala (2020) on PPG denoising techniques concludes that deep learning approaches perform better for denoising signals that are affected by motion artifacts, as compared to signal-processing approaches.
A limitation of existing noise reduction approaches is that they cannot discern when the degree of noise corruption is severe and may produce unexpected results if large parts of the signal are completely lost to motion artifacts (Park et al., 2022). An approach that classifies the signal into clean/noisy segments and selectively analyzes the classified sections can mitigate this issue (Park et al., 2022). Another challenge with denoising is the availability of noisy-clean signal pairs in the data, which are required for supervised learning. Such pairs are hard to obtain with PPG signals because it is not possible to record clean and noisy signals synchronously while performing certain activities. Workarounds are typically used to overcome this challenge, such as generating fake noisy signals by adding simulated noise to clean signals (Lee et al., 2019). We will discuss how we overcome this challenge using self-supervised learning in Section 3.
Direct Heart Rate Estimation Without Denoising. A category of methods focuses on the task of estimating heart rate directly from noisy PPG signals, without attempting to reconstruct or denoise the signal. Signal processing techniques including Wiener filtering (Temko, 2017), least-mean-squares adaptive filtering (Schäck et al., 2015), and TROIKA (Z. Zhang et al., 2015) utilize accelerometer data and analyze the signals in the frequency domain. Deep learning has also been utilized for HR estimation, most commonly as a supervised learning task where ground-truth HR labels are obtained from synchronous electrocardiogram (ECG) signals. DeepPPG (Reiss et al., 2019) and PPGnet (Shyam et al., 2019) use deep convolutional networks to predict heart rate from noisy PPG signals. CorNET (Biswas et al., 2019) uses a combination of CNNs and long short-term memory (LSTM) networks to predict HR from single-channel PPG signals for patient-specific models. PP-Net (Panwar et al., 2020) also uses CNNs and LSTMs for HR estimation using single-channel PPG. These approaches tend to outperform the denoising approaches for HR estimation, but do not output a denoised signal that can be utilized in downstream analysis.
Heart Rate Variability (HRV) Estimation from PPG. HRV measures the fluctuation in the time intervals between adjacent heartbeats (Shaffer & Ginsberg, 2017). HRV is used to investigate the sympathetic and parasympathetic function of the autonomic nervous system (Shaffer & Ginsberg, 2017) and has many important applications, including predicting risk of stroke (Tsuji et al., 1996), detecting arrhythmia (Tsipouras & Fotiadis, 2004), and guiding training for athletes (Singh et al., 2018). Studies that use PPG signals to estimate HRV focus only on clean signals obtained from subjects at rest (Lu et al., 2008). A recent study showed poor performance of HRV estimation from PPG signals under free-living conditions (Lam et al., 2020). Denoising PPG signals can improve the accuracy of HRV monitoring in real-world conditions, as we will show.
Denoising Autoencoders. Autoencoders are networks that learn to reconstruct their inputs from a latent representation. An autoencoder takes as input a vector
We propose a self-supervised training approach that requires only a sufficiently large collection of clean PPG signals. Unlike other approaches (Biswas et al., 2019; Chang et al., 2021; Panwar et al., 2020; Reiss et al., 2019; Shyam et al., 2019), it does not require synchronous ECG measurements. Using a self-supervised training approach solves the challenge of unavailable noisy-clean training pairs required for supervised learning. This is because the model is trained to reconstruct clean signals from signals where the noise has been erased; it requires no information about the noise artifacts except for where they are located.
In this subsection, we outline the methodology of training SPEAR’s specialized denoising autoencoder. Figure 2 summarizes the training procedure. The process for denoising a new signal will be discussed afterwards and in Figure 3.
The first step in preparing the training data is selecting clean PPG signals. A noise detection model is used to detect the occurrence of artifacts in the signal; any signal determined to have no corrupted regions is deemed clean. We use the Segade model (Guo et al., 2021) for this purpose. Given a 30-second PPG signal as input, Segade predicts the regions within the signal that are corrupted by noise artifacts. Segade is the state-of-the-art segmentation model for noise detection: it outperforms other noise artifact detection methods on the DICE score, a well-established measure for segmentation accuracy, by a large margin (Guo et al., 2021). Further, it has been tested on several well-known public PPG data sets (Reiss et al., 2019; Schmidt et al., 2018; Z. Zhang et al., 2015) in comparison to other approaches.
In the next step, a denoising autoencoder (DAE) (Vincent et al., 2008) is given a partially corrupted signal as input and trained to recover the original signal. Training a DAE requires element-wise pairs of signals
The denoising autoencoder consists of an encoder network that maps the input signal to its latent space, and a decoder network that reconstructs the clean signal from the latent space input. The encoder network consists of four convolutional layers, each followed by a ReLU activation and batch normalization. The decoder network consists of four transpose convolution-ReLU-batch-normalization blocks. The fourth block is followed by a convolutional layer with a sigmoid activation that outputs the reconstructed signal. The encoder receives input in the dimension
Loss was computed as the root mean square error (RMSE) between the original clean and the reconstructed signals. The model was optimized using the Adam optimizer (Kingma & Ba, 2014), trained over 50 epochs.
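As a small illustration of the loss described above, the following sketch computes the RMSE between a clean signal and its reconstruction in plain Python. It assumes the error is averaged over all samples of a single signal; how the authors aggregate over batches is not stated in the text, and the function name is hypothetical.

```python
import math

def rmse(clean, reconstructed):
    """Root mean square error between two equal-length signals."""
    if len(clean) != len(reconstructed):
        raise ValueError("signals must have the same length")
    if not clean:
        raise ValueError("signals must be non-empty")
    squared = sum((c - r) ** 2 for c, r in zip(clean, reconstructed))
    return math.sqrt(squared / len(clean))
```

In a deep learning framework the same quantity would be computed on tensors so that gradients can flow through it; the arithmetic is unchanged.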
In Section 3.1, we discussed the training procedure for the denoising autoencoder used in our algorithm. In this section, we provide an end-to-end framework for denoising PPG signals and estimating HR using SPEAR. This process is illustrated in Figure 3.
The first step in the signal denoising algorithm is to locate the noise artifacts. The Segade model is again used for this purpose. Similar to the preprocessing defined in Section 3.1, first the signals are fitted to 30-second segments sampled at 64 Hz. The signals are normalized to the
Signals that are excessively corrupted beyond recovery are discarded. We consider a
The remaining signal is used as input for the denoising autoencoder model. The locations where Segade output has a classification label of 1 (indicating the presence of a noise artifact) are erased in the signal (set to 0). Let
Here,
The final step is to postprocess the clean signals in
where
To perform HR estimation, a bandpass filter with low-end cutoff of 0.9 Hz and a high-end cutoff of 5 Hz is applied, and the signal is re-normalized to the
In this section, we describe the data sets, baselines, experimental setup, and evaluation metrics for heart rate estimation.
Two data sets were used in this study: the Stanford data set (Torres-Soto & Ashley, 2020) and the PPG DaLiA data set (Reiss et al., 2019). Both data sets consist of PPG recordings collected from a wrist-worn device sampled at 64 Hz. Table 1 compares the main properties of these two data sets. The Stanford data set contains a large publicly available collection of PPG signals from wrist-worn wearables. The data set is divided into training, validation, and test sets with no subject overlap. The training set was used for training the reconstruction model. The validation set was used for hyperparameter tuning. The test set was used for testing the performance of SPEAR in comparison with baselines.
PPG DaLiA has a comprehensive data collection protocol covering subjects of different ages performing a variety of daily activities such as walking, cycling, and driving. It also provides synchronous ECG and accelerometer recordings, which are required by some baseline algorithms. The Stanford test and PPG DaLiA data sets (Reiss et al., 2019) were used for out-of-sample testing and comparison against baselines.
Property | PPG DaLiA | Stanford |
---|---|---|
PPG | Available | Available |
ECG | Available | Unavailable |
Accelerometer | Available | Unavailable |
No. of subjects | 15 | 149 |
No. of train samples (clean) | 3,400 (233) | 62,000 (7,400) |
No. of test samples | 872 | 6,700 |
Note. The first three rows specify the kinds of signals available in the data sets. Row 5 shows the number of PPG signal segments available for training, along with the number of clean signal segments in parentheses. A sample corresponds to a PPG signal segment of duration 30 seconds.
In this section, we introduce the state-of-the-art baseline methods that were used for comparison. Baselines 1 and 2 use signal processing techniques and were chosen because they achieved the best performance on the TROIKA data set (Z. Zhang et al., 2015), used in the IEEE Signal Processing Cup. Baselines 3–6 use deep learning for HR estimation without denoising; the models are based on the works of Biswas et al. (2019), Panwar et al. (2020), Reiss et al. (2019), and Shyam et al. (2019). Although none of these studies have publicly available code, these approaches were chosen for their performance and feasibility of implementation. Baseline 7 is a denoising autoencoder trained on signals with simulated noise. Implementations for other baselines (Chang et al., 2021; Comtois et al., 2007; Lee et al., 2019; Peng et al., 2014; Salehizadeh et al., 2016; Wu et al., 2017; Z. Zhang et al., 2015) are not publicly available and could not be reproduced.
Baseline 1: Wiener Filtering and Phase Vocoder (WFPV). This baseline is based on Temko (2017). It uses the three-axis accelerometer signals to estimate the noise signature and applies a Wiener filter to attenuate noise components in the frequency domain. A phase vocoder is used to estimate HR.
Baseline 2: Kalman Filtering. This baseline is based on Galli et al. (2018) and is a signal processing technique that produces a reconstructed PPG signal over small time windows. It performs signal decomposition over the PPG and three-axis accelerometer signals and reconstructs a clean PPG signal based on the degree of correlation between the PPG and accelerometer signals. Kalman smoothing is used for HR estimation from the reconstructed signal.
Baseline 3: CNN_HR_DaLiA. This baseline model uses supervised learning to estimate HR directly from sliding time windows over a noisy PPG signal. Our version of this baseline was based on DeepPPG (Reiss et al., 2019) and PPGNet (Shyam et al., 2019). The single-channel PPG signals in the PPG DaLiA data set were used for training and HR ground-truth labels were obtained from the synchronously recorded ECG.
Baseline 4: CNN+LSTM_HR_DaLiA. This baseline follows the same approach of direct HR estimation on sliding windows as Baseline 3. Our version of this baseline is based on the PP-Net (Panwar et al., 2020) and CorNET (Biswas et al., 2019) models. It is trained using the same procedure as Baseline 3; the only difference is in the model architecture. This model uses a combination of convolutional and LSTM layers.
Baseline 5: CNN_HR_Stanford. This model is architecturally identical to Baseline 3, but was trained on the Stanford training set. Since the Stanford data set includes a significantly larger collection of signals, it was important to establish comparisons with baselines trained on data similar to SPEAR's. However, since the Stanford data set does not contain ECG signals for ground-truth labels, the technique of clean signal selection and simulated corruption to generate noisy-clean training pairs was used. Clean signals were selected using the technique defined in Section 3.1 and simulated noise artifacts were added using the RRest toolbox (Charlton, 2022): a combination of frequency modulation (FM) and baseline wander (BW) was added to clean signals, while ensuring that no more than 75% of the signal is corrupted, to match the maximum degree of corruption expected by SPEAR. HR labels were generated from the clean signal in each pair using the technique from Elgendi et al. (2013).
Baseline 6: CNN+LSTM_HR_Stanford. This model is trained on the Stanford training set using the same procedure as Baseline 5. The model uses a combination of convolutional and LSTM layers and has an identical architecture to Baseline 4. Details on model architecture for the HR prediction baselines (Baselines 3–6) are provided in Appendix A.2.
Baseline 7: DAE_SimNoise. This is a denoising model based on the training approach of Lee et al. (2019). To train this model, noisy-clean pairs of PPG signals were generated by selecting clean PPG signal segments of duration 30 seconds and adding simulated noise. The noise simulation procedure was similar to Baselines 5–6, where a combination of FM and BW noise was added, while ensuring total corruption is under 75%. A training data set was generated from the Stanford data such that the number of samples was roughly equal to that of SPEAR’s training set. A denoising autoencoder with an identical architecture to SPEAR was trained to denoise the signals.
PPG DaLiA Experiment: The PPG DaLiA data set was divided into a training and a test set. The test set contains signals from subjects 1, 14, and 15. The signals were first split into non-overlapping 30-second segments. Segments that were more than 75% corrupted were discarded (the same criterion as in our method). The corresponding ECG as well as the three accelerometer signals were similarly segmented. The accepted signals were joined into one longer signal and used for testing the baselines. For Baselines 1 and 2, the continuous subject-wise signals were used as input. For Baselines 3–6, the signals were segmented into 8-second overlapping windows (with an overlap of 6 seconds). Heart rate estimation was performed on the PPG signals using the ECG for ground-truth labels.
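The segmentation steps in this experiment can be sketched as follows, in plain Python. The 64 Hz sampling rate, the 30-second segments, the 75% threshold, and the 8-second windows with 6 seconds of overlap come from the text; the function names, the list-based representation, and the handling of a trailing partial segment (dropped here) are assumptions made for illustration.

```python
FS = 64  # sampling rate in Hz, as stated for both data sets

def split_segments(signal, mask, seconds=30, fs=FS):
    """Non-overlapping (segment, mask_segment) pairs; a trailing partial segment is dropped."""
    n = seconds * fs
    return [(signal[i:i + n], mask[i:i + n])
            for i in range(0, len(signal) - n + 1, n)]

def corrupted_fraction(mask_segment):
    """Fraction of samples flagged as artifact (mask value 1)."""
    return sum(mask_segment) / len(mask_segment)

def keep_recoverable(pairs, max_corruption=0.75):
    """Discard segments that are more than max_corruption corrupted."""
    return [(s, m) for s, m in pairs if corrupted_fraction(m) <= max_corruption]

def sliding_windows(signal, window_s=8, overlap_s=6, fs=FS):
    """Overlapping windows; the stride is window_s - overlap_s seconds."""
    size = window_s * fs
    step = (window_s - overlap_s) * fs
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]
```

With these settings a single 30-second segment (1,920 samples) yields 12 windows of 512 samples at a stride of 128 samples.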
Stanford Data Set Experiment: The Stanford test set was used for this experiment. Since this data set does not contain ECG for ground-truth measurement, we utilized the clean signals as the source of ground truth. We introduce simulated noise in the clean signals similar to the training procedure defined for Baseline 5 using the RRest package (Charlton, 2022). This produced clean-noisy signal pairs. Ground-truth HR was computed on the clean signal using a peak detection algorithm (Elgendi et al., 2013). Baselines 1 and 2 could not be evaluated on this data set since they require three-axis accelerometer data as input, which is not available in the Stanford data.
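To make the ground-truth step concrete, the sketch below derives a heart rate from detected peaks on a clean signal. It uses a deliberately simple local-maximum detector as a stand-in; it is not the algorithm of Elgendi et al. (2013), and the mean-based threshold and the 0.3-second minimum peak spacing are assumptions chosen for illustration.

```python
import math

def find_peaks_simple(signal, fs, min_distance_s=0.3):
    """Indices of local maxima above the signal mean, at least min_distance_s apart."""
    mean = sum(signal) / len(signal)
    min_gap = int(min_distance_s * fs)
    peaks = []
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
        if is_local_max and signal[i] > mean:
            if not peaks or i - peaks[-1] >= min_gap:
                peaks.append(i)
    return peaks

def heart_rate_bpm(peaks, fs):
    """Heart rate from the mean interval between consecutive peaks."""
    if len(peaks) < 2:
        raise ValueError("need at least two peaks")
    intervals = [(b - a) / fs for a, b in zip(peaks, peaks[1:])]
    return 60.0 / (sum(intervals) / len(intervals))

# Synthetic check: a 1.25 Hz sinusoid corresponds to 75 beats per minute.
fs = 64
wave = [math.sin(2 * math.pi * 1.25 * n / fs) for n in range(8 * fs)]
estimated_hr = heart_rate_bpm(find_peaks_simple(wave, fs), fs)
```

A detector this simple is only reasonable on signals that are already clean, which is the situation described for the ground-truth computation.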
In this experiment, we estimated HRV from PPG signals. The goal of this experiment was to evaluate whether denoised signals produced by SPEAR can be utilized for downstream tasks and provide improvements over existing methods. Several metrics are used to measure HRV. In our experiments, we focus on two metrics: the standard deviation of inter-beat (NN) intervals (SDNN) and the root mean square of successive differences between normal heartbeats (RMSSD) (Shaffer & Ginsberg, 2017). Both metrics are measured in milliseconds (ms). A review of HRV-capable wearable devices shows that RMSSD and SDNN are the two metrics that are most commonly available on such devices (Hinde et al., 2021). SDNN is generally studied in clinical settings, considered to be the ‘gold standard’ metric for assessing cardiac risk, and used for predicting morbidity and mortality (Shaffer & Ginsberg, 2017).
HRV requires continuous measurement over a long duration; typically, a time window of 5 minutes is used for short-duration estimation (Shaffer & Ginsberg, 2017). The full PPG DaLiA data set was used for evaluation. The signals were segmented into sliding windows of duration 5 minutes with an offset of 15 seconds. The two HRV metrics (SDNN and RMSSD) were computed from PPG signals and synchronously recorded ECG (for ground truth). Mean absolute error was computed between HRV estimates from PPG and HRV ground truth from the corresponding ECG.
There is a dearth of literature on directly estimating HRV from PPG, and existing denoising approaches (Chang et al., 2021; Galli et al., 2018; Kasambe & Rathod, 2015; Wu et al., 2017) do not directly work for our HRV estimation task. This is because they perform denoising on short, overlapping (8-second) signal segments for heart rate estimation; they do not offer methods for reconstructing longer signals that can be used for HRV. Consequently, the baselines used for HR estimation could not be adapted for the HRV task, except Baseline 7. SPEAR, as well as Baseline 7, reconstructs non-overlapping 30-second signals that can be combined into continuous long-term recordings using interpolation. Thus, existing approaches based on simple peak detection can be applied for HRV estimation. For comparison, we perform HRV estimation on four variants of the PPG signals: the original signal from the data set, the signal after applying bandpass filtering, the denoised signal from DAE_SimNoise (Baseline 7), and the denoised signal from SPEAR.
To estimate HRV from PPG, we adapt the widely used methods of Bartels et al. (2017) and Morelli et al. (2018). First, a peak detection algorithm (Elgendi et al., 2013) detects the R-peaks. Then, a moving filter is applied to remove physiologically implausible peaks. The filtering criterion differs across studies (Bartels et al., 2017; Morelli et al., 2018); we used a filter based on inter-quartile range, which rejects R-R intervals that lie outside of the IQR of measured interval durations. The HRV metrics (SDNN and RMSSD) were computed using their respective formulae based on R-R intervals. Implementations of these approaches are available in the HeartPy library (van Gent et al., 2018).
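The two HRV metrics and an IQR-based interval filter can be sketched as below, using only the Python standard library. This is an illustrative sketch rather than the HeartPy implementation: the fence multiplier `k` is an assumption (k = 1.5 gives the common Tukey fences, while k = 0 keeps only intervals between the first and third quartiles), SDNN is computed here as the sample standard deviation, and the filter is applied globally rather than as a moving filter.

```python
import math
import statistics

def iqr_filter(rr_ms, k=1.5):
    """Keep R-R intervals within [Q1 - k*IQR, Q3 + k*IQR]; k is an assumed parameter."""
    q1, _, q3 = statistics.quantiles(rr_ms, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in rr_ms if low <= x <= high]

def sdnn(rr_ms):
    """Standard deviation of the intervals, in ms (sample standard deviation)."""
    return statistics.stdev(rr_ms)

def rmssd(rr_ms):
    """Root mean square of successive differences between intervals, in ms."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

Note that after filtering, neighbouring entries in the list may no longer be adjacent beats; a more careful implementation would skip successive differences that span a removed interval when computing RMSSD.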
The mean absolute error (MAE) is a widely used metric in HR estimation challenges. HR is estimated on signals segmented into sliding time windows of length 8 seconds and an overlap of 6 seconds. When the signals are split into
For heart rate variability, the same approach for computing estimation error is used, but the HRV metrics are computed over a 5-minute interval, based on the recommended time interval used for HRV (Shaffer & Ginsberg, 2017). The MAE is computed on the two HRV metrics, SDNN and RMSSD, using the ECG segments as ground truth.
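The error metric used in both the HR and HRV experiments can be written as a short helper. This sketch assumes the standard definition of mean absolute error, averaged over paired per-window estimates and ground-truth values; the function name is hypothetical.

```python
def mean_absolute_error(estimates, ground_truth):
    """Average absolute difference between paired per-window values."""
    if len(estimates) != len(ground_truth):
        raise ValueError("inputs must have the same length")
    if not estimates:
        raise ValueError("inputs must be non-empty")
    return sum(abs(e - g) for e, g in zip(estimates, ground_truth)) / len(estimates)
```

For HR the inputs would be per-window values in bpm; for HRV they would be per-window SDNN or RMSSD values in milliseconds.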
Method | Name | Training Data Set | PPG DaLiA MAE | Stanford MAE |
---|---|---|---|---|
Baseline 1 | WFPV | - | | - |
Baseline 2 | Kalman Filtering | - | | - |
Baseline 3 | CNN_HR_DaLiA | PPG DaLiA | | |
Baseline 4 | CNN+LSTM_HR_DaLiA | PPG DaLiA | | |
Baseline 5 | CNN_HR_Stanford | Stanford | | |
Baseline 6 | CNN+LSTM_HR_Stanford | Stanford | | |
Baseline 7 | DAE_SimNoise | Stanford | | |
Our Algorithm | SPEAR | Stanford | | |
Note. Baselines 1 and 2 are signal processing methods and not trained on a data set. Baselines 3, 4, 5 and 6 were trained for heart rate estimation. SPEAR and Baseline 7 were trained for denoising. Best performance among all methods is in bold.
Our main result is that the proposed algorithm performs well at the task of HR estimation from PPG signals across data sets, generally outperforming baselines. The results are summarized in Table 2. On the Stanford experiment, SPEAR outperforms all baselines and achieves the lowest MAE. In the PPG DaLiA experiment, SPEAR’s MAE (5.36 bpm) is lower than every baseline except CNN+LSTM_HR_DaLiA (4.17 bpm), which was trained to detect HR on the PPG-DaLiA data set. This shows that SPEAR’s out-of-distribution performance is on par with fully supervised approaches trained on data from the test distribution. To measure the statistical significance of these results, paired
Results further indicate that deep learning-based HR estimation models (Baselines 3–6) perform better than the signal processing approaches. The models whose architecture contains both convolutional and recurrent (LSTM) layers outperform models with purely convolutional networks. However, the direct HR estimation approaches demonstrate limited generalizability across data sets. The CNN+LSTM_HR_DaLiA baseline (trained on the PPG DaLiA training data) performs very well on the PPG DaLiA test set, but shows reduced HR estimation accuracy on the Stanford test data. The CNN+LSTM_HR_Stanford baseline (trained on the Stanford training data) has good performance on the Stanford test set, but poorer performance on the PPG DaLiA data set. In contrast, SPEAR generalizes well to both data sets, as it has the best performance on the Stanford data set and second-best on the PPG DaLiA data set. It is also evident that approaches trained on PPG containing simulated noise (Baselines 5–7) show poorer performance on out-of-distribution data than SPEAR. This shows that our self-supervised technique of erasing noise artifacts, instead of simulating them, generalizes better to signals that contain noise artifacts under real-world conditions. Appendix B.5 contains an analysis on constant and variable errors, providing insights on systematic biases and measurement variability in heart rate estimation of the compared methods.
Activity | CNN_HR_Stanford | CNN+LSTM_HR_Stanford | DAE_SimNoise | SPEAR |
---|---|---|---|---|
Sitting | ||||
Stairs | ||||
Table Soccer | ||||
Cycling | ||||
Driving a Car | ||||
Lunch Break | ||||
Walking | ||||
Working |
Note. Mean absolute error computed in beats per minute (bpm). Best performance (lowest error) for each activity highlighted in bold.
Table 3 provides a breakdown of the HR estimation MAE results over the activities performed by the subjects. The results were computed on the PPG DaLiA data set, which provides activity labels. To ensure sufficient data is available for each activity, we use the full PPG DaLiA data set for testing and report performance only for the models trained on the Stanford data set. We note that while the Stanford data set contains a large collection of subjects, it has relatively limited representation of high-intensity activities compared to PPG DaLiA. As a result, we expect relatively poorer performance on higher intensity activities from methods trained on the Stanford data. For the activities “driving a car,” “lunch break,” “walking,” and “working,” SPEAR yields the smallest estimation errors by a considerable margin. Conversely, on the activity of “stair climbing,” the direct HR estimation methods outperform SPEAR by a considerable margin. For the remaining activities, namely “sitting,” “cycling,” and “table soccer,” SPEAR does not attain the lowest estimation error, but its mean absolute error is close to that of the most accurate baselines. The results show that SPEAR produces accurate estimates of heart rate for subjects performing routine activities. However, it shows relatively weaker accuracy on higher intensity activities like climbing stairs and cycling (though other algorithms performed comparably on cycling). For real-world applications, we recommend expanding the algorithm’s training data to include a broader variety of physical activities, including both routine movements and higher intensity activities. Appendix B.6 provides a subject-wise breakdown of the heart rate estimation results. Appendix B.7 provides a runtime comparison of the methods.
Figures 4 and 5 show examples of denoised signals produced by SPEAR. Figure 6 shows two examples from the PPG DaLiA data set, along with a visualization of peak detection on the noisy and denoised signals. The figure demonstrates denoising under different conditions: in the first signal, the subject has a normal heart rate but the recording contains some motion artifacts, while in the second, the subject has an elevated heart rate. Further visual examples of signal denoising results are provided in Appendix C.
The main result from this experiment is that the denoised signals produced by SPEAR enhance the accuracy of heart rate variability estimation, surpassing the performance of other methodologies. Table 4 shows the mean absolute error in HRV estimation from PPG signals in our experiment. HRV is measured from the inter-beat time intervals in milliseconds; as a result, HRV is highly sensitive to noise because artifacts can cause missed or extra beats, which leads to large errors in the interval measurements. Our results confirm that previous techniques for estimating HRV from PPG signals (Bartels et al., 2017; Morelli et al., 2018) produce large errors compared to the ground truth. Denoising the signals achieves improved estimates over the original and bandpass filtered signals, as seen on both SPEAR and the baseline DAE_SimNoise. SPEAR achieves the lowest error in both HRV metrics out of all the signals compared in the experiment. For SDNN, SPEAR achieves an improvement of approximately 62% over the original signal and 34% over the denoised signal of DAE_SimNoise. For RMSSD, SPEAR achieves an improvement of approximately 63% over the original signal and 39% over the denoised signal of DAE_SimNoise. This demonstrates that SPEAR’s denoising algorithm yields significant improvements on the task of HRV estimation from PPG signals.
HRV Metric | Original Signal | Filtered Signal | DAE_SimNoise | SPEAR |
---|---|---|---|---|
SDNN | ||||
RMSSD |
Note. DAE_SimNoise and SPEAR produce denoised signals. The methods are evaluated on two HRV metrics: the standard deviation of NN intervals (SDNN) and the root mean square of successive differences (RMSSD) between normal heartbeats. The metrics are computed over 5-minute intervals.
Experimental results show that SPEAR often outperforms the state of the art for HR estimation and generalizes across data sets better than other approaches. This is not unexpected: SPEAR’s self-supervised approach allows the reconstruction model to be trained on a significantly larger data set, since it requires only clean signals. Supervised approaches, by contrast, either require ground-truth data or rely on simulated noise artifacts to train their models. Both requirements limit the comprehensiveness of the training data. Training on a large and realistic data set enables SPEAR to learn robust denoising representations.
Current denoising approaches reconstruct the entire signal, whereas SPEAR preserves the clean segments. This leads to better HR estimates, particularly when the source signal may only have a small amount of corruption. This is evidenced by the activity-wise performance breakdown in Table 3, which shows that SPEAR achieves better estimates when working with daily activities like walking, working, and driving. Additionally, SPEAR outputs longer (30-second) non-overlapping denoised signals that can be joined to achieve long-term continuous recordings. In comparison, other methods generally work on smaller, overlapping segments, which makes it hard to rejoin them to obtain a continuous signal. This enables SPEAR to enhance downstream applications that may require longer continuous recordings, like HRV estimation.
PPG technology is becoming increasingly ubiquitous with the adoption of modern wearables such as smartwatches, wristbands, and smart jewelry. These devices allow users to continuously monitor heart rate throughout daily life, made possible by their simplicity of operation, cost-effectiveness, and comfort of use. Most modern wearable devices now also provide continuous HRV measurement (Hinde et al., 2021), which further enables many health care applications. PPG has several personal health applications, such as tracking blood pressure (Yoon et al., 2009) and blood oxygen saturation (Almarshad et al., 2022), monitoring sleep quality (Korkalainen et al., 2020), and guiding exercise and recovery (Singh et al., 2018; Y. Zhang et al., 2020). Continuous long-term monitoring of PPG has important clinical applications, such as diagnosing cardiovascular diseases (Allen et al., 2006) and arrhythmia (Pereira et al., 2020). PPG is susceptible to noise and these applications are consequently limited in their accuracy and reliability due to the corruption of underlying metrics obtained from the signals. Technological advances in wearable devices, quality of sensors, and software-based signal processing algorithms can improve the reliability of these applications (Kim & Baek, 2023). SPEAR can be integrated as a preprocessing step for any application that uses PPG recordings for predictive or analytical tasks. The SPEAR algorithm receives a continuous PPG signal of arbitrary duration, splits it into segments, rejects the few segments that are too corrupted to recover, and reconstructs the rest only in the noise-corrupted regions, while preserving the useful information in the rest of the signal. The reconstructed clean signal can then be rejoined and passed down to further tasks. Since these signals are now clean, they result in more reliable performance in downstream tasks.
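The preprocessing flow described above (split, reject, reconstruct only the corrupted regions, rejoin) can be outlined as follows. This is a minimal sketch under stated assumptions, not SPEAR's released code: `detect` and `reconstruct` are hypothetical callables standing in for the noise detector and the autoencoder, the noise level is assumed to be the fraction of samples flagged as artifacts, and a trailing partial segment is simply dropped.

```python
def denoise_recording(signal, detect, reconstruct, segment_len, noise_threshold):
    """Return the rejoined signal after per-segment rejection and reconstruction."""
    cleaned = []
    for start in range(0, len(signal) - segment_len + 1, segment_len):
        segment = signal[start:start + segment_len]
        mask = detect(segment)  # 1 = artifact sample, 0 = clean sample
        if sum(mask) / segment_len > noise_threshold:
            continue  # segment is too corrupted to recover, so it is rejected
        # Erase the artifact samples (set to 0) before passing to the model.
        erased = [0.0 if m else x for x, m in zip(segment, mask)]
        proposal = reconstruct(erased)
        # Clean samples are kept as recorded; only artifact samples are replaced.
        cleaned.extend(p if m else x for x, p, m in zip(segment, proposal, mask))
    return cleaned


# Toy stand-ins: the value 99.0 marks an artifact; the "model" outputs -1.0 everywhere.
toy_detect = lambda seg: [1 if x == 99.0 else 0 for x in seg]
toy_reconstruct = lambda seg: [-1.0] * len(seg)
recording = [1.0, 2.0, 99.0, 4.0, 99.0, 99.0, 99.0, 8.0]
print(denoise_recording(recording, toy_detect, toy_reconstruct, 4, 0.5))
```

In the toy run, the first segment keeps its three clean samples untouched, while the second segment is discarded because three of its four samples are flagged.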
SPEAR can be deployed in personal health applications as well as clinical settings. The algorithm can be implemented in a smartphone application that integrates with the user’s wearable. Such an integration only requires a device that provides access to PPG waveform data, such as the Empatica E4 or Actigraph wristwatches (see Hinde et al., 2021, for a review on the data afforded by various PPG enabled devices). For clinical applications, the SPEAR algorithm can be integrated into existing data-processing pipelines as a preprocessing step to produce clean PPG signals to be used in various predictive or analytical tasks. SPEAR’s code is publicly available and open source; hence, it is free to integrate into any personal or clinical application without requiring FDA approval. This serves three purposes: (1) troubleshooting can be crowdsourced, (2) it can be used as a baseline for comparison with proprietary products, and (3) developers can combine SPEAR with off-the-shelf open source algorithms that work for clean PPG. That said, SPEAR can also be built into algorithms that can apply for FDA approval and have a greater degree of trust and reliability. Most importantly, our work will provide users the ability to glean metrics on their health, without being limited to the proprietary algorithms provided by device manufacturers. This can enable the development of a variety of PPG-enabled applications available to the public.
We introduced a novel self-supervised learning method for eliminating noise artifacts and estimating heart rate from PPG signals collected from wrist-worn wearable devices. An advantage of our approach is that it only requires clean PPG signals for training, which allows us to use larger data sets without ground-truth labels. SPEAR outperforms baselines at HR and HRV estimation and generalizes well across data sets. This illustrates how SPEAR enables more accurate downstream analysis of many aspects of heart monitoring from wearables.
The authors have no conflicts of interest to declare. The present work is partially supported by NIH grant R01HL166233.
Allen, J., Overbeck, K., Stansby, G., & Murray, A. (2006). Photoplethysmography assessments in cardiovascular disease. Measurement and Control, 39(3), 80–83. https://doi.org/10.1177/002029400603900303
Almarshad, M. A., Islam, M. S., Al-Ahmadi, S., & BaHammam, A. S. (2022). Diagnostic features and potential applications of PPG signal in healthcare: A systematic review. Healthcare, 10(3), Article 547. https://doi.org/10.3390/healthcare10030547
Bartels, R., Neumamm, L., Peçanha, T., & Carvalho, A. R. S. (2017). SinusCor: An advanced tool for heart rate variability analysis. Biomedical Engineering Online, 16, Article 110. https://doi.org/10.1186/s12938-017-0401-4
Biswas, D., Everson, L., Liu, M., Panwar, M., Verhoef, B.-E., Patki, S., Kim, C. H., Acharyya, A., Van Hoof, C., Konijnenburg, M., & Van Helleputte, N. (2019). CorNET: Deep learning framework for PPG-based heart rate estimation and biometric identification in ambulant environment. IEEE Transactions on Biomedical Circuits and Systems, 13(2), 282–291. https://doi.org/10.1109/TBCAS.2019.2892297
Bradley, G. R., & Kyriacou, P. A. (2024). Opening the envelope: Efficient envelope-based PPG denoising algorithm. Biomedical Signal Processing and Control, 88(Part A), Article 105693. https://doi.org/10.1016/j.bspc.2023.105693
Castaneda, D., Esparza, A., Ghamari, M., Soltanpur, C., & Nazeran, H. (2018). A review on wearable photoplethysmography sensors and their potential future applications in health care. International Journal of Biosensors & Bioelectronics, 4(4), Article 195. https://doi.org/10.15406/ijbsbe.2018.04.00125
Chang, X., Li, G., Xing, G., Zhu, K., & Tu, L. (2021). DeepHeart: A deep learning approach for accurate heart rate estimation from PPG signals. ACM Transactions on Sensor Networks, 17(2), Article 14. https://doi.org/10.1145/3441626
Charlton, P. (2022). RRest. GitHub. Retrieved September 13, 2022, from https://github.com/peterhcharlton/RRest
Comtois, G., Mendelson, Y., & Ramuka, P. (2007). A comparative evaluation of adaptive noise cancellation algorithms for minimizing motion artifacts in a forehead-mounted wearable pulse oximeter. In Proceedings of the 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 1528–1531). IEEE. https://doi.org/10.1109/IEMBS.2007.4352592
Elgendi, M., Norton, I., Brearley, M., Abbott, D., & Schuurmans, D. (2013). Systolic peak detection in acceleration photoplethysmograms measured from emergency responders in tropical conditions. PLoS One, 8(10), Article e76585. https://doi.org/10.1371/journal.pone.0076585
Galli, A., Narduzzi, C., & Giorgi, G. (2018). Measuring heart rate during physical exercise by subspace decomposition and Kalman smoothing. IEEE Transactions on Instrumentation and Measurement, 67(5), 1102–1110. https://doi.org/10.1109/TIM.2017.2770818
Goh, C.-H., Tan, L. K., Lovell, N. H., Ng, S.-C., Tan, M. P., & Lim, E. (2020). Robust PPG motion artifact detection using a 1-D convolution neural network. Computer Methods and Programs in Biomedicine, 196, Article 105596. https://doi.org/10.1016/j.cmpb.2020.105596
Guo, Z., Ding, C., Hu, X., & Rudin, C. (2021). A supervised machine learning semantic segmentation approach for detecting artifacts in plethysmography signals from wearables. Physiological Measurement, 42(12), Article 125003. https://doi.org/10.1088/1361-6579/ac3b3d
Hinde, K., White, G., & Armstrong, N. (2021). Wearable devices suitable for monitoring twenty four hour heart rate variability in military populations. Sensors, 21(4), Article 1061. https://doi.org/10.3390/s21041061
Kasambe, P. V., & Rathod, S. S. (2015). VLSI wavelet based denoising of PPG signal. Procedia Computer Science, 49(1), 282–288. https://doi.org/10.1016/j.procs.2015.04.254
Kim, K. B., & Baek, H. J. (2023). Photoplethysmography in wearable devices: A comprehensive review of technological advances, current challenges, and future directions. Electronics, 12(13), Article 2923. https://doi.org/10.3390/electronics12132923
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. ArXiv. https://doi.org/10.48550/arXiv.1412.6980
Korkalainen, H., Aakko, J., Duce, B., Kainulainen, S., Leino, A., Nikkonen, S., Afara, I. O., Myllymaa, S., Töyräs, J., & Leppänen, T. (2020). Deep learning enables sleep staging from photoplethysmogram for patients with suspected sleep apnea. Sleep, 43(11), Article zsaa098. https://doi.org/10.1093/sleep/zsaa098
Lam, E., Aratia, S., Wang, J., & Tung, J. (2020). Measuring heart rate variability in free-living conditions using consumer-grade photoplethysmography: Validation study. JMIR Biomedical Engineering, 5(1), Article e17355. https://doi.org/10.2196/17355
Lee, J., Sun, S., Yang, S. M., Sohn, J. J., Park, J., Lee, S., & Kim, H. C. (2019). Bidirectional recurrent auto-encoder for photoplethysmogram denoising. IEEE Journal of Biomedical and Health Informatics, 23(6), 2375–2385. https://doi.org/10.1109/JBHI.2018.2885139
Lin, W.-H., Ji, N., Wang, L., & Li, G. (2019). A characteristic filtering method for pulse wave signal quality assessment. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 603–606). IEEE. https://doi.org/10.1109/EMBC.2019.8856811
Lu, S., Zhao, H., Ju, K., Shin, K., Lee, M., Shelley, K., & Chon, K. H. (2008). Can photoplethysmography variability serve as an alternative approach to obtain heart rate variability information? Journal of Clinical Monitoring and Computing, 22(1), 23–29. https://doi.org/10.1007/s10877-007-9103-y
Mishra, B., & Nirala, N. S. (2020). A survey on denoising techniques of PPG signal. In 2020 IEEE International Conference for Innovation in Technology. IEEE. https://doi.org/10.1109/INOCON50539.2020.9298358
Morelli, D., Bartoloni, L., Colombo, M., Plans, D., & Clifton, D. A. (2018). Profiling the propagation of error from PPG to HRV features in a wearable physiological-monitoring device. Healthcare Technology Letters, 5(2), 59–64. https://doi.org/10.1049/htl.2017.0039
Pan, H., Temel, D., & AlRegib, G. (2016). HeartBEAT: Heart beat estimation through adaptive tracking. In 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (pp. 587–590). IEEE. https://doi.org/10.1109/BHI.2016.7455966
Panwar, M., Gautam, A., Biswas, D., & Acharyya, A. (2020). PP-net: A deep learning framework for PPG-based blood pressure and heart rate estimation. IEEE Sensors Journal, 20(17), 10000–10011. https://doi.org/10.1109/JSEN.2020.2990864
Park, J., Seok, H. S., Kim, S.-S., & Shin, H. (2022). Photoplethysmogram analysis and applications: An integrative review. Frontiers in Physiology, 12, Article 808451. https://doi.org/10.3389/fphys.2021.808451
Peng, F., Zhang, Z., Gou, X., Liu, H., & Wang, W. (2014). Motion artifact removal from photoplethysmographic signals by combining temporally constrained independent component analysis and adaptive filter. BioMedical Engineering OnLine, 13, Article 50. https://doi.org/10.1186/1475-925X-13-50
Pereira, T., Tran, N., Gadhoumi, K., Pelter, M. M., Do, D. H., Lee, R. J., Colorado, R., Meisel, K., & Hu, X. (2020). Photoplethysmography based atrial fibrillation detection: A review. NPJ Digital Medicine, 3(3), Article 3. https://doi.org/10.1038/s41746-019-0207-9
Reiss, A., Indlekofer, I., Schmidt, P., & Van Laerhoven, K. (2019). Deep PPG: Large-scale heart rate estimation with convolutional neural networks. Sensors, 19(14), Article 3079. https://doi.org/10.3390/s19143079
Salehizadeh, S. M. A., Dao, D., Bolkhovsky, J., Cho, C., Mendelson, Y., & Chon, K. H. (2016). A novel time-varying spectral filtering algorithm for reconstruction of motion artifact corrupted heart rate signals during intense physical activities using a wearable photoplethysmogram sensor. Sensors, 16(1), Article 10. https://doi.org/10.3390/s16010010
Schäck, T., Sledz, C., Muma, M., & Zoubir, A. M. (2015). A new method for heart rate monitoring during physical exercise using photoplethysmographic signals. In Proceedings of the 23rd European Signal Processing Conference (pp. 2666–2670). IEEE. https://doi.org/10.1109/EUSIPCO.2015.7362868
Schmidt, P., Reiss, A., Duerichen, R., Marberger, C., & Van Laerhoven, K. (2018). Introducing WESAD, a multimodal dataset for wearable stress and affect detection. In Proceedings of the 20th ACM International Conference on Multimodal Interaction (pp. 400–408). Association for Computing Machinery. https://doi.org/10.1145/3242969.3242985
Shaffer, F., & Ginsberg, J. P. (2017). An overview of heart rate variability metrics and norms. Frontiers in Public Health, 5, Article 258. https://doi.org/10.3389/fpubh.2017.00258
Shyam, A., Ravichandran, V., Preejith, S., Joseph, J., & Sivaprakasam, M. (2019). PPGnet: Deep network for device independent heart rate estimation from photoplethysmogram. In Proceedings of the 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 1899–1902). IEEE. https://doi.org/10.1109/EMBC.2019.8856989
Singh, N., Moneghetti, K. J., Christle, J. W., Hadley, D., Froelicher, V., & Plews, D. (2018). Heart rate variability: An old metric with new meaning in the era of using mHealth technologies for health and exercise training guidance. Part two: Prognosis and training. Arrhythmia and Electrophysiology Review, 7(4), 247–255. https://doi.org/10.15420/aer.2018.30.2
Temko, A. (2017). PPG-based heart rate estimation using Wiener filter, phase vocoder and Viterbi decoding. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1013–1017). IEEE. https://doi.org/10.1109/ICASSP.2017.7952309
Torres-Soto, J., & Ashley, E. A. (2020). Multi-task deep learning for cardiac rhythm detection in wearable devices. NPJ Digital Medicine, 3, Article 116. https://doi.org/10.1038/s41746-020-00320-4
Tsipouras, M. G., & Fotiadis, D. I. (2004). Automatic arrhythmia detection based on time and time–frequency analysis of heart rate variability. Computer Methods and Programs in Biomedicine, 74(2), 95–108. https://doi.org/10.1016/S0169-2607(03)00079-8
Tsuji, H., Larson, M. G., Venditti, F. J., Manders, E. S., Evans, J. C., Feldman, C. L., & Levy, D. (1996). Impact of reduced heart rate variability on risk for cardiac events. Circulation, 94(11), 2850–2855. https://doi.org/10.1161/01.CIR.94.11.2850
van Gent, P., Farah, H., Nes, N., & Arem, B. (2018). Heart rate analysis for human factors: Development and validation of an open source toolkit for noisy naturalistic heart rate data. In N. Van Ness & C. Vorgelé (Eds.), Proceedings of the 6th HUMANIST Conference. Association for Computing Machinery.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (pp. 1096–1103). Association for Computing Machinery. https://doi.org/10.1145/1390156.1390294
Wang, D., Gong, B., & Wang, L. (2023). On calibrating semantic segmentation models: Analyses and an algorithm. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 23652–23662). IEEE. https://doi.ieeecomputersociety.org/10.1109/CVPR52729.2023.02265
Weisstein, E. W. (2004). Bonferroni correction. Wolfram Mathworld. https://mathworld.wolfram.com/BonferroniCorrection.html
Wu, C.-C., Chen, I.-W., & Fang, W.-C. (2017). An implementation of motion artifacts elimination for PPG signal processing based on recursive least squares adaptive filter. In Proceedings of the 2017 IEEE Biomedical Circuits and Systems Conference (pp. 316–319). IEEE. https://doi.org/10.1109/BIOCAS.2017.8325141
Yoon, Y., Cho, J. H., & Yoon, G. (2009). Non-constrained blood pressure monitoring using ECG and PPG for personal healthcare. Journal of Medical Systems, 33(4), 261–266. https://doi.org/10.1007/s10916-008-9186-0
Zhang, Y., Weaver, R. G., Armstrong, B., Burkart, S., Zhang, S., & Beets, M. W. (2020). Validity of wrist-worn photoplethysmography devices to measure heart rate: A systematic review and meta-analysis. Journal of Sports Sciences, 38(17), 2021–2034. https://doi.org/10.1080/02640414.2020.1767348
Zhang, Z., Pi, Z., & Liu, B. (2015). TROIKA: A general framework for heart rate monitoring using wrist-type photoplethysmographic signals during intensive physical exercise. IEEE Transactions on Biomedical Engineering, 62(2), 522–531. https://doi.org/10.1109/TBME.2014.2359372
Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., & He, Q. (2021). A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1), 43–76. https://doi.org/10.1109/JPROC.2020.3004555
In this section, we describe the implementations of SPEAR’s model architecture and the baseline architectures used in our experiments.
The denoising autoencoder in the SPEAR algorithm uses a convolutional neural network architecture. Figure A1 illustrates the model architecture for the autoencoder. The network is summarized as follows:
The encoder consists of four convolutional blocks. Each block consists of a 1D convolutional layer with ReLU activation, followed by batch normalization. Each of the encoder conv layers has a stride of 2 and zero padding.
The encoder layers have 16, 32, 64, and 128 filters, respectively. The kernel sizes are 32, 64, 128, and 320, respectively.
The decoder network consists of five convolutional blocks. Each of the first four blocks consists of a 1D transposed convolutional layer with ReLU activation, followed by batch normalization. Each of the first four conv layers has a stride of 2 and zero padding.
The decoder layers have 128, 64, 32, and 16 filters, respectively. The kernel sizes are 320, 128, 64, and 32, respectively.
The final block consists of a convolutional layer with a single filter, kernel size 3 and stride of 1. This layer uses sigmoid activation. The output of this layer is the denoised signal of the same dimension as the input.
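To make the layer list concrete, the following sketch traces the signal length and the number of convolution weights through the network. It relies on assumptions that are not stated in the text: the zero padding is taken to be "same" padding (so each stride-2 encoder layer halves the length and each transposed layer doubles it), the input has a single channel, and the input length of 1920 samples is a hypothetical value chosen for illustration. Batch normalization parameters are ignored.

```python
import math

# (filters, kernel size) per layer, taken from the description above.
ENCODER = [(16, 32), (32, 64), (64, 128), (128, 320)]
DECODER = [(128, 320), (64, 128), (32, 64), (16, 32)]
FINAL = (1, 3)  # single filter, kernel size 3, stride 1, sigmoid activation


def conv_params(kernel, in_channels, filters):
    """Weights plus biases of a 1D (transposed) convolution layer."""
    return kernel * in_channels * filters + filters


def trace_shapes(input_len, in_channels=1):
    """Return (layer type, output length, output channels, parameters) per layer."""
    rows = []
    length, channels = input_len, in_channels
    for filters, kernel in ENCODER:
        length = math.ceil(length / 2)  # stride 2 with "same" padding (assumed)
        rows.append(("conv", length, filters, conv_params(kernel, channels, filters)))
        channels = filters
    for filters, kernel in DECODER:
        length = length * 2  # stride-2 transposed convolution (assumed "same")
        rows.append(("conv_transpose", length, filters, conv_params(kernel, channels, filters)))
        channels = filters
    filters, kernel = FINAL
    rows.append(("conv_sigmoid", length, filters, conv_params(kernel, channels, filters)))
    return rows


for row in trace_shapes(1920):  # 1920 samples is a hypothetical input length
    print(row)
```

Under these assumptions, the output length equals the input length whenever the input length is divisible by 16, which is consistent with the statement that the denoised signal has the same dimension as the input.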
Model hyperparameters were optimized via grid search over values for the kernel sizes, number of filters, and number of convolutional layers in the encoder and decoder networks. The hyperparameters were optimized based on the downstream task of minimizing mean absolute error in HR estimation on the Stanford validation data set. The evaluated values for kernel sizes were
Baseline 3 (CNN_HR_DaLiA) and Baseline 5 (CNN_HR_Stanford) use a convolutional neural network for HR prediction, with an identical model architecture. The model consists of two Convolutional-ReLU-MaxPool-Dropout blocks. The convolution layers are one-dimensional with a kernel size of 9, and have 64 and 32 filters, respectively. The max pooling layers have a size of 4, and dropout is applied with probability 0.1. The two blocks are followed by a fully connected layer that outputs a single HR prediction. The training data consisted of overlapping time windows of PPG signals: 8-second windows were generated in a sliding-window fashion with a 2-second step (6 seconds of overlap). The model was trained for 100 epochs.
Baseline 4 and Baseline 6 (CNN+LSTM_HR_DaLiA and CNN+LSTM_HR_Stanford, respectively) use a combination of convolutional and long short-term memory (LSTM) layers. The model architecture contains two Convolutional-ReLU-MaxPool-Dropout blocks, with the same hyperparameters as defined for the convolutional-only models. These blocks are followed by two LSTM layers with a hyperbolic tangent activation function. The LSTM layers have 64 and 128 units, respectively. Finally, a fully connected layer is added, which outputs the HR prediction. The model was trained for 100 epochs.
Method | PPG DaLiA | Stanford | |
---|---|---|---|
Baseline 1 |
| - | |
Baseline 2 |
| - | |
Baseline 3 |
| ||
Baseline 4 |
| ||
Baseline 5 |
| ||
Baseline 6 |
| ||
Baseline 7 |
| ||
Our Algorithm | SPEAR |
In this section, we perform ablation studies to verify the effectiveness of each component of SPEAR’s algorithm design. We define four variants of SPEAR:
SPEAR-LSTM has the same architecture as SPEAR, but adds an LSTM layer in the encoder network. Since LSTM-based architectures had superior performance among the HR estimation baselines (CNN+LSTM_HR_DaLiA and CNN+LSTM_HR_Stanford), this variant was used to see whether similar improvements would be found in our denoising model as well.
SPEAR-N is trained such that the artifact locations are replaced with Gaussian noise instead of being set to 0. In this case, the autoencoder does not receive a 0 signal at the location of the artifact, so it reconstructs the full signal, not just the corrupted regions.
SPEAR-Sm follows the same training procedure as SPEAR, but uses smaller kernel sizes in the convolutional layers of the model architecture.
SPEAR-L removes the first two convolutional layers from the encoder and the last two layers from the decoder in SPEAR’s model architecture.
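The difference between the SPEAR and SPEAR-N inputs can be shown with a short sketch. The function names, the noise standard deviation `sigma`, and the fixed seed are hypothetical choices for illustration; how artifact locations are selected during training is not covered here.

```python
import random


def erase_artifacts(signal, mask):
    """SPEAR-style input: samples flagged as artifacts are set to 0."""
    return [0.0 if m else x for x, m in zip(signal, mask)]


def noise_fill_artifacts(signal, mask, sigma=1.0, seed=0):
    """SPEAR-N-style input: flagged samples are replaced with Gaussian noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) if m else x for x, m in zip(signal, mask)]


signal = [0.2, 0.5, 0.9]
mask = [0, 1, 0]  # the middle sample is flagged as an artifact
print(erase_artifacts(signal, mask))
print(noise_fill_artifacts(signal, mask))
```

With zeros, the erased region is unambiguous in the input; with Gaussian noise, the model receives no explicit marker of where the artifact was, which matches the explanation given for SPEAR-N above.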
The results are shown in Table B2, indicating that SPEAR is not sensitive to changes in kernel size or to the use of Gaussian noise in the input, but it is sensitive to major ablations such as the removal of convolutional layers. We also see that adding an LSTM layer to the SPEAR architecture does not offer any improvement, so the added complexity does not bring the benefits that were seen in the HR estimation baselines.
Model | DaLiA MAE | Stanford MAE |
---|---|---|
| ||
| ||
| ||
| ||
SPEAR |
Note. The mean absolute error (MAE) values are in beats per minute (bpm). The best algorithm’s performance is highlighted in bold.
The first step in the SPEAR algorithm for denoising signals is to determine the noise levels in the signal and discard any signal segments that are too corrupted. In the analysis section above, we use a noise threshold of
We conducted this experiment on the PPG DaLiA data set since it provides ground-truth ECG data. Table B3 shows the result of this experiment. We can see that most signals in this data set are fairly noisy. The trade-off between HR estimation accuracy and the amount of noisy signal removed is evident from the table: lower thresholds discard more recordings but yield lower error on those that remain. There is no clear “best” choice of noise threshold; the ideal choice depends on the use case. For applications that require high accuracy, such as health care diagnosis, a lower noise threshold (like 0.5) would be preferable. On the other hand, for daily tracking, we can choose to preserve more of the recordings with a larger threshold (like 0.75) and provide HR estimates that are slightly less accurate (though still more accurate than other state-of-the-art methods).
We conducted another experiment to compute the Reconstruction Ratio (RR), a metric we defined to measure the degree of signal reconstruction by SPEAR. The Reconstruction Ratio represents the percentage of the input signal that SPEAR identifies as being corrupted by noise artifacts and subsequently reconstructs. Intuitively, we would expect the Reconstruction Ratio to increase as the noise threshold increases, since we allow signals with a greater level of corruption to be denoised. Table B4 reports this metric for the PPG DaLiA and the Stanford test data sets.
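A minimal sketch of how such a ratio could be computed from per-segment artifact masks is given below. Two assumptions are made that the text does not spell out: the noise level of a segment is the fraction of its samples flagged as artifacts, and the ratio is computed only over segments that pass the noise threshold. The function names are hypothetical.

```python
def noise_level(mask):
    """Fraction of samples in a segment flagged as artifacts (assumed definition)."""
    return sum(mask) / len(mask)


def reconstruction_ratio(masks, noise_threshold):
    """Percentage of retained samples that are flagged and hence reconstructed."""
    kept = [m for m in masks if noise_level(m) <= noise_threshold]
    total = sum(len(m) for m in kept)
    if total == 0:
        return 0.0
    return 100.0 * sum(sum(m) for m in kept) / total


masks = [[0, 0, 0, 1], [1, 1, 1, 0], [0, 1, 1, 0]]
print(reconstruction_ratio(masks, 0.5))   # the heavily corrupted segment is rejected
print(reconstruction_ratio(masks, 0.75))  # all three segments are retained
```

In this toy example the ratio rises from 37.5% to 50.0% as the threshold is relaxed, mirroring the trend expected for the metric.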
Noise Threshold | Reduction in Signal Recordings (%) | Heart Rate MAE (bpm) |
---|---|---|
0.5 | 54% | 3.73 |
0.6 | 46% | 4.45 |
0.75 | 32% | 5.36 |
0.85 | 22% | 6.10 |
Note. Tested on the PPG DaLiA data set. The second column shows the percentage reduction in the number of signal recordings for a given noise threshold. The third column shows the mean absolute error in HR estimation on the remaining signals.
Noise Threshold | PPG DaLiA RR (%) | Stanford RR (%) |
---|---|---|
0.5 | 22.3% | 20.4% |
0.6 | 28.2% | 23.6% |
0.75 | 36.0% | 26.9% |
0.85 | 42.1% | 29.2% |
The artifact detection step of SPEAR uses a noise detection model to locate the artifacts within a signal. We chose the Segade model (Guo et al., 2021) for this step. Segade is a segmentation model that accurately locates noise artifacts within a signal. It outperforms all other methods on the Dice score metric, which measures segmentation accuracy. Furthermore, it generalizes across multiple data sets and is available as an open-source repository. Hence, it is an appropriate candidate for the SPEAR algorithm.
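For reference, the Dice score between a predicted and a reference artifact mask is twice the overlap divided by the total number of flagged samples in both masks. The sketch below is a generic implementation for binary masks, not code from Segade; returning 1.0 when both masks are empty is a convention assumed here.

```python
def dice_score(pred, truth):
    """Dice coefficient between two binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    # Two empty masks agree perfectly; 1.0 is the convention assumed here.
    return 1.0 if total == 0 else 2.0 * intersection / total


print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```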
We conducted an experiment to test the sensitivity of HR estimation to the performance of the noise detection algorithm. For each signal in the PPG DaLiA test set, we added false positives to Segade’s predictions. Effectively, this increases the identified “noise” in the signal by marking some of the clean regions as noisy. These predictions are then passed down the SPEAR pipeline, and heart rate is predicted after denoising. We found that randomly adding 5% false positives to the signals increases the mean absolute error in HR from 5.36 to 5.97 bpm. This modest increase is attributable to the fact that there is now more “noise” in the signal for the algorithm to reconstruct. Since SPEAR is able to reconstruct clean signals in the deleted regions, this shows that its performance is not highly sensitive to false positives in the noise detector’s predictions.
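One way such a perturbation could be implemented is sketched below. The exact procedure used in the experiment is not described, so the details here are assumptions: the fraction is taken relative to the signal length, the flipped samples are drawn uniformly at random from the clean samples, and the function name and seed are hypothetical.

```python
import random


def add_false_positives(mask, fraction=0.05, seed=0):
    """Flip a fraction of the signal's samples from clean (0) to artifact (1)."""
    rng = random.Random(seed)
    clean_idx = [i for i, m in enumerate(mask) if m == 0]
    n_flip = min(round(fraction * len(mask)), len(clean_idx))
    perturbed = list(mask)
    for i in rng.sample(clean_idx, n_flip):
        perturbed[i] = 1
    return perturbed


mask = [0] * 80 + [1] * 20  # 20% of the samples are true artifacts
perturbed = add_false_positives(mask)
print(sum(mask), sum(perturbed))
```

True artifact samples are never altered, so the perturbed mask only grows the region that the autoencoder is asked to reconstruct.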
Since our approach does not attempt to reconstruct clean parts of the signal, a limitation of SPEAR is that it may be sensitive to false negatives in the noise detector’s predictions. Transfer learning (Zhuang et al., 2021) can improve performance of the noise detection model on new data, if some noise artifact labels are available. Postprocessing model calibration (Wang et al., 2023) can be used to obtain empirical prediction probabilities; the decision threshold can then be lowered so that lower-probability predictions are also classified as noise, thus decreasing the false negative rate. Though this may come at the cost of more false positives, we showed that SPEAR is robust to false positives. Since Segade achieves good accuracy on our experimental data, experimental results on these approaches are deemed out of scope for this article.
The Signal Reconstruction step of SPEAR removes the noisy regions of the signal and imputes them using a denoising autoencoder model to construct a clean signal. This raises the question of whether the imputation step is required at all: what if we just delete the noise artifacts and compute HR from the remaining signal? We conducted an experiment to analyze this. We define a baseline, SPEAR-NoImpute, that runs the noise detection model and deletes the noise-corrupted regions from the signals. It then rejoins the remaining segments to form a continuous signal (this joined signal could have discontinuities of its own, since the segments were joined at arbitrary positions). The corresponding ECG recordings are then used to compute the ground truth. We ran the set of experiments from Section 4.3 to compare its performance on HR estimation tasks with that of SPEAR.
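The delete-and-rejoin baseline can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the noise mask is taken to be a per-sample boolean array, the helper name `delete_and_rejoin` is hypothetical, and a synthetic sinusoid stands in for a PPG signal.

```python
import numpy as np

def delete_and_rejoin(signal, noise_mask):
    """Drop samples flagged as noise and concatenate what remains."""
    signal = np.asarray(signal, dtype=float)
    keep = ~np.asarray(noise_mask, dtype=bool)
    return signal[keep]

fs = 64                                   # sampling rate in Hz
t = np.arange(30 * fs) / fs               # 30 s of samples
ppg = np.sin(2 * np.pi * 1.2 * t)         # synthetic pulse wave at 72 bpm
noise_mask = np.zeros(t.size, dtype=bool)
noise_mask[300:500] = True                # two hypothetical artifact regions
noise_mask[1000:1100] = True

joined = delete_and_rejoin(ppg, noise_mask)

# The largest sample-to-sample jump grows once segments are joined at
# arbitrary phases of the pulse wave.
jump_original = np.max(np.abs(np.diff(ppg)))
jump_joined = np.max(np.abs(np.diff(joined)))
print(joined.size, jump_original, jump_joined)
```

In this toy case the joined signal has a much larger sample-to-sample jump at the seams than anywhere in the original, which is the kind of discontinuity that can mislead a peak detector.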
We compare the results of the experiment for both the Stanford and DaLiA test data sets in Table 6. The results show that SPEAR-NoImpute performs significantly worse on heart rate estimation than SPEAR. This shows that the imputation step is indeed an important step for signal denoising. The reason is that signals can contain many noise artifacts at arbitrary locations (for example, Figures C3 and C4), and simply joining the signal segments can lead to unpredictable positions of the R-peaks. Though the individual segments are clean, joining them together could lead to discontinuities at various points in the signal, leading to errors in peak detection and HR estimation.
Method | Stanford MAE (bpm) | DaLiA MAE (bpm) |
---|---|---|
SPEAR-NoImpute | 9.08 | 11.23 |
SPEAR | 3.18 | 5.36 |
In this section, we report and analyze the constant and variable error results from the heart rate estimation experiments. Constant error (CE) is the mean of the measurement errors, taking into account the directionality of the error. It helps identify systematic biases in the model that may consistently skew measurements in a particular direction. Variable error (VE) is the standard deviation of the measurement errors. It provides insights on the consistency and stability (or lack thereof) of the measurements. Table B1 reports the results on the constant and variable errors of the compared methods from the heart rate estimation experiments on the PPG DaLiA and Stanford data sets.
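The two definitions above translate directly into a few lines of code. The sketch below assumes that each error is computed as predicted minus reference heart rate and that the standard deviation is the population form (`ddof=0`); neither convention is stated in the text, and the function name and example values are hypothetical.

```python
import numpy as np

def constant_and_variable_error(hr_pred, hr_true):
    """Return (CE, VE, MAE) in bpm for paired heart rate estimates."""
    errors = np.asarray(hr_pred, dtype=float) - np.asarray(hr_true, dtype=float)
    ce = errors.mean()            # signed mean: systematic bias
    ve = errors.std()             # spread of the errors (population SD)
    mae = np.abs(errors).mean()   # magnitude only, for comparison
    return ce, ve, mae

# Hypothetical estimates and references, in bpm
ce, ve, mae = constant_and_variable_error([72, 80, 65, 90], [70, 82, 66, 85])
print(ce, ve, mae)
```

Because CE keeps the sign of each error, over- and underestimates can cancel, so a method can have a CE near zero and still have a large MAE; VE and MAE capture that remaining inconsistency.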
SPEAR has a CE of
The VE of SPEAR is the lowest on the Stanford data set, which shows that its HR estimation results are more consistent than those of the other methods. On the PPG DaLiA experiment, SPEAR’s VE (9.9 bpm) is lower than that of all methods except CNN+LSTM_HR_DaLiA (6.4 bpm). So, SPEAR has relatively lower bias but relatively higher variability on the PPG DaLiA experiment than the best-performing method, which was trained on data from the test distribution.
Table B6 provides a subject-wise breakdown of HR estimation results for the full PPG DaLiA data set. For fairness of comparison, we include only the baselines that did not use the PPG DaLiA data set for training. SPEAR achieved the lowest HR estimation error on 13 out of 15 subjects, and the second-lowest error on the other 2. This indicates that SPEAR’s performance on the heart rate estimation experiments is consistent across subjects.
Subject ID | | | | | | SPEAR |
---|---|---|---|---|---|---|
S1 | ||||||
S2 | ||||||
S3 | ||||||
S4 | ||||||
S5 | ||||||
S6 | ||||||
S7 | ||||||
S8 | ||||||
S9 | ||||||
S10 | ||||||
S11 | ||||||
S12 | ||||||
S13 | ||||||
S14 | ||||||
S15 | ||||||
Note. Subjects belong to the data set from the PPG DaLiA study. The mean absolute error (MAE) and standard deviation (SD) values are reported in beats per minute (bpm). The best algorithm’s performance for each subject is highlighted in bold.
In this section, we compare the runtime of each of the algorithms. We ran each of the algorithms on the full PPG DaLiA and Stanford test data sets. On the PPG DaLiA data set, we measured the average runtime of each algorithm over the subjects’ full recordings. On the Stanford data set, we measured the runtime of each algorithm over the entire test data set, averaged over 10 runs. The experiments were conducted on a 2021 MacBook Pro with the M1 Pro chip and 16 GB of RAM.
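A timing harness of the kind described here might look like the following sketch. It is an illustration only: `mean_runtime` and `dummy_denoiser` are hypothetical names, the placeholder function stands in for any of the compared algorithms, and the original experiments may have measured time differently.

```python
import time

def mean_runtime(fn, data, n_runs=10):
    """Average wall-clock time, in seconds, of fn(data) over n_runs calls."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn(data)
        durations.append(time.perf_counter() - start)
    return sum(durations) / n_runs

def dummy_denoiser(signals):
    """Placeholder workload standing in for a denoising algorithm."""
    return [[x * 0.5 for x in s] for s in signals]

signals = [[0.0] * 1920 for _ in range(20)]   # twenty 30 s signals at 64 Hz
avg_seconds = mean_runtime(dummy_denoiser, signals)
print(f"{avg_seconds:.6f} s per pass over the data set")
```

Averaging over repeated runs reduces the influence of one-off effects such as cache warm-up or background processes on the reported runtime.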
Figure B2 reports the runtime comparison. Results show that the purely convolutional ML approaches are faster than the approaches containing recurrent (LSTM) layers. On a relatively smaller recording, SPEAR’s runtime is better than that of the other baselines (except Kalman Filtering), but on the large Stanford data set, SPEAR tends to take longer than the baselines. The comparatively longer runtimes of SPEAR and DAE_SimNoise can be explained by the fact that they are generative algorithms that perform signal reconstruction and use a model architecture involving encoder and decoder networks.
In this section, we provide denoising results from the SPEAR algorithm. Figures C1–C7 display these results. Results are provided for out-of-sample signals obtained from the PPG DaLiA and Stanford test sets.
Each PPG signal is 30 seconds long and sampled at 64 Hz.
The first signal in each result is the original signal from the data set. These signals contain some noise artifacts, which are highlighted in red (noise regions are predicted by the Segade model).
The second signal in each result is the denoised signal produced by SPEAR. It can be observed that the signal is reconstructed only in the noise-corrupted regions; the rest of the signal is unaltered.
The third signal in each result visualizes the denoised signal after bandpass filtering and peak detection. Bandpass filtering makes the signal smoother, which improves the performance of peak detection algorithms. The green dots in the signal show the peaks that were detected by the peak detection algorithm.
For the examples from the PPG DaLiA data set, the fourth signal shows the synchronously recorded ECG signal, which is used as ground truth. These signals are included to demonstrate the efficacy of the peak detection on the denoised signal.
Note that in some cases the peaks of the uncorrupted regions may appear larger in the denoised version than in the original. This is because the artifacts in the original signal may have greater amplitude, which makes the peaks in the rest of the signal look smaller by comparison. When these artifacts are removed, the peaks in the clean signal appear larger.
The denoising examples from the Stanford data set consist of signals that are corrupted with real noise artifacts. This is in contrast with the Stanford experiment in Section 4.3, where simulated noise was added to generate clean-noisy pairs. For visualization purposes, we demonstrate real noisy signals from the Stanford data set.
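The bandpass filtering and peak detection step shown in the third signal of each result can be sketched with SciPy. The filter type, the 0.5–4 Hz passband, the filter order, and the minimum peak spacing below are assumptions chosen for illustration; the text does not state the settings actually used. The 64 Hz sampling rate and 30-second duration follow the description above, and the input is a synthetic waveform rather than a real PPG recording.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

FS = 64  # sampling rate in Hz

def bandpass(signal, low_hz=0.5, high_hz=4.0, order=2, fs=FS):
    """Zero-phase Butterworth bandpass (passband is an assumed choice)."""
    b, a = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs)
    return filtfilt(b, a, signal)

def heart_rate_bpm(signal, fs=FS, max_bpm=240):
    """Estimate mean heart rate from the spacing of detected peaks."""
    filtered = bandpass(signal, fs=fs)
    min_distance = int(fs * 60 / max_bpm)      # shortest allowed beat interval
    peaks, _ = find_peaks(filtered, distance=min_distance)
    if len(peaks) < 2:
        return float("nan"), peaks
    intervals = np.diff(peaks) / fs            # inter-beat intervals in seconds
    return 60.0 / intervals.mean(), peaks

t = np.arange(30 * FS) / FS
# Synthetic 75 bpm pulse wave with a slow baseline drift
ppg = np.sin(2 * np.pi * 1.25 * t) + 0.5 * t / 30
bpm, peaks = heart_rate_bpm(ppg)
print(f"{bpm:.1f} bpm from {len(peaks)} peaks")
```

On this synthetic input the estimate should land close to the 75 bpm used to generate the wave. Real signals would typically need additional safeguards, such as a prominence threshold in `find_peaks`, to reject small spurious maxima.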
Code for SPEAR is publicly available at https://github.com/pranay-jain/SPEAR-PPG-Denoiser.
The code for the Segade noise artifact detection model (Guo et al., 2021) can be found at https://github.com/chengstark/Segade.
The code for Baseline 1 (Temko, 2017) can be found at https://github.com/andtem2000/PPG. The code for Baseline 2 (Galli et al., 2018) can be found at https://github.com/AlessandraGalli/PPG.
©2024 Pranay Jain, Cheng Ding, Cynthia Rudin, and Xiao Hu. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.