The integration of wrist-worn wearables to assess physiological parameters, such as heart rate (HR), within psychotherapy is increasing due to their intuitive handling and their ability to provide real-time data. However, little is known about the reliability and validity of different wearable parameters (e.g., HR, stress rate) and about the extent to which the degree of emotional activation associated with different types of intervention influences the accuracy of these parameters.
Methods
12,330 thirty-second segments from 159 treatment sessions addressing test anxiety conducted with 26 participants were analyzed. Each session contained emotion-focused and cognitive-oriented interventions. The accuracy of the wearable’s (Garmin Vivosmart 4) HR and stress rate (SR) indices was compared to stationary gold-standard measurements (Electrocardiogram, ECG and Electrodermal Activity, EDA) by calculating Intraclass Correlations (ICC). Using multilevel prediction analyses for next-session outcomes, we examined and compared the predictive validity of client stationary and wearable physiological parameters.
Results
A very good agreement between wearable HR and lab ECG HR, as well as wearable SR and wearable HR was found. The agreement between wearable SR and lab ECG HR was good, but that between wearable SR, HR and lab EDA was poor. Wearable HR and lab ECG HR were more coherent with each other during cognitive-oriented compared to emotion-focused interventions. Regarding the predictive validity, client HR was positively associated with their next-session symptom severity levels, regardless of the measurement method and intervention type.
Conclusions
The results suggest that physiological parameters capturing activating and regulatory information may offer valuable insights into psychotherapeutic change processes. In this context, Garmin’s wearable HR may be a valid measure to investigate specific research questions when stationary measurement is not feasible. Nevertheless, existing limitations of wearable parameters are discussed.
A growing literature on physiological processes during psychotherapy demonstrates the relevance of examining physiological activity and its association to treatment outcome (Deits-Lebehn et al., 2020; Kleinbub, 2017). Physiological data contains information about biological components of cognitive, emotional and behavioral processes that can often occur unconsciously within clients and cannot be collected with self-reports or observational ratings. Parameters of the autonomic nervous system (ANS) are particularly suitable for analyzing emotional responsiveness and have been most extensively studied in association to therapeutic settings (Del Piccolo & Finset, 2018). These include for instance electrodermal activity (EDA), heart rate variability (HRV) and heart rate (HR). EDA is solely under the control of the sympathetic branch of the ANS and is particularly sensitive to arousal stemming from emotional and cognitive processes, regardless of conscious awareness (Dawson et al., 2007; Sequeira et al., 2009). Among the various HRV measures, the root mean square of successive differences between normal heartbeats (RMSSD) is frequently used to assess parasympathetic activity. Lower RMSSD values are often linked to negative affect, reflecting reduced parasympathetic influence (for a review, see Dufey et al., 2023). In contrast, HR data allow for the examination of both sympathetic and parasympathetic functioning, and accompanying emotional regulation (Thayer et al., 2009). HR increases are typically associated with sympathetic activation, signifying heightened vigilance, active avoidance, and negative emotions (e.g., anxiety). Conversely, HR decreases with parasympathetic activation (Berntson et al., 2007), indicating positive emotions and enhanced cognitive processing (Tremayne, 1990).
In the literature, associations between client physiological processes and treatment outcome in psychotherapy are particularly investigated in studies on exposure therapy, with inconsistent results. Outcomes in imaginal exposure therapy were found to be positively associated with higher initial physiological activity in clients (e.g., Halligan et al., 2006; Kozak et al., 1988). Conversely, mid-treatment exposure outcomes were negatively associated with clients’ higher physiological arousal, as indicated by elevated EDA, during the exposure (McCormack et al., 2020). However, as EDA is regulated by the sympathetic branch of the ANS, it is not possible to draw any conclusions about the regulatory influences of the parasympathetic branch (Prinz et al., 2021). In contrast, examining HR, which is controlled by both branches of the ANS, allows for such conclusions about regulatory mechanisms that mainly play a role in emotion-focused interventions (e.g., Prinz et al., 2022). Several studies found a significant correlation between EDA and HR (e.g., Kettunen et al., 1998; Lazarus et al., 1963), leading to the assumption of a “[…] common central mediating mechanism that integrates sympathetic and parasympathetic controls […]” (Kettunen et al., 1998, p. 222). However, specific circumstances, such as small measurement intervals, seem to be necessary to find this association (Kettunen et al., 1998).
When collecting physiological data during psychotherapy using stationary gold-standard devices, however, several disadvantages must be considered. First, the positioning of electrodes on the upper body or on the non-dominant hand affects the natural setting of psychotherapy and restricts both client and therapist in their natural (presumably also emotionally regulatory) movements. Second, data preparation requires a lot of time and expertise, as each data set must be visually inspected and manually edited. Third, technical problems and poor data, which cannot be analyzed, are common issues. Fourth, the purchase and servicing of the devices and associated software are expensive. As a result, stationary physiological measurements are used in only a few studies and not in routine assessments like questionnaires.
Recent technological developments such as wearables facilitate the integration of physiological parameters into the field. Wearables are unobtrusive devices, do not restrict movements or comfort (Pasadyn et al., 2019), and are ideally suited for long-term surveys (Menghini et al., 2019) with the possibility to be integrated into the client’s everyday life. For example, Hehlmann et al. (2021) investigated abrupt and gradual changes in individual stress levels in client’s everyday life using wearables during a two-week Ecological Momentary Assessment (EMA) period at the beginning of treatment. The results of their case-study showed that a reduced rate of abrupt changes in stress level was associated with a greater level of self-reported symptoms at session 15. In addition, Siddi et al. (2023) used a wearable device to study heart rate in clients with recurrent depression. They found that reduced variation in HR during daytime rest periods and higher HR during the night were associated with more severe depressive symptoms.
Besides investigating the physiology of clients in their everyday life, wearables are also used to index physiological parameters, such as HR, HRV, and EDA, during psychotherapeutic sessions to examine their association with alliance ratings (e.g., Tal et al., 2023), symptom severity (Christian et al., 2023), and treatment outcome (Gernert et al., 2024). Some of these studies found an incremental predictive value of physiological parameters in predicting psychotherapeutic processes, such as treatment outcome (e.g., Gernert et al., 2024, but see Tal et al., 2023). Despite these promising findings, the reliability and validity of physiological data collected with wearable devices are not entirely clear. A few studies examined the quality of wearable-derived data, leading to device-dependent heterogeneous results. For instance, Miller et al. (2022) compared the HR and HRV data of several wearables to gold-standard ECG measurements during night-time sleep. The intraclass correlations ranged from moderate to almost perfect agreement with ECG, highly depending on the wearable device. Similarly, in their review, Evenson and Spade (2020) reported varying agreements between HR data derived from several wearables and ECG assessments, reaching from low to excellent agreements. However, in two reviews the accuracy of wearable-derived HR was summarized across several devices as acceptable (Fuller et al., 2020; Nelson et al., 2020). Besides such device-immanent differences, another reason for such varying measurement accuracy of wearable-derived HR is the impact of physical activity, with studies consistently finding lower agreement with ECG during physical activity compared to resting states (e.g., Evenson & Spade, 2020; Menghini et al., 2019). Wearables usually use optical sensors, called Photoplethysmography (PPG) to detect HR. 
In this technique, an LED illuminates the skin and measures the amount of light reflected by the body tissue, which depends in part on the volume of arteries near the skin surface. Consequently, the amount of reflected light varies with the arterial pulses, and this variation is used to estimate HR and HRV parameters (Collins et al., 2019). However, this optical sensor technique is sensitive to (small) movements, skin temperature, skin color (e.g., dark tattoos) and aging (e.g., Collins et al., 2019; Evenson & Spade, 2020). To minimize these influences, several user instructions are provided, including the position and fit of wearables (e.g., Garmin, 2024).
Several wearable devices (e.g., Garmin, Whoop) additionally provide individual stress levels (SR) by using the PPG derived HR and HRV for their algorithm-based calculation.1 For example, Garmin relies on the algorithm from Firstbeat Technologies (Firstbeat Technologies Ltd., 2014). To our knowledge, no study has evaluated the reliability of these inferred stress ratings.
To summarize, alternative technological options to stationary measurements, such as wrist-worn wearables, exist to investigate physiological parameters. While these wearable-derived parameters (i.e., HR, SR) are already implemented in prediction models (e.g., Hehlmann et al., 2021; Siddi et al., 2023), knowledge about their validity and reliability is still limited, and wearable SR parameters have not been validated at all. Moreover, the reliable application in the specific context of psychotherapy requires further validation. Psychotherapy is characterized by different states of emotional activation. It remains unclear whether these states influence the quality of measurements obtained from wearables.
The aim of this study was to validate wearable-derived parameters of physiological arousal – namely HR and SR – by comparing them with stationary gold-standard measurements of ECG and EDA. Additionally, we aimed to examine and compare client physiological arousal indicators, derived from both wearable and stationary devices, as predictors of psychotherapeutic change processes between sessions, while accounting for the degree of emotional activation. Moreover, we exploratively examined how the degree of emotional activation might influence the accuracy of wearable HR and SR measurements.
Concretely, we assessed HR and SR collected using the wrist-worn wearable Garmin Vivosmart 4. Additionally, HR was measured with a gold-standard lab ECG, from which the Root mean square of successive differences (RMSSD) was also derived. EDA was recorded using a gold-standard laboratory device. Data were collected from 16 clients and 10 therapists participating in a six-session protocol-based treatment for test anxiety. Each session included a baseline period, an emotion-focused intervention, and a cognitive-oriented intervention.
The following hypotheses guided our work:
Hypothesis 1
Wearable HR: (A) Based on previous validation studies overall finding acceptable accuracy rates of wearable-derived HR (e.g., Fuller et al., 2020; Nelson et al., 2020), we expected a moderate intra-class correlation (ICC) between wearable HR and lab ECG HR. (B) Based on results showing an association between HR and EDA parameters (Kettunen et al., 1998; Lazarus et al., 1963), we additionally expected to observe a moderate ICC between wearable HR and stationary lab EDA. (C) Due to heightened sympathetic activity and reduced parasympathetic (vagal) tone during physiological arousal, we expected a moderate negative correlation between wearable HR and lab ECG RMSSD.
Hypothesis 2
Wearable SR: (A) Since wearable stress rates are calculated based on HR and HRV, which are controlled by both branches of the ANS, we expected to observe a moderate ICC between wearable SR and both lab ECG HR and (B) lab EDA (Kettunen et al., 1998; Lazarus et al., 1963). (C) Due to heightened sympathetic activity and reduced parasympathetic (vagal) tone during physiological arousal, we expected a moderate negative correlation between wearable SR and lab ECG RMSSD.
Hypothesis 3
Predictive validity: As several studies have shown that client physiological activity was predictive of treatment outcome in emotion-focused interventions (Halligan et al., 2006; Kozak et al., 1988; MacCormack et al., 2020), we expected that both stationary and wearable indicators of client physiological arousal during emotion-focused interventions—but not during baseline or cognitive-oriented interventions—would be associated with next-session outcome.2
Moreover, since no previous studies are known, we exploratively investigate the impact of session content on the accuracy of the wearable-derived physiological parameters. Specifically, our further goal is to examine the extent to which the degree of emotional activation affects the accuracy of wearable HR and SR.
Materials and Methods
Study Overview
The sample consisted of 26 participants (16 clients and 10 therapists) who had participated in an open-trial study targeting test anxiety between 2023 and 2024 at a university outpatient clinic in southwest Germany. In this trial, clients were treated with a six-session treatment protocol, combining cognitive-oriented as well as emotion-focused interventions (Prinz et al., 2016, 2019). The treatment protocol is freely available at www.osf.io/hraqd. A brief description of the intervention types is given below; a detailed description can be seen in Prinz et al. (2019). Before the first session took place, participants were provided with a study information sheet and asked to give written informed consent. They were briefed on the general objectives of the study, which involved evaluating an innovative treatment manual focused on emotions using Imagery Rescripting (IR) to address test anxiety. However, clients and therapists were kept unaware of the specific hypotheses of the study and of the assessment of emotional arousal using objective measures. They were informed that the sessions would be video-recorded and that their HR would be monitored using both a Garmin wearable and a stationary ECG device. They were also notified that various HRV parameters would be measured via the stationary ECG, and that their electrodermal activity (EDA) would be recorded using a stationary EDA device. Additionally, participants were informed that their participation was voluntary and that they could discontinue treatment at any point without facing any adverse consequences. Aside from receiving the psychological intervention free of charge, participants did not receive any form of compensation.
Participants
Clients were recruited using a campus newsletter. For inclusion, clients had to meet the following criteria: (1) receive a Test Anxiety Inventory (TAI; Spielberger, 1980) score higher than 54; (2) report no imminent risk for suicide; and (3) not currently be in any other form of psychological treatment targeting test anxiety.
Fifty potential clients responded to advertising. Of these, 20 were screened for eligibility and 30 dropped out because they had no further interest in the treatment. One client was excluded because of a TAI score below threshold. Two completed the intake but opted not to join the treatment because of time demands. Seventeen met the inclusion criteria and started the treatment. Of these, two dropped out during the treatment, one after session one and the other after the fourth session. As the data from the dropouts are nevertheless suitable for the validation study, they were not excluded from the analyses.3 The clients differed in terms of their academic field: Law, Education, Computer science, Philosophy, Japanese Studies, Business, Engineering, and others. This non-clinical sample exhibited elevated test-anxiety scores; however, no further diagnostic information was available.
Eleven therapists participated in this study. One therapist was a licensed clinical psychologist; the other ten were master's students in clinical psychology with no prior psychotherapy experience. The mean number of clients per therapist was 1.5 (range: 1–4). Each therapist underwent training in using the six-session protocol. This training encompassed studying and discussing the protocol, observing sample videos depicting experienced clinicians employing IR with actors simulating clients experiencing test anxiety, and engaging in role-playing exercises for each of the six sessions. Moreover, therapists participated in weekly group supervision throughout the entire treatment period. The training and supervision were facilitated by an experienced clinical psychologist with significant expertise in conducting IR as well as this treatment protocol. One therapist did not provide written informed consent to evaluate her data for scientific purposes. Therefore, the data of ten therapists and 16 clients were included in the validation analyses. The prediction models were carried out using only the data from the 16 clients. For more participant information, see Table 1.
Table 1
Sample characteristics: demographic variables
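| Variables | Mean | Range |
|---|---|---|
| Age (in years) | 25.23 | 19–35 |
| Academic year | 5.36 | 2–16 |

| Variables | n | % |
|---|---|---|
| Sex | | |
| female | 19 | 73.08 |
| male | 7 | 26.92 |
| Academic degree | | |
| none | 13 | 50.00 |
| bachelor | 12 | 46.15 |
| PhD | 1 | 3.85 |
| Marital status | | |
| single | 14 | 53.85 |
| in relationship | 4 | 15.38 |
| married | 1 | 3.85 |
| Missings | 7 | 26.92 |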
Intervention Type
Baseline.
A safe place imagery was carried out as a baseline. The 2-minute baseline took place in session 1 within the emotion-focused intervention. From session 2 onwards, it was carried out at the beginning of each session. During the baseline, there was no conversation between the client and therapist; both were instructed to imagine their safe place in silence.
Emotion-focused interventions.
Imagery work was applied as the emotion-focused intervention. During imagery work, both the client and therapist were instructed to close their eyes and visualize the imagined situations as vividly as possible. The interventions varied depending on the session: session 1, safe place imagery; session 2, exploration of an aversive situation related to test anxiety; sessions 3 and 4, imagery rescripting of a past situation related to test anxiety; sessions 5 and 6, imagery rescripting of a future situation related to test anxiety (study phase in session 5, and test-taking phase in session 6).
Before each emotion-focused intervention, therapists briefly introduced the respective imagery technique, keeping introductions short to minimize demand effects. Clients were asked to close their eyes, and therapists either did the same for most of the session or turned their chairs to the side to increase client privacy. The emotion-focused interventions began with a body scan to shift attention inward. During the intervention, clients were asked to describe images out loud in the present tense, beginning with an exploration phase in which they described a scene, such as a distressing past test situation. Clients were encouraged to focus on emotions, physical sensations, and thoughts. Once the clients’ experiences became clear and vivid, they were encouraged to take an observer’s perspective and identify what they would have needed in that specific situation. During the rescripting phases of sessions three and four, clients engaged with the situation as their adult selves to address the needs of their ‘vulnerable’ selves. When necessary, the therapists offered guidance or intervened to assist. In sessions five and six, clients attempted new actions in future scenarios, which were often hindered by internal conflicts. They were supported to negotiate these conflicts through a dialogue with the hindering part, with the aim of resolving them through imagery. The emotion-focused interventions aimed to activate experiences related to test anxiety (e.g., sensations, emotions, and cognitions). If successful, this activation was expected to result in an increase of subjective and objective indicators of physiological arousal (e.g., self-reported distress or HR), consistent with findings from in-sensu exposure (Halligan et al., 2006; Holmes & Mathews, 2010; Kozak et al., 1988; Prinz et al., 2019).
Cognitive-oriented interventions.
The cognitive-oriented interventions included psychoeducation on test anxiety (session 1), self-assessment and monitoring of sensations, feelings, cognitions, and behavior in exemplary test-related situations (session 2), identification and restructuring of maladaptive negative thoughts and behaviors linked to test-related scenarios (session 3), and the identification of behavioral goals and adaptive behavior related to studying (session 4) as well as test-taking (session 5). In session 6, all cognitive and behavioral skills learned in the previous sessions were reviewed and recapped. Similar to the emotion-focused interventions, these sessions included dialogues between clients and therapists. After all sessions, clients were assigned home-based practice.
The video-recorded sessions were reviewed to manually extract the timestamps of the individual intervention types (i.e., start and endpoint of baseline, emotion-focused, and cognitive-oriented intervention) within each session. These timestamps were used to structure the recorded data per person in terms of the session number and part (i.e., intervention type). There were technical difficulties with the video recording in one session. Since the video recordings were required to structure the data, this session was excluded from analyses.
Wearable Indicators of Physiological Arousal
The Garmin Vivosmart 4 wearable is a multisensory activity tracking device that computes data on HR, SR, and other physiological parameters based on PPG. HR was derived via PPG, that is, by measuring how much light is reflected to the photodiode sensor, and was provided every 15 s together with a timestamp. Participants were instructed to wear the device on the non-dominant hand to avoid movement artifacts due to, for example, writing exercises.
The wearable provided SR for 3-minute intervals. The calculation of the SR is based on both HR and HRV and takes into account how physically active the participant is (for instance, changes in body posture). This can lead to missing SR values. Overall, 7.87% of wearable SRs were missing, with 7.36% missing due to an insufficient number of data points and 0.51% missing due to movement. SR values range from 0 to 100. According to Garmin, values of 0–25 indicate no stress, 26–50 a low stress level, 51–75 a medium stress level, and 76–100 a high stress level.
Out of 203 recorded sessions, wearable HR and SR for both clients and therapists were missing for one session, and clients’ data for two additional sessions were missing because the wearable was mistakenly not worn.4 In addition, the wearable SR was not available for one CBT component as it lasted less than three minutes. Moreover, the wearable SR was only available for four baselines, since most baselines lasted less than three minutes.
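The four stress bands described above can be written as a small lookup. The following Python sketch is illustrative only: the function name `garmin_stress_category` is hypothetical, and the cut-offs are simply the ones quoted in the text.

```python
def garmin_stress_category(sr: float) -> str:
    """Map a stress rate (0-100) onto the four bands quoted in the text."""
    if not 0 <= sr <= 100:
        raise ValueError("stress rate must lie between 0 and 100")
    if sr <= 25:
        return "no stress"
    if sr <= 50:
        return "low"
    if sr <= 75:
        return "medium"
    return "high"


# Example: a stress rate of 33.3 falls into the "low" band.
print(garmin_stress_category(33.3))
```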
Stationary Indicators of Physiological Arousal
Lab ECG was recorded with a stationary ECG device (Becker Meditec, Karlsruhe, Germany) with a gain of 1230, sampled with a USB-6002 at 500 Hz and 16-bit resolution, and stored as an ASCII file. The ECG device used three electrodes: two placed on the right and left sides of the torso and one placed on the right collarbone (clavicle). HRV data were derived with Kubios HRV Premium software (Tarvainen et al., 2014). Visual inspection and manual editing of the data were completed by two graduate students and one postgraduate clinician to ensure proper removal of artifacts and ectopic beats. All editors took part in a training course in which about 20% of the sessions were edited and discussed together. The sessions were randomly assigned to the editors. Each editor edited 53 sessions on average (range: 47 to 64). To evaluate the editors’ agreement, data from three random sessions were preprocessed by two editors. The ICC was then calculated, indicating very good agreement between the two editors (ICC = 0.849). The ECG data were exported in the smallest possible interval of 30 s.
Lab EDA was recorded with the same device using the constant voltage method (Becker Meditec, Karlsruhe, Germany). The range is 0–100 µS, with a sensitivity of 25 mV/µS. The signal was sampled at 500 Hz (National Instruments multifunction module USB-6002) with a resolution of 16 bit using DasyLab V. 10 (National Instruments Ireland Resources, Limited). The signal was downsampled to 25 Hz and stored as an ASCII file. Moving averages and residuals were calculated over 10 s intervals for each participant, session, and part separately using the R package forecast (Hyndman et al., 2024). This step smoothed the data, which enables the detection of smaller effects and trends. The lab EDA and lab ECG data from 13 sessions were missing due to technical difficulties.
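To illustrate the smoothing step, the following Python sketch computes a centered moving average and the corresponding residuals. It is not the authors' R code: the function names are hypothetical, and the sketch assumes a plain centered average with an odd window and undefined values at the edges, which may differ in detail from the behavior of the forecast package. At 25 Hz, a 10 s interval corresponds to roughly 250 samples, so an odd window such as 251 would be the closest match under this assumption.

```python
def centered_moving_average(signal, window):
    """Centered moving average; edge positions without a full window are None."""
    if window < 1 or window % 2 == 0:
        raise ValueError("this sketch assumes an odd, positive window length")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        if i < half or i >= len(signal) - half:
            smoothed.append(None)  # not enough neighbours on one side
        else:
            smoothed.append(sum(signal[i - half:i + half + 1]) / window)
    return smoothed


def residuals(signal, smoothed):
    """Difference between the raw signal and its smoothed version."""
    return [None if s is None else x - s for x, s in zip(signal, smoothed)]


# Toy EDA trace (in µS) with a window of three samples.
eda = [1, 2, 3, 4, 5]
trend = centered_moving_average(eda, 3)
print(trend, residuals(eda, trend))
```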
Next-session Outcome
Before each session, clients were asked to assess their symptomatic distress on an 11-item short version of the Hopkins Symptom Checklist-25 (HSCL-11; Lutz et al., 2006). Clients assess on a 4-point Likert scale ranging from 1 (not at all) to 4 (extremely) the degree they suffered from the respective symptoms over the last 7 days. The mean of the 11 items served as a total score.
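The scoring rule described above (mean of 11 items, each rated from 1 to 4) can be sketched in a few lines of Python. The function name `hscl11_total` and the handling of invalid input are assumptions for illustration; the source does not describe how missing items were treated.

```python
from statistics import mean


def hscl11_total(items):
    """Return the HSCL-11 total score as the mean of the 11 item ratings."""
    if len(items) != 11:
        raise ValueError("HSCL-11 expects exactly 11 item ratings")
    if any(not 1 <= rating <= 4 for rating in items):
        raise ValueError("each rating must lie between 1 and 4")
    return mean(items)


# Ten items rated 2 and one item rated 4 give a total score of 24 / 11.
print(hscl11_total([2] * 10 + [4]))
```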
Analytic Approach
For the analyses, a final sample of N = 159 sessions from N = 16 clients and N = 10 therapists was included.
Assessing reliability.
The agreement between wearable HR/SR and lab ECG HR and lab EDA measurements was calculated based on the ICC and the Bland-Altman method. Mean HR in beats per minute was selected as the lab ECG HR value. ICC estimates and their 95% confidence intervals were calculated based on the mean of each measurement method, calculating consistency using a two-way mixed model (i.e., ICC(3,k); Koo & Li, 2016; Shrout & Fleiss, 1979). The model was defined as ICC = (MSBetween − MSError) / MSBetween. According to Cicchetti (1994), an ICC below 0.40 is considered poor, 0.40–0.59 fair, 0.60–0.74 good, and 0.75 or above very good. ICC estimates and 95% CIs were separately calculated for clients, therapists, and intervention types (i.e., baseline, emotion-focused, cognitive-oriented) to test for any systematic differences.
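The formula ICC = (MSBetween − MSError) / MSBetween can be made concrete with a short Python sketch that derives both mean squares from a two-way layout (rows are segments, columns are measurement methods). This is an illustration rather than the authors' R code; the function name `icc3k` and the toy data are hypothetical, and confidence intervals are not computed.

```python
def icc3k(ratings):
    """Consistency ICC for averaged measures from a two-way layout.

    ratings: list of rows, one row per segment, one value per method.
    """
    n = len(ratings)       # number of segments (rows)
    k = len(ratings[0])    # number of measurement methods (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_between = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_between - ms_error) / ms_between


# Toy HR pairs (wearable, lab ECG) that differ by a constant 2 bpm offset.
pairs = [[70, 72], [80, 82], [90, 92]]
print(icc3k(pairs))
```

Because this variant assesses consistency rather than absolute agreement, a constant offset between two methods, as in the toy pairs, does not lower the coefficient.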
Bland-Altman plots were used to explore mean and paired differences between all measurement methods, separately for HR and SR. Due to different scales of the wearable HR/SR and lab EDA, the variables were z-standardized before Bland-Altman plots were generated. 95% confidence intervals were calculated based on the mean difference serving as intervals of agreement.
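The numbers behind such a plot can be sketched as follows in Python. The sketch assumes the conventional Bland-Altman limits (mean difference ± 1.96 standard deviations of the differences); the source does not spell out its exact computation, and the function names and toy values are hypothetical.

```python
from statistics import mean, stdev


def zscore(values):
    """Standardize a series to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def bland_altman(a, b):
    """Return (mean difference, lower limit, upper limit) for paired series."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # assumed conventional 95% limits
    return bias, bias - spread, bias + spread


# Toy values: wearable SR (0-100) and lab EDA (µS) live on different scales,
# so both are z-standardized before differences are taken.
wearable_sr = [20, 35, 50, 42, 28]
lab_eda = [12.0, 18.5, 25.1, 19.0, 14.2]
bias, lower, upper = bland_altman(zscore(wearable_sr), zscore(lab_eda))
print(bias, lower, upper)
```

One consequence of standardizing both series is that the mean difference is zero by construction, so only the width of the limits carries information in that case.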
RMSSD, the primary measure of vagal activity (Gullett et al., 2023), was chosen as the laboratory ECG HRV metric. To evaluate the associations between wearable HR, SR and lab ECG RMSSD, Pearson correlation coefficients were computed. This method was selected because the ICC is primarily used to assess measurement consistency. As we anticipated negative associations, ICCs were not suitable for this analysis (Bartko, 1979).
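As a reminder of what these two quantities compute, the following Python sketch implements RMSSD from a list of successive inter-beat intervals and a plain Pearson correlation. In the study, RMSSD came from Kubios and the correlations from R; the function names and toy values here are hypothetical.

```python
from math import sqrt


def rmssd(rr_ms):
    """Root mean square of successive differences of inter-beat intervals (ms)."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return sqrt(sum(d * d for d in diffs) / len(diffs))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Successive differences are 10, -20 and 15 ms.
print(rmssd([800, 810, 790, 805]))
# A perfectly inverse toy relation between HR and RMSSD.
print(pearson_r([70, 80, 90], [60, 40, 20]))
```

Unlike the ICC, the Pearson coefficient can take negative values, which is why it suits the expected inverse relation between arousal measures and RMSSD.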
For the reliability analyses, one dataset was created containing 30 s intervals of wearable HR, lab ECG HR, lab ECG RMSSD, and lab EDA data. Another dataset was created including all measurements (i.e., wearable HR, wearable SR, lab ECG HR, lab ECG RMSSD, lab EDA) at 180 s intervals.
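Aligning the devices requires bringing their outputs onto a common time grid. The Python sketch below shows one way to do this, assuming a simple mean within fixed windows counted from the start of a session part; the source does not state how values were combined, and the function name `aggregate` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate(samples, width_s):
    """Average (time_in_seconds, value) samples within fixed-width windows.

    Returns a dict that maps each window start (in seconds) to the mean value.
    """
    bins = defaultdict(list)
    for t, value in samples:
        bins[int(t // width_s)].append(value)
    return {index * width_s: mean(values) for index, values in sorted(bins.items())}


# Wearable HR arrives every 15 s; two samples fall into each 30 s window.
hr_samples = [(0, 80), (15, 82), (30, 85), (45, 87)]
print(aggregate(hr_samples, 30))   # 30 s segments
print(aggregate(hr_samples, 180))  # 180 s segments, matching the SR interval
```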
Assessing the association between client physiological arousal and next-session outcome.
To test the association between client indicators of physiological arousal (i.e., lab ECG HR, lab RMSSD, lab EDA, wearable HR, and wearable SR), the type of intervention in each session, and the next-session outcome, we employed a three-step hierarchical linear modeling (HLM) approach. First, two-level HLMs with sessions nested within clients were applied to separately assess the predictive values of both stationary (lab ECG HR, lab ECG RMSSD, and lab EDA) and wearable (HR and SR) indicators of physiological arousal. Given that wearable SR data were captured in 3-minute intervals, we consistently applied this interval across all analyses. In a first model, the HSCL-11 score for the next session (HSCLc(s+1)) for client c was modeled using client lab ECG HR, lab ECG RMSSD, and lab EDA:
Second, to assess the impact of intervention type on the predictive value of physiological measures, we examined the interactions between intervention types and both stationary and wearable indicators of client physiological arousal using separate HLMs. Intervention types were contrast-coded: the first contrast variable (CV1) distinguished between baseline and both cognitive-oriented and emotion-focused interventions. The second contrast variable (CV2) differentiated cognitive-oriented from emotion-focused interventions. Due to the limited availability of only four baseline values for wearable SR, the analyses were restricted to wearable HR, lab ECG HR, lab EDA, and lab ECG RMSSD, each measured in 30-second intervals. In a first model, we incorporated client lab ECG HR, lab EDA and lab ECG RMSSD, along with their interactions with intervention types, to model the HSCL-11 score for the next session (HSCLc(s+1)) for client c:
$$
\begin{aligned}
\text{HSCL}_{c(s+1)} = \dots &+ \gamma_{02}\,\text{CV1} + \gamma_{03}\,\text{CV2} \\
&+ \gamma_{04}\,\text{lab ECG HR}_{cs} + \gamma_{05}\,\text{lab ECG HR}_{cs} \times \text{CV1} + \gamma_{06}\,\text{lab ECG HR}_{cs} \times \text{CV2} \\
&+ \gamma_{07}\,\text{lab ECG RMSSD}_{cs} + \gamma_{08}\,\text{lab ECG RMSSD}_{cs} \times \text{CV1} + \gamma_{09}\,\text{lab ECG RMSSD}_{cs} \times \text{CV2} \\
&+ \gamma_{10}\,\text{lab EDA}_{cs} + \gamma_{11}\,\text{lab EDA}_{cs} \times \text{CV1} + \gamma_{12}\,\text{lab EDA}_{cs} \times \text{CV2} \\
&+ u_{0c} + r_{cs}
\end{aligned}
$$
Here, CV1 contrasts baseline with the cognitive-oriented and emotion-focused interventions, CV2 contrasts cognitive-oriented with emotion-focused interventions, and all physiological predictors are person-mean centered.
In a second model, we included client wearable HR and SR, along with their interactions with intervention types, to model the HSCL-11 score for the next session (HSCLc(s+1)) for client c:
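$$
\begin{aligned}
\text{HSCL}_{c(s+1)} = \dots &+ \gamma_{02}\,\text{CV1} + \gamma_{03}\,\text{CV2} \\
&+ \gamma_{04}\,\text{wearable HR}_{cs} + \gamma_{05}\,\text{wearable HR}_{cs} \times \text{CV1} + \gamma_{06}\,\text{wearable HR}_{cs} \times \text{CV2} \\
&+ \gamma_{07}\,\text{wearable SR}_{cs} + \gamma_{08}\,\text{wearable SR}_{cs} \times \text{CV1} + \gamma_{09}\,\text{wearable SR}_{cs} \times \text{CV2} \\
&+ u_{0c} + r_{cs}
\end{aligned}
$$
Here, CV1 contrasts baseline with the cognitive-oriented and emotion-focused interventions, CV2 contrasts cognitive-oriented with emotion-focused interventions, and all physiological predictors are person-mean centered.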
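The two preprocessing ideas used in these models, contrast coding of intervention type and person-mean centering of the predictors, can be sketched in Python as follows. The numeric contrast weights are an assumption (Helmert-style codes that sum to zero); the source names the contrasts but does not report the weights. The names `CONTRASTS` and `person_mean_center` are hypothetical, and the sketch does not fit the multilevel model itself.

```python
from statistics import mean

# Assumed contrast weights; the source does not report the numeric codes.
CONTRASTS = {
    "baseline":  {"CV1": -2 / 3, "CV2": 0.0},
    "cognitive": {"CV1": 1 / 3, "CV2": -0.5},
    "emotion":   {"CV1": 1 / 3, "CV2": 0.5},
}


def person_mean_center(rows, key):
    """Subtract each client's own mean from their values of `key`."""
    by_client = {}
    for row in rows:
        by_client.setdefault(row["client"], []).append(row[key])
    client_means = {c: mean(vals) for c, vals in by_client.items()}
    return [
        {**row, key + "_c": row[key] - client_means[row["client"]]}
        for row in rows
    ]


# Toy rows: one observation per client and session part.
rows = [
    {"client": "a", "part": "baseline", "hr": 80},
    {"client": "a", "part": "emotion", "hr": 90},
    {"client": "b", "part": "cognitive", "hr": 70},
]
for row in person_mean_center(rows, "hr"):
    print(row["client"], row["hr_c"], CONTRASTS[row["part"]])
```

Centering within clients means that a coefficient describes deviations from a client's own typical level rather than differences between clients.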
Third, to examine the impact of measurement methods (i.e., lab ECG vs. wearable device) on the predictive value of the client’s HR across the different intervention types for next-session outcome, a separate HLM model was developed. In this model, the HSCL-11 score for the next session (HSCLc(s + 1)) for client c was modeled using 30-second interval HR data, incorporating three-way interactions between the HR, the measurement method (i.e., lab ECG and wearable), and the intervention types:
We report all data exclusions (if any), all manipulations, and all measures in the study. All preprocessing steps and statistical analyses were performed in R (4.3.1, R Core Team, 2021). The R-scripts are available by emailing the corresponding author. Data will not be available as participants did not consent to the public sharing of their data. This study’s design and its analysis were not pre-registered. This study was approved by the local research ethics committee (Nr. 01/2020, ethics committee of Trier University).
Results
The overall mean of lab ECG HR was M = 83.21 bpm (SD = 6.10; range = 48.68–146.76), of lab EDA M = 19.83 µS (SD = 10.36; range = 1.01–87.86), of lab RMSSD M = 40.11 ms (SD = 24.29; range = 2.55–449.36; see footnote 5), of wearable HR M = 79.10 bpm (SD = 11.18; range = 43–115), and of wearable SR M = 33.30 (SD = 24.71; range = 0–94). Table 2 displays the means and standard deviations in stationary and wearable parameters for each intervention type, for clients and therapists separately.
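Table 2
Mean and SD values for each method, role and intervention type

| Measure | Role | Intervention | Mean | SD |
|---|---|---|---|---|
| Lab ECG HR (bpm) | Clients | BL | 84.56 | 6.55 |
| Lab ECG HR (bpm) | Clients | E | 86.24 | 6.83 |
| Lab ECG HR (bpm) | Clients | C | 84.38 | 6.54 |
| Lab ECG HR (bpm) | Therapists | BL | 81.24 | 5.52 |
| Lab ECG HR (bpm) | Therapists | E | 80.90 | 5.20 |
| Lab ECG HR (bpm) | Therapists | C | 80.99 | 6.03 |
| Lab EDA (µS) | Clients | BL | 13.41 | 7.11 |
| Lab EDA (µS) | Clients | E | 20.19 | 10.60 |
| Lab EDA (µS) | Clients | C | 21.38 | 8.71 |
| Lab EDA (µS) | Therapists | BL | 14.48 | 9.14 |
| Lab EDA (µS) | Therapists | E | 19.69 | 10.91 |
| Lab EDA (µS) | Therapists | C | 19.57 | 10.36 |
| Lab RMSSD (ms) | Clients | BL | 44.55 | 33.80 |
| Lab RMSSD (ms) | Clients | E | 39.59 | 22.82 |
| Lab RMSSD (ms) | Clients | C | 40.01 | 22.32 |
| Lab RMSSD (ms) | Therapists | BL | 48.11 | 43.43 |
| Lab RMSSD (ms) | Therapists | E | 39.30 | 20.74 |
| Lab RMSSD (ms) | Therapists | C | 40.79 | 28.60 |
| Wearable HR (bpm) | Clients | BL | 84.99 | 8.67 |
| Wearable HR (bpm) | Clients | E | 81.09 | 11.07 |
| Wearable HR (bpm) | Clients | C | 81.20 | 10.55 |
| Wearable HR (bpm) | Therapists | BL | 79.54 | 11.08 |
| Wearable HR (bpm) | Therapists | E | 76.07 | 10.79 |
| Wearable HR (bpm) | Therapists | C | 78.14 | 11.50 |
| Wearable SR | Clients | BL | 35.12 | 26.68 |
| Wearable SR | Clients | E | 37.15 | 25.65 |
| Wearable SR | Clients | C | 34.69 | 22.27 |
| Wearable SR | Therapists | BL | 29.61 | 24.74 |
| Wearable SR | Therapists | E | 31.28 | 25.01 |
| Wearable SR | Therapists | C | 35.09 | 25.68 |

BL = Baseline, E = emotion-focused intervention, C = cognitive-oriented intervention. HR means and SD are reported in beats per minute (bpm), EDA in microsiemens (µS), and RMSSD in milliseconds (ms)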
Paired sample t-tests were conducted separately for clients and therapists to compare measurements between baseline and cognitive-oriented as well as emotion-focused interventions. No significant differences were observed between baseline and cognitive-oriented interventions for lab ECG HR (Clients: t(15) = 0.46, p = .651; Therapists: t(15) = 1.26, p = .228), lab ECG RMSSD (Clients: t(15) = 0.75, p = .426; Therapists: t(15) = 1.28, p = .221), or wearable HR (Clients: t(15) = 1.96, p = .069; Therapists: t(15) = 1.58, p = .134). However, clients’ lab EDA was significantly higher during cognitive-oriented interventions compared to baseline (t(15) = −6.27, p < .001, d = −1.62), whereas the corresponding comparison for therapists was not significant (t(15) = −1.77, p = .098). Several significant differences were observed between baseline and emotion-focused interventions. Clients’ lab ECG HR was significantly higher during emotion-focused interventions compared to baseline (t(15) = −2.70, p = .016, d = −0.70), while this effect was not significant for therapists (t(15) = 0.65, p = .527). Therapists showed significantly higher lab ECG RMSSD at baseline compared to emotion-focused interventions (t(15) = 2.67, p = .017, d = 0.69), whereas clients did not (t(15) = 1.31, p = .209). Wearable HR was significantly higher for therapists at baseline compared to emotion-focused interventions (t(15) = 2.15, p = .049, d = 0.55), but this effect was not significant for clients (t(15) = 1.78, p = .095). Clients’ lab EDA was significantly higher during emotion-focused interventions compared to baseline (t(15) = −6.03, p < .001, d = −1.56). For therapists, this difference did not reach significance (t(15) = −1.47, p = .161).
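The effect sizes reported for the paired comparisons can be related to their t statistics with a short sketch. The formula used for d is not stated in the text; the relation d = t / sqrt(df) below is an inference that happens to reproduce the reported values to within rounding, and the helper name `d_from_paired_t` is hypothetical.

```python
import math


def d_from_paired_t(t_value, df):
    """Effect size implied by a paired t statistic, assuming d = t / sqrt(df)."""
    return t_value / math.sqrt(df)


# (label, t, reported d), copied from the paragraph above; df = 15 throughout.
reported = [
    ("clients, lab EDA, BL vs. C", -6.27, -1.62),
    ("clients, lab ECG HR, BL vs. E", -2.70, -0.70),
    ("therapists, lab ECG RMSSD, BL vs. E", 2.67, 0.69),
    ("therapists, wearable HR, BL vs. E", 2.15, 0.55),
    ("clients, lab EDA, BL vs. E", -6.03, -1.56),
]

implied = []
for label, t_value, d_reported in reported:
    d_implied = d_from_paired_t(t_value, df=15)
    implied.append(d_implied)
    print(f"{label}: implied d = {d_implied:.3f}, reported d = {d_reported:.2f}")
```

Small discrepancies in the second decimal are expected because the t values themselves are rounded.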
Intra-class Correlation Wearable HR, Lab ECG HR and Lab EDA
Based on Cicchetti (1994), the overall agreement between wearable HR and lab ECG HR can be interpreted as very good, ICC = .794 (95% CI [.787–.801]). The ICC estimates were calculated separately for clients and therapists, with a slightly higher ICC for therapists (ICC = .818, 95% CI [.808–.826]) compared to clients (ICC = .747, 95% CI [.734–.760]). In addition, ICCs were calculated separately for each intervention type, with the lowest ICC estimated for the baseline (ICC = .626, 95% CI [.561–.682]), followed by the emotion-focused (ICC = .786, 95% CI [.776–.796]) and cognitive-oriented intervention (ICC = .840, 95% CI [.830–.849]). The ICC between wearable HR and lab EDA was poor (Cicchetti, 1994) across clients, therapists and intervention types. Table 3 provides an overview of the ICCs including the F-tests.
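Table 3
Overview ICC between wearable HR and lab ECG HR within 30s intervals

| Comparison | Subset | ICC | 95% CI lower bound | 95% CI upper bound | F (true value 0) | df1 | df2 | Sig. |
|---|---|---|---|---|---|---|---|---|
| Wearable HR and lab ECG HR | All | .794 | .787 | .801 | 4.85 | 12329 | 12329 | <.001 |
| Wearable HR and lab ECG HR | Clients | .747 | .734 | .760 | 3.96 | 6067 | 6067 | <.001 |
| Wearable HR and lab ECG HR | Therapists | .818 | .808 | .826 | 5.48 | 6261 | 6261 | <.001 |
| Wearable HR and lab ECG HR | BL | .626 | .561 | .682 | 2.68 | 599 | 599 | <.001 |
| Wearable HR and lab ECG HR | E | .786 | .776 | .796 | 4.68 | 7546 | 7546 | <.001 |
| Wearable HR and lab ECG HR | C | .840 | .830 | .849 | 6.24 | 4182 | 4182 | <.001 |
| Wearable HR and lab EDA | All | .264 | .238 | .290 | 1.36 | 12329 | 12329 | <.001 |
| Wearable HR and lab EDA | Clients | .222 | .182 | .260 | 1.28 | 6067 | 6067 | <.001 |
| Wearable HR and lab EDA | Therapists | .286 | .250 | .321 | 1.40 | 6261 | 6261 | <.001 |
| Wearable HR and lab EDA | BL | .365 | .254 | .459 | 1.57 | 599 | 599 | <.001 |
| Wearable HR and lab EDA | E | .299 | .267 | .330 | 1.43 | 7546 | 7546 | <.001 |
| Wearable HR and lab EDA | C | .221 | .173 | .267 | 1.28 | 4182 | 4182 | <.001 |

BL = Baseline, E = emotion-focused, C = cognitive-oriented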
As can be seen in Fig. 1, the ICC estimate was higher in cognitive-oriented interventions compared to baseline and emotion-focused interventions as well as in emotion-focused interventions compared to baseline. Moreover, the ICC estimate was higher in therapists compared to clients. Between ICC estimates for wearable HR and lab EDA, no differences could be identified (see Online Resource 1).
Fig. 1. Wearable HR and lab ECG HR ICC comparisons for each intervention type and role
Note. Depicted are ICC estimates with error bars showing their 95% confidence intervals. (a) Baseline (BL): n = 600, emotion-focused (E): n = 7547, cognitive-oriented (C): n = 4183; (b) Client: n = 6068, Therapist: n = 6262.
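The ICC and F values in Table 3 can be tied together with a small sketch. The tabled values satisfy ICC ≈ 1 − 1/F, which is the relation that holds for an average-measures, consistency-type ICC from a two-way layout; whether this is exactly the ICC form used in the study is an assumption inferred from the numbers, not something the text states. The function name and the toy HR pairs are hypothetical.

```python
def consistency_icc_average(pairs):
    """Average-measures, consistency-type ICC for n targets and k methods.

    Returns (icc, f_value) with f_value = MS_rows / MS_error and
    icc = (MS_rows - MS_error) / MS_rows, i.e. icc = 1 - 1 / f_value.
    """
    n = len(pairs)
    k = len(pairs[0])
    grand = sum(sum(row) for row in pairs) / (n * k)
    row_means = [sum(row) / k for row in pairs]
    col_means = [sum(row[j] for row in pairs) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in pairs for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows, ms_rows / ms_error


# Hypothetical toy data: (wearable HR, lab ECG HR) for four segments, in bpm.
toy_pairs = [(60.0, 62.0), (70.0, 69.0), (80.0, 83.0), (90.0, 88.0)]
toy_icc, toy_f = consistency_icc_average(toy_pairs)
print(f"toy ICC = {toy_icc:.3f}, toy F = {toy_f:.1f}")

# Rows of Table 3 (wearable HR and lab ECG HR): (subset, F, reported ICC).
table3 = [
    ("All", 4.85, 0.794),
    ("Clients", 3.96, 0.747),
    ("Therapists", 5.48, 0.818),
    ("BL", 2.68, 0.626),
    ("E", 4.68, 0.786),
    ("C", 6.24, 0.840),
]
for subset, f_value, icc_reported in table3:
    print(f"{subset}: 1 - 1/F = {1 - 1 / f_value:.3f}, reported ICC = {icc_reported:.3f}")
```

With two methods per segment, the error degrees of freedom equal (n − 1)(k − 1) = n − 1, which matches the identical df1 and df2 entries in the table.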
Differences between Wearable HR, Lab ECG HR, and Lab EDA
Bland-Altman plots were utilized to assess the agreement between the wearable and laboratory devices (see Fig. 2). In doing so, pairwise differences between the wearable HR and lab ECG HR are shown on the y-axis, and the mean HR derived from both methods is displayed on the x-axis (i.e., [(HR wearable + HR lab ECG)/2]).
Fig. 2. Bland-Altman plot for wearable HR and lab ECG HR
Note. The Bland-Altman plot for the HR shows the difference between the two measurement methods (wearable and lab ECG) plotted against their mean values. The central dotted line represents the mean difference (i.e., measurement bias), while the upper and lower dotted lines indicate the limits of agreement, representing the range within which most differences between the two methods are expected to fall, assuming normal distribution.
The mean measurement bias between wearable and lab ECG HR was −4.11 bpm, with limits of agreement of −23.17 bpm (lower) and 14.95 bpm (upper). This negative bias indicates that, on average, the wearable device measured HR 4.11 bpm lower than the lab ECG (see footnote 6). Bland-Altman plots were also calculated to compare wearable HR and lab EDA, using z-standardized values because of the different scales. Lab EDA values were higher than wearable HR values (see Online Resource 2).
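The Bland-Altman quantities reported above can be sketched as follows. The multiplier of 1.96 for the limits of agreement is the conventional choice and is assumed here, since the text does not state it; the toy HR values are hypothetical. Under that assumption, the reported limits imply a standard deviation of the pairwise differences of roughly 9.7 bpm.

```python
import statistics


def bland_altman(method_a, method_b, z=1.96):
    """Return (bias, lower limit, upper limit) for paired differences a - b."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the pairwise differences
    return bias, bias - z * sd, bias + z * sd


# Hypothetical toy values in bpm, for illustration only.
wearable = [78.0, 81.0, 75.0, 90.0, 84.0]
lab_ecg = [82.0, 86.0, 80.0, 92.0, 89.0]
bias, lower, upper = bland_altman(wearable, lab_ecg)
print(f"toy bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")

# Reported values from the text: the limits are symmetric around the bias,
# so the SD of the differences can be recovered (assuming z = 1.96).
reported_lower, reported_upper = -23.17, 14.95
implied_bias = (reported_lower + reported_upper) / 2
implied_sd = (reported_upper - reported_lower) / (2 * 1.96)
print(f"implied bias = {implied_bias:.2f} bpm, implied SD = {implied_sd:.2f} bpm")
```

The width of the limits, rather than the bias alone, indicates how far a single 30-second wearable reading may deviate from the lab ECG.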
Correlation between Wearable HR and Lab ECG RMSSD
To quantify the association between wearable HR and lab ECG RMSSD, a Pearson’s correlation coefficient was calculated. A moderate negative correlation was found (r = −.42, t(12161) = −50.78, p < .001), indicating that higher wearable HR values were associated with lower values in lab ECG RMSSD. Table 4 provides an overview of the correlation coefficients across roles and intervention types.
Table 4
Overview Pearson’s correlation coefficients between wearable HR and lab ECG RMSSD within 30s intervals

| Role / intervention type | r | 95% CI lower bound | 95% CI upper bound | Sig. |
|---|---|---|---|---|
| All | −.418 | −.433 | −.403 | <.001 |
| Clients | −.473 | −.492 | −.453 | <.001 |
| Therapists | −.384 | −.405 | −.362 | <.001 |
| BL | −.177 | −.253 | −.098 | <.001 |
| E | −.474 | −.491 | −.456 | <.001 |
| C | −.419 | −.444 | −.393 | <.001 |
As shown in Fig. 3, Pearson’s correlation coefficients were more strongly negative during cognitive-oriented and emotion-focused interventions compared to baseline, and in emotion-focused compared to cognitive-oriented interventions. Furthermore, the correlation coefficients were notably more negative for clients compared to therapists.
Fig. 3. Comparison of Pearson’s correlation coefficients for wearable HR and lab ECG RMSSD across intervention types and roles
Note. Depicted are Pearson’s correlation coefficients with error bars showing their 95% confidence intervals. (a) Baseline (BL): n = 600, emotion-focused (E): n = 7547, cognitive-oriented (C): n = 4183. (b) Client: n = 6068, Therapist: n = 6262.
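The test statistic and confidence interval reported for the overall correlation between wearable HR and lab ECG RMSSD can be approximated from r and the degrees of freedom alone. The sketch below uses the standard t transformation of a correlation and a Fisher z interval; the text does not say how the interval was computed, so the Fisher z approach is an assumption, and the function names are hypothetical.

```python
import math


def t_from_r(r, df):
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for r via the Fisher z transform."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Overall row of Table 4: r = -.418 with df = 12161, hence n = df + 2 pairs.
r_all, df_all = -0.418, 12161
t_all = t_from_r(r_all, df_all)
ci_low, ci_high = fisher_ci(r_all, df_all + 2)
print(f"t({df_all}) = {t_all:.2f}")
print(f"95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Because r is rounded to three decimals, the recomputed t differs slightly from the reported value of −50.78.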
Intra-class Correlation between Wearable SR, Lab ECG HR, and Lab EDA
Based on Cicchetti (1994), the overall agreement between wearable SR and lab ECG HR was good, ICC =.603 (95% CI [.563–.640]). Between wearable SR and wearable HR, the overall agreement was very good, ICC =.751 (95% CI [.725–.774]). However, the overall agreement between wearable SR and lab EDA was poor, ICC =.039 (95% CI [–.058–.127]). Table 5 provides an overview of the ICC estimates for each role and intervention type separately, additionally including the F-tests.
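Table 5
Overview of ICC estimates between wearable SR and lab ECG and lab EDA

| Comparison | Subset | ICC | 95% CI lower bound | 95% CI upper bound | F (true value 0) | df1 | df2 | Sig. |
|---|---|---|---|---|---|---|---|---|
| SR and lab ECG HR | All | .603 | .563 | .640 | 2.52 | 1655 | 1655 | <.001 |
| SR and lab ECG HR | Clients | .535 | .467 | .595 | 2.15 | 819 | 819 | <.001 |
| SR and lab ECG HR | Therapists | .645 | .593 | .690 | 2.82 | 835 | 835 | <.001 |
| SR and lab ECG HR | BL | .768 | −8.05 | .994 | 4.31 | 2 | 2 | .188 |
| SR and lab ECG HR | E | .572 | .518 | .619 | 2.33 | 1106 | 1106 | <.001 |
| SR and lab ECG HR | C | .671 | .611 | .722 | 3.04 | 545 | 545 | <.001 |
| SR and wearable HR | All | .751 | .725 | .774 | 4.01 | 1655 | 1655 | <.001 |
| SR and wearable HR | Clients | .714 | .671 | .750 | 3.49 | 819 | 819 | <.001 |
| SR and wearable HR | Therapists | .775 | .742 | .804 | 4.44 | 835 | 835 | <.001 |
| SR and wearable HR | BL | .765 | −8.17 | .994 | 4.25 | 2 | 2 | .190 |
| SR and wearable HR | E | .745 | .713 | .773 | 3.92 | 1106 | 1106 | <.001 |
| SR and wearable HR | C | .763 | .720 | .800 | 4.22 | 545 | 545 | <.001 |
| SR and lab EDA | All | .039 | −.058 | .127 | 1.04 | 1655 | 1655 | .208 |
| SR and lab EDA | Clients | .057 | −.081 | .178 | 1.06 | 819 | 819 | .199 |
| SR and lab EDA | Therapists | .010 | −.134 | .135 | 1.01 | 835 | 835 | .444 |
| SR and lab EDA | BL | −.064 | −40.48 | .973 | 0.94 | 2 | 2 | .515 |
| SR and lab EDA | E | .012 | −.111 | .122 | 1.01 | 1106 | 1106 | .418 |
| SR and lab EDA | C | .100 | −.066 | .238 | 1.11 | 545 | 545 | .112 |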
Differences in ICC estimates for wearable SR across roles and intervention types are visualized in Online Resources 3 and 4. Since Garmin SR values were available for only four baseline measurements, the baseline was excluded from the comparison. No meaningful differences in ICC estimates were observed across methods, roles, and intervention types.
Differences between Wearable SR, Lab ECG HR, and Lab EDA
Bland-Altman plots were utilized to assess the agreement between the wearable SR and each of the following metrics: lab ECG HR, lab EDA, and wearable HR. For the comparison between wearable SR and lab EDA, z-standardized values were used. Pairwise differences between z-standardized wearable SR and lab EDA are shown on the y-axis, and the mean stress indicator derived from both methods is displayed on the x-axis (i.e., [(z wearable SR + z EDA)/2]). All Bland-Altman plots showed a mean bias around zero with varying limits of agreement (see Online Resource 5).
Correlation between Wearable SR and Lab ECG RMSSD
To quantify the association between wearable SR and lab ECG RMSSD, Pearson’s correlation coefficient was calculated. Consistent with the results for wearable HR, a moderate negative correlation was found (r = −.41, t(1626) = −18.02, p < .001). This indicates that higher values of wearable SR were associated with lower lab ECG RMSSD values. The correlation coefficients did not differ between intervention types and roles (see Online Resources 6 and 7).
Association between Different Indicators of Physiological Arousal and Next-session Outcome
Pearson’s correlation coefficients were calculated between client prior session HSCL values and wearable and stationary indicators of physiological arousal with next-session HSCL values for each intervention type separately. Table 6 provides detailed information.
Table 6
Overview of Pearson’s correlation coefficients between physiological measurements and next-session outcome

| Predictor | Intervention | r (next-session HSCL) | p |
|---|---|---|---|
| HSCL (prior session) | | .550 | <.001 |
| lab ECG HR | I | .420 | <.001 |
| lab ECG HR | CBT | .454 | <.001 |
| lab EDA | I | −.098 | .458 |
| lab EDA | CBT | −.024 | .868 |
| lab ECG RMSSD | I | −.283 | .029 |
| lab ECG RMSSD | CBT | −.269 | .056 |
| wearable HR | I | .397 | .002 |
| wearable HR | CBT | .354 | .011 |
| wearable SR | I | .137 | .295 |
| wearable SR | CBT | .236 | .095 |

r = Pearson’s correlation coefficient. HSCL was assessed once per session. Baselines were excluded since only four datapoints were available for wearable SR
The HLM model with next-session HSCL as the criterion variable and with session-level client lab EDA, lab ECG HR, and lab ECG RMSSD revealed that only client lab ECG HR was positively associated with client next-session HSCL levels (b = 0.02, 95% CI [0.01–0.04], p =.001). For detailed information see Table 7.
Table 7
Next-session outcome on the HSCL predicted by prior session levels of stationary indicators of client physiological arousal (180s)

| Predictor | b [95% CI] | p |
|---|---|---|
| Intercept | 1.75 [1.58, 1.93] | <.001 |
| Client prior session HSCL | −0.06 [−0.29, 0.17] | .618 |
| Client prior session lab ECG HR | 0.02 [0.01, 0.04] | .001 |
| Client prior session lab EDA | −6.56 [−73.26, 60.14] | .846 |
| Client prior session lab RMSSD | 0.01 [−0.00, 0.02] | .126 |

HSCL = Hopkins Symptom Checklist
The HLM model with next-session HSCL as the criterion variable and with session-level client wearable SR and wearable HR as predictors revealed that only client wearable HR was positively associated with client next-session HSCL levels (b = 0.02, 95% CI [0.01–0.03], p = .003). For details see Table 8.
Table 8
Next-session outcome in HSCL predicted by prior session levels of wearable indicators of client physiological arousal (180s)

| Predictor | b [95% CI] | p |
|---|---|---|
| Intercept | 1.76 [1.56, 1.96] | <.001 |
| Client prior session HSCL | 0.02 [−0.20, 0.25] | .829 |
| Client prior session wearable HR | 0.02 [0.01, 0.03] | .003 |
| Client prior session SR | −0.00 [−0.01, 0.00] | .166 |

HR = Heart Rate, SR = Stress Rate. Only emotion-focused and cognitive-oriented interventions were included, as only four SR measurements were available for baseline
Furthermore, the predictive value of client stationary and wearable physiological arousal indicators was unaffected by the types of session interventions. For detailed results, see Table 9.
Table 9
Next-session outcome on the HSCL predicted by prior session levels of stationary and wearable indicators of client physiological arousal and interactions with intervention type (30 s)

| Model | Predictor | b [95% CI] | p |
|---|---|---|---|
| Stationary indicators | Intercept | 1.74 [1.56, 1.93] | <.001 |
| Stationary indicators | Client prior session HSCL | 0.01 [−0.17, 0.19] | .944 |
| Stationary indicators | CV1 [BL vs. C and E] | −0.00 [−0.07, 0.07] | .977 |
| Stationary indicators | CV2 [C vs. E] | 0.01 [−0.06, 0.07] | .833 |
| Stationary indicators | Client prior session lab ECG HR | 0.01 [0.01, 0.02] | .003 |
| Stationary indicators | Client prior session lab EDA | 32.66 [−18.36, 83.68] | .208 |
| Stationary indicators | Client prior session lab RMSSD | −0.00 [−0.00, 0.01] | .129 |
| Stationary indicators | Client prior session lab ECG HR * CV1 | −0.01 [−0.02, 0.00] | .205 |
| Stationary indicators | Client prior session lab ECG HR * CV2 | 0.01 [−0.00, 0.02] | .170 |
| Stationary indicators | Client prior session lab EDA * CV1 | −19.10 [−76.25, 38.05] | .510 |
| Stationary indicators | Client prior session lab EDA * CV2 | 48.03 [−25.57, 121.63] | .199 |
| Stationary indicators | Client prior session lab RMSSD * CV1 | −0.00 [−0.01, 0.00] | .534 |
| Stationary indicators | Client prior session lab RMSSD * CV2 | 0.00 [−0.01, 0.01] | .540 |
| Wearable indicators | Intercept | 1.75 [1.56, 1.94] | <.001 |
| Wearable indicators | Client prior session HSCL | 0.06 [−0.12, 0.23] | .529 |
| Wearable indicators | CV1 [BL vs. C and E] | −0.02 [−0.09, 0.06] | .674 |
| Wearable indicators | CV2 [C vs. E] | 0.01 [−0.05, 0.07] | .777 |
| Wearable indicators | Client prior session wearable HR | 0.01 [0.01, 0.02] | <.001 |
| Wearable indicators | Client prior session wearable HR * CV1 | −0.00 [−0.01, 0.01] | .852 |
| Wearable indicators | Client prior session wearable HR * CV2 | −0.01 [−0.01, 0.01] | .608 |

BL = Baseline, C = cognitive-oriented interventions, E = emotion-focused interventions
Additionally, the predictive value of client HR was unaffected by the measurement method (i.e., lab ECG vs. wearable) across the intervention types. For detailed results, refer to Table 10. Client prior session wearable and lab HR as well as their combination remained significant predictors of next-session HSCL-11 (lab ECG HR: b = 0.01, 95% CI [0.01–0.02], p = .003; wearable HR: b = 0.01, 95% CI [0.01–0.02], p < .001; HR combined: b = 0.02, 95% CI [0.00–0.03], p = .022; Tables 9 and 10).
Table 10
Next-session outcome on the HSCL predicted by prior session levels of client HR and interaction between measurement method and intervention type (30 s)

| Predictor | b [95% CI] | p |
|---|---|---|
| Intercept | 1.76 [1.54, 1.97] | <.001 |
| Client prior session HSCL | 0.05 [−0.07, 0.17] | .391 |
| Client prior session HR | 0.02 [0.00, 0.03] | .022 |
| Method [lab ECG vs. wearable] | −0.01 [−0.07, 0.06] | .879 |
| CV1 [BL vs. C and E] | −0.03 [−0.18, 0.13] | .736 |
| CV2 [C vs. E] | 0.01 [−0.12, 0.14] | .886 |
| Client prior session HR * Method | −0.00 [−0.02, 0.01] | .356 |
| Client prior session HR * CV1 | 0.00 [−0.02, 0.03] | .686 |
| Client prior session HR * CV2 | −0.00 [−0.02, 0.02] | .929 |
| Method * CV1 | 0.01 [−0.09, 0.11] | .826 |
| Method * CV2 | −0.00 [−0.08, 0.08] | .982 |
| Client prior session HR * Method * CV1 | −0.01 [−0.02, 0.01] | .454 |
| Client prior session HR * Method * CV2 | 0.00 [−0.01, 0.01] | .614 |

BL = Baseline, C = cognitive-oriented interventions, E = emotion-focused interventions
Discussion
The present study aimed to validate Garmin wearables comparing their HR and SR measurements to gold-standard stationary ECG and EDA in a psychotherapeutic context, while also exploring the impact of session content, in the sense of the degree of emotional activation, on measurement accuracy. Additionally, we investigated the predictive validity of client physiological arousal indicators, measured either with stationary or wearable devices, for next-session outcome, accounting for potential impacts of the intervention type. In summary, wearable HR demonstrated overall very good agreement with laboratory-derived HR. However, wearables tended to underestimate HR compared to laboratory ECG. Notably, client HR was predictive of next-session outcome, irrespective of the measurement method or intervention type employed.
In line with part A of our first hypothesis, the agreement between wearable HR and lab ECG HR, calculated across clients, therapists and intervention types, was very good (Cicchetti, 1994). When considering session content in the sense of the degree of emotional activation, the agreement between wearable HR and lab ECG HR was higher in cognitive-oriented interventions than in baseline and emotion-focused interventions. Although the accuracy of wearable devices in measuring HR seems to be lower in emotion-focused interventions, and therefore in states of higher emotional activation (see limitations for further details), the agreement with laboratory ECG is still very good based on typical conventions (Cicchetti, 1994). These findings align with the result that the agreement between the devices measuring HR was higher for therapists than for clients, potentially reflecting greater emotional activation in clients. Both measurement methods revealed higher HR for clients than for therapists, with exploratory t-tests confirming these differences as statistically significant for both wearable HR (t(12328) = 22.17, p < .001) and lab ECG HR (t(12319) = 21.01, p < .001). Additionally, clients’ lab ECG HR and lab EDA were significantly higher in emotion-focused interventions than at baseline, further indicating increased emotional activation in these session parts. However, the ICC was significantly higher in emotion-focused interventions than at baseline, which contradicts the assumption of accuracy loss due to emotional activation. Since the baseline lasted only around two minutes, fewer data points were available for this intervention type than for the emotion-focused and cognitive-oriented interventions, which in turn decreased statistical power.
Given these inconsistencies, the assumption that emotional activation could impact the accuracy of wearable HR measurements warrants further investigation (see limitations for details). In line with previous studies (for an overview see Evenson & Spade, 2020), the Bland-Altman analysis indicated a systematic difference between the two HR measurement methods, with the wearable HR tending to underestimate HR compared to the gold-standard laboratory device.
As several studies have found a significant association between HR and EDA, as indicators of the ANS, a moderate correlation between these two parameters was expected in part B of our first hypothesis (e.g., Kettunen et al., 1998; Lazarus et al., 1963). Contrary to our hypothesis, we found only poor agreement between wearable HR and lab EDA, with no differences observed across roles or intervention types. However, this lack of association does not appear to be due to inaccuracies in the wearable device, as laboratory-assessed HR and EDA were likewise only poorly correlated (for detailed results, see Online Resource 8). A possible explanation for this result could be that the measurement interval of 30 s was too broad. Kettunen et al. (1998) highlight the importance of small measurement intervals (i.e., < 10 s) to detect associations between HR and EDA parameters. Furthermore, studies reporting such an association between HR and EDA were usually conducted in well-controlled experimental contexts, contrasting with the naturalistic setting of our study.
Consistent with part C of our first hypothesis, we observed a moderate negative correlation between wearable HR and lab ECG RMSSD. This correlation was more pronounced during emotion-focused interventions compared to both baseline and cognitive-oriented interventions. Additionally, correlation coefficients were more strongly negative for clients compared to therapists. Given the anticipated heightened physiological arousal in clients, especially during emotion-focused interventions, these findings support the prevailing hypothesis of an inverse relationship between sympathetic and parasympathetic activities in states of increased physiological arousal.
Parts A and B of our second hypothesis, which anticipated a moderate ICC between the wearable SR, lab ECG HR and stationary EDA, were only partially confirmed. We observed a good agreement between wearable SR and lab ECG HR, but only poor agreement between wearable SR and lab EDA. This result might again be due to an overly broad measurement interval (i.e., 180 s) and the naturalistic setting of the study. However, our finding that the ICC between wearable SR and lab ECG HR or wearable HR was good to very good aligns with our hypothesis and accordingly indicates that SR might also be an index of the sympathetic and parasympathetic branches of the ANS. This is further supported by our observation that wearable SR was negatively correlated with lab ECG RMSSD, as we expected in part C of our second hypothesis. In contrast to the wearable HR findings, no differences in wearable SR ICCs and Pearson’s correlation coefficients across intervention types or roles were observed. Moreover, Bland-Altman analyses indicated no systematic differences between wearable SR and lab ECG HR or lab EDA. However, these analyses may lack statistical power, since wearable SR is only provided for three-minute intervals, resulting in fewer data points than in the HR analyses.
According to our third, primarily exploratory hypothesis, we anticipated that both client stationary and wearable indicators of physiological arousal during emotion-focused interventions, but not during baseline or cognitive-oriented interventions, would be associated with session-level outcomes. Contrary to that, the sole predictors of next-session HSCL values were average client lab ECG HR and wearable HR, with no impact of the intervention type. Higher levels of physiological arousal, as indicated by HR, were associated with higher next-session symptom severity (for a similar finding see Halligan et al., 2006). Crucially, the predictive utility of HR did not differ between laboratory ECG measurements and wearable devices. Lab EDA, lab ECG RMSSD, wearable SR and intervention type were not related to next-session symptom severity. This finding partially aligns with Prinz et al. (2022), who found no significant association between average physiological arousal (indicated by EDA measurements) and next-session outcome. The authors suggest that physiological arousal, measured by indices derived solely from the sympathetic branch of the autonomic nervous system, may reflect simple distress and no deep or meaningful emotional processing. This could similarly apply to wearable SR, especially given the sparse information on how SR is calculated and the potential for emotional dynamics to be obscured when averaging SR over 3-minute segments. Furthermore, RMSSD, which predominantly reflects the parasympathetic branch’s regenerative capacity (Gullett et al., 2023), does not account for sympathetic activation, thereby providing an incomplete picture of physiological arousal. 
Our findings suggest that a comprehensive analysis incorporating parameters that reflect influences from both the sympathetic and parasympathetic branches of the ANS, such as heart rate (Thayer et al., 2009), may offer a more holistic understanding of physiological responses and psychotherapeutic change processes. Nevertheless, this assumption requires further investigation.
Limitations and Future Directions
Regarding the HR parameter, our findings indicate a very good agreement with the gold-standard stationary ECG. However, wearable HR values were consistently slightly lower than those derived from the lab ECG, with considerable variability, as indicated by the relatively large limits of agreement in the Bland-Altman plot. Nonetheless, the acceptable range for agreement depends on the research context. For example, in laboratory settings where precision is critical, the observed level of variability in wearable HR may be unacceptable. However, in natural psychotherapeutic settings, where wearables are used for general monitoring or trend analysis (e.g., observing changes over time), this level of agreement might be acceptable for investigating specific research questions that could not be investigated otherwise due to the impracticality of stationary equipment. A possible limitation affecting the accuracy of wearable HR measurement in our study is subtle (arm) movement (Collins et al., 2019; Evenson & Spade, 2020). Both clients and therapists remained seated in chairs that did not swivel throughout all interventions and wore the wearables on their non-dominant hand. Nevertheless, mandatory note-taking, which was required in some cognitive-oriented sessions, could have led to movement artifacts causing inaccuracy in wearable measurements. We observed a slightly lower ICC for HR in clients compared to therapists and during emotion-focused interventions, which may indicate an impact of emotional activation on wearable measurement accuracy. Alternatively, this finding could also reflect more pronounced movements in clients during emotion-focused sessions rather than higher degrees of emotional activation (Wilhelm et al., 2010). Laboratory ECG HR and laboratory EDA were significantly higher in clients during emotion-focused interventions compared to baseline, a difference not captured by the wearable.
This discrepancy could again reflect either the wearable’s reduced accuracy in states of emotional activation or movement-induced artifacts interfering with wearable measurements. Supporting the latter assumption, therapists’ wearable HR was significantly higher during baseline compared to emotion-focused interventions, while lab ECG HR showed no such difference. Furthermore, lab ECG RMSSD revealed higher values, and therefore higher parasympathetic activity, in therapists during baseline compared to emotion-focused interventions, contradicting the wearable HR finding that therapists had a higher HR during baseline. These inconsistencies further suggest that movement artifacts may influence the accuracy of wearable HR measurement. Future studies should therefore consider wearables that can track subtle movements and include such data as covariates in analyses. Additionally, the level of speech varied across intervention types, being absent during baseline and varying during cognitive-oriented and emotion-focused interventions. This variation was not analyzed in the current study. Given that speech may have influenced HR, future research should consider analyzing this activity during interventions and integrating it into analyses as a potential covariate.
Regarding the SR parameter, we observed good agreement with the gold-standard stationary ECG and poor agreement with the gold-standard stationary EDA measurements. The SR is only provided for three-minute intervals. In our study, the baseline interventions of the manualized treatment protocol were approximately two minutes in duration, resulting in only four baseline SR measurements. This limited sampling precludes any robust conclusions about physiological dynamics across baseline and both emotion-focused and cognitive-oriented interventions. However, even discarding the baseline in our data, the evaluation of the SR parameter raises some concerns. The proprietary nature of Garmin’s algorithm for calculating SR further challenges its application in research. The lack of transparency in the SR calculation process (e.g., specific HRV parameter) and the potential for changes in the algorithm with new device releases hinder long-term replications of studies using this SR parameter. Moreover, recent research by Siepe et al. (2024) involving a large sample (n = 781, t = 85 days) indicated that the SR did not correlate well with self-reported stress levels, questioning the actual captured construct of this SR parameter. Given these findings, researchers should critically assess the added value of the SR parameter in psychotherapeutic research, considering the current limitations and the opaque methodology used in its calculation.
Regarding predictive validity, we found a positive association between client HR and the severity of symptoms in the next session, consistent across different measurement methods and types of interventions. However, our study was not pre-registered and did not include an a priori power analysis for the prediction model. Moreover, the results are based on a small, homogeneous sample of students with heightened test anxiety, which limits the generalizability of our findings. Additionally, our analysis did not consider person-specific, dynamic, bidirectional interactions between wearable HR measurements and outcome measures. Given that wearable HR has shown potential validity, future research should examine the individual and reciprocal effects of HR and other physiological parameters on psychotherapeutic processes and outcomes. Personalized dynamic network models could be utilized to explore these interactions further (e.g., Hofmann et al., 2020; Fisher et al., 2017). The high-resolution data provided by wearables offer a unique opportunity to investigate the temporal relationships between HR and individual change processes during psychotherapy (Bommer et al., 2024). This approach could enhance our understanding of how physiological responses are intertwined with therapeutic progress.
Conclusion
This study contributes to the growing body of research exploring the role of physiological dynamics in understanding psychotherapeutic change processes. Our findings demonstrate moderate to high associations of wearable HR and wearable SR with laboratory ECG HR. Furthermore, both stationary and wearable HR emerged as significant predictors of client next-session outcome, regardless of the intervention type. This finding suggests that incorporating parameters reflecting both activating and regulating information could contribute to a more comprehensive understanding of physiological responses and psychotherapeutic change processes. While HR measured by wearables showed accuracy and predictive validity comparable to stationary HR measurements, it is important to note that wearable-derived HR values were consistently lower than those from laboratory ECG. Researchers should carefully evaluate the suitability of wearable-derived measurements for their specific research objectives, considering potential influences on measurement accuracy from subtle movements.
Declarations
Competing Interests
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Footnote 6: In the Bland-Altman analysis, one notable outlier was observed in the upper left corner of the plot, indicating a significantly higher wearable HR compared to the lab ECG HR. This can be traced back to a measurement error in the lab ECG. The mean bias and limit values did not change meaningfully after the exclusion of this outlier (mean bias: −4.12, lower limit: −23.13, upper limit: 14.90).