Introduction
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition affecting 1 in 36 children in the U.S. (CDC, 2022; Mughal et al., 2024). ASD encompasses a wide range of cognitive, social, and communicative challenges, which have a significant impact on the child, their family, and society in general, including areas such as education, the healthcare system, and employment.
Early intervention during the initial years of life emerges as a pivotal factor in the management of ASD, owing to its potential for favorable outcomes facilitated by neuroplasticity (Dawson, 2008). Unfortunately, in a majority of cases, ASD is not detected until after the age of 4 (Baio et al., 2018), leading to a missed critical window for early stimulation. The subtle nature of early signs of autism presents challenges for both clinicians and caregivers. Therefore, any behavioral cues that raise concerns should be diligently combined with clinical assessments to enable the early identification of atypical development. This proactive approach allows for the timely implementation of intervention strategies, thereby improving the child's long-term prognosis and substantially elevating their overall quality of life.
Over the last decade, numerous studies have leveraged Artificial Intelligence (AI) to analyze early signs of ASD in children using various neurophysiological signals, such as eye-tracking (Jones et al., 2023), electroencephalograms (Gabard-Durnam et al., 2019), magnetic resonance imaging (Shen et al., 2022), and functional near-infrared spectroscopy (Conti et al., 2022), to detect potential indicators of ASD before behavioral symptoms appear.
However, while the techniques mentioned above are considered reliable, they involve complex clinical procedures that imply substantial hospital expenditures (Okoye et al., 2023). This fact, combined with the recent success of Deep Learning (DL) techniques applied to far more accessible and low-cost unstructured data signals, has shown promising potential in the domain of early ASD detection (Kim et al., 2023; Kojovic et al., 2021; Manigault et al., 2023). In this context, cry analysis has emerged as a compelling approach for early ASD detection due to its accessibility, non-invasive nature, cost-effectiveness, and ease of recording in both clinical and home settings. It allows for longitudinal assessment, enabling researchers to track developmental changes over time, and is strongly associated with neurodevelopmental conditions (Esposito et al., 2017; Oren et al., 2016). Infant cries provide a unique window into the neurological and physiological state of the infant (Laguna et al., 2023; Laguna, Pusil, Bazán, Laguna et al., 2023a, b; Orlandi et al., 2012), offering the potential to identify early markers of ASD through acoustic features. Research has identified atypical cry patterns in toddlers with ASD under 18 months, with differences observed in acoustic features such as jitter, shimmer, harmonic-to-noise ratio (HNR), and fundamental frequency (F0) (Orlandi, Manfredi, Orlandi et al., 2012a, b; Santos et al., 2013; Sheinkopf et al., 2012; Unwin et al., 2017). Machine learning (ML) algorithms have also been applied to classify these cries, revealing their promising predictive value as an early vocal indicator of ASD (Khozaei et al., 2020; Manigault et al., 2023).
In this study, our primary objective is to determine the distinctive acoustic characteristics present in cries from children aged 18 to 54 months, comparing typically developing (TD) children to those with ASD. Moreover, we aim to assess the potential application of cry analysis using DL techniques to support clinicians in the early detection of ASD. Empowering clinicians with automatic, non-invasive, AI-driven cry-based vocal biomarker tools presents a compelling avenue for enhancing early detection endeavors. This has the potential to significantly advance early intervention strategies, leading to more timely and precisely targeted support for children at risk of ASD, thereby enhancing their developmental trajectories.
Methods
Participants
The study participants were drawn from the cry dataset of Khozaei et al. (2020), which encompassed a total of 62 individuals aged between 18 and 54 months. This cohort was divided into two distinct groups: 31 individuals diagnosed with ASD and 31 TD individuals. Within each group, there were 24 boys and 7 girls. The average ages for the ASD and TD groups were 35.6 and 30.8 months, respectively. The autism diagnosis procedure started with the Gilliam Autism Rating Scale-Second Edition (GARS-2) questionnaire (Samadi & McConkey, 2014), which was answered by the parents. The caregivers were then interviewed, based on the Diagnostic and Statistical Manual of Mental Disorders, 5th edition (DSM-5) (Wiggins et al., 2019), while the participants were evaluated and observed by two Ph.D.-level child clinical psychologists. In addition, the diagnosis of ASD was separately confirmed by at least one child psychiatrist in a different setting. It is important to note that an official Farsi version of the Autism Diagnostic Observation Schedule (ADOS) is not available; thus, different approaches are taken to evaluate participants in Iran (Samadi & McConkey, 2014).
Data Collection
As explained by Khozaei et al. (2020), data was recorded using high-quality devices (74.20%; Sony UX560 and UX512F voice recorders) and smartphones with a custom voice-recording application (25.80%). Recordings were made in WAV format (16-bit, 44.1 kHz) to ensure consistency across devices. A variety of devices and recording locations, such as homes (12.90% of the ASD sample, 45.16% of the TD sample), autism centers (87.10% of the ASD sample), and health centers (54.84% of the TD sample), were used to avoid bias and increase generalizability. Parents and trained voice collectors were instructed to record in quiet environments with the devices held approximately 25 cm from the participant's mouth. Recordings not meeting these conditions, as well as cries associated with pain, were excluded. The reasons for crying differed between the ASD and TD groups, reflecting distinct behavioral and emotional triggers. In both groups, the most common causes were related to complaining or discomfort (74.20% for the ASD group and 67.75% for the TD group), while other factors such as sleepiness, hunger, and anxiety (25.80% for the ASD group and 32.25% for the TD group) also contributed to the crying episodes. Finally, the average number of cry instances per infant was 6.10 ± 5.05 for the ASD group and 5.39 ± 3.66 for the TD group. For more details, see Supplementary Material Table S1.
Ethical Considerations
The study protocol received approval from the ethics committee at Shahid Beheshti University of Medical Sciences and Health Services, Tehran (Iran). Prior to enrollment, comprehensive informed consent was acquired from the parents or legal guardians of the participants. This ensured they were well-informed about the study's objectives, procedures, and potential benefits and risks.
Procedures
For this publicly available dataset, we extracted a range of pitch-based audio features for each cry pattern using the Praat software (Boersma, 2002). The total number of frames depends on both the audio duration and the time step used. Perturbation vocal metrics, including jitter, shimmer, and HNR, were chosen due to their widespread use in clinical contexts (Meghashree & Nataraja, 2019; Teixeira et al., 2013). Jitter measures pitch variability over time by calculating the mean absolute difference between consecutive pitch periods, using a period range from 0.0001 to 0.02 s and a maximum period factor of 1.3 to discard spurious periods caused by noise. Shimmer quantifies the amplitude variation between successive pitch periods, applying the same parameters as jitter but with an added maximum amplitude factor of 1.6 to capture shimmer across each audio frame. HNR quantifies the periodicity of the voice signal, with higher values indicating more periodic (voiced) sounds and lower values suggesting noisier (aperiodic) signals. HNR was calculated using an autocorrelation-based method in Praat, analyzing the signal in short, overlapping frames. For more details, see Supplementary Material.
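For illustration, the following is a minimal sketch of this extraction using the praat-parselmouth Python interface to Praat. The period range (0.0001 to 0.02 s), maximum period factor (1.3), and maximum amplitude factor (1.6) follow the values above; the pitch floor and ceiling (75 to 600 Hz) and the harmonicity frame settings are our assumptions, not values stated in the text.

import parselmouth
from parselmouth.praat import call

def perturbation_features(path: str) -> dict:
    """Jitter, shimmer, and HNR for one cry recording via Praat."""
    snd = parselmouth.Sound(path)
    # Pitch floor/ceiling of 75-600 Hz is an illustrative assumption.
    point_process = call(snd, "To PointProcess (periodic, cc)", 75, 600)
    # Period range 0.0001-0.02 s and maximum period factor 1.3, as in the text.
    jitter = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
    # Shimmer uses the same parameters plus a maximum amplitude factor of 1.6.
    shimmer = call([snd, point_process], "Get shimmer (local)",
                   0, 0, 0.0001, 0.02, 1.3, 1.6)
    # Autocorrelation-based harmonicity over short, overlapping frames.
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    hnr = call(harmonicity, "Get mean", 0, 0)
    return {"jitter": jitter, "shimmer": shimmer, "hnr": hnr}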
Statistical Analysis
Following the extraction of quantitative features, an exploratory analysis was conducted to discern statistically significant differences between the studied groups for each feature. P-values were computed using the Mann-Whitney U test for independent samples. Results are reported as mean ± standard error of the mean (SEM), and statistically significant p-values are coded as follows: ***p ≤ 0.001, **p < 0.01, and *p < 0.05.
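A minimal sketch of this comparison, assuming per-cry-instance feature arrays for each group (function and variable names are illustrative):

import numpy as np
from scipy.stats import mannwhitneyu

def compare_feature(asd: np.ndarray, td: np.ndarray) -> None:
    """Mann-Whitney U test between groups, reported as mean ± SEM with star coding."""
    _, p = mannwhitneyu(asd, td, alternative="two-sided")
    stars = "***" if p <= 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else "n.s."
    for name, g in (("ASD", asd), ("TD", td)):
        # SEM = sample standard deviation / sqrt(n)
        print(f"{name}: {g.mean():.3f} ± {g.std(ddof=1) / np.sqrt(len(g)):.3f}")
    print(f"p = {p:.4g} {stars}")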
Deep Learning Classification Analysis
To demonstrate the potential of DL techniques for automated classification of cry patterns into ASD and TD, we trained a Recurrent Convolutional Neural Network (R-CNN) from scratch. This hybrid architecture combines the strengths of Convolutional Neural Networks (CNNs), which excel at extracting spatial features (Krizhevsky et al., 2017; LeCun et al., 2015), and Long Short-Term Memory (LSTM) recurrent networks, which capture extended temporal relationships in sequential data (Bahdanau et al., 2016; Graves, 2014). Hybrid CNN-LSTM models have been successfully applied in tasks with a sequential component, such as video and speech recognition (Donahue et al., 2016).
The R-CNN architecture uses the extracted spectrographic information as image representations, which serve as input to the model. Input images of size 128 × 128 with a single channel (grayscale) are processed through a CNN block with a kernel size of 3 and 32 output channels, capturing spatial features.
This architecture consists of a CNN with three convolutional layers (comprising 32, 64, and 64 filters, respectively) and two LSTM layers with 288 neurons, followed by one fully-connected layer with 128 units; a minimal sketch is given below. For the purpose of this study, 80% of the available dataset was used for model training, while the remaining 20% was set aside to assess model performance and derive validation metrics. The data distribution includes 140 samples for TD and 147 samples for ASD in the training set, while the testing set comprises 28 samples for TD and 44 samples for ASD.
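For concreteness, the following is a minimal PyTorch sketch of such an architecture. The layer sizes follow the text (three convolutional layers with 32, 64, and 64 filters and kernel size 3; two 288-unit LSTM layers; a 128-unit fully-connected layer), while the pooling, activations, and the CNN-to-LSTM reshaping are our illustrative assumptions.

import torch
import torch.nn as nn

class CryRCNN(nn.Module):
    """Sketch of a CNN-LSTM (R-CNN) for 128 x 128 grayscale spectrograms."""

    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )  # (B, 1, 128, 128) -> (B, 64, 16, 16)
        # Treat the time axis as the sequence dimension; nn.LSTM's dropout
        # (here 0.2, as in the text) applies between stacked layers.
        self.lstm = nn.LSTM(input_size=64 * 16, hidden_size=288,
                            num_layers=2, dropout=0.2, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(288, 128), nn.ReLU(), nn.Linear(128, 1),
        )

    def forward(self, x):
        f = self.cnn(x)                       # (B, 64, 16, 16)
        f = f.permute(0, 3, 1, 2).flatten(2)  # (B, 16 time steps, 1024 features)
        out, _ = self.lstm(f)
        return self.head(out[:, -1])          # logit for BCE-with-logits loss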
To initialize the weights of our R-CNN model, we employed the Kaiming uniform initialization technique (He et al., 2015), known to promote stable convergence during training. To mitigate the risk of overfitting, we implemented multiple regularization techniques. Specifically, we employed Dropout (Srivastava et al., 2014) with a rate of 0.2 in each of the LSTM layers. In addition, to enhance generalization, we integrated two data augmentation methods, namely Frequency Masking and Time Masking (Park et al., 2019), which were randomly applied to the training split.
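A minimal sketch of this augmentation using torchaudio's SpecAugment-style transforms; the mask widths and the application probability are our assumptions, not values given in the text.

import random
import torch
import torchaudio.transforms as T

# Frequency and time masking (Park et al., 2019); widths are illustrative.
freq_mask = T.FrequencyMasking(freq_mask_param=15)
time_mask = T.TimeMasking(time_mask_param=20)

def augment(spec: torch.Tensor) -> torch.Tensor:
    """Randomly mask a (1, freq, time) training spectrogram."""
    if random.random() < 0.5:
        spec = freq_mask(spec)
    if random.random() < 0.5:
        spec = time_mask(spec)
    return spec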
The Adam optimizer (Kingma & Ba, 2017) was utilized for gradient descent, coupled with a cyclic learning rate (lr) scheduler (Smith, 2017). The base lr was set to 10e-6, with a maximum lr of 10e-5. To strike a balance between computational efficiency and model convergence, we established a batch size of 16. Throughout the training process, we monitored both the Binary Cross Entropy (BCE) loss and accuracy for the training and validation sets.
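In PyTorch terms, this setup might look as follows; the step size and the use of BCEWithLogitsLoss on raw logits are our assumptions.

import torch

model = CryRCNN()  # the sketch above
criterion = torch.nn.BCEWithLogitsLoss()  # BCE applied to raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=10e-6)
# Cycle the lr between the base (10e-6) and maximum (10e-5) values from
# the text; cycle_momentum=False because Adam exposes no 'momentum'
# parameter, and step_size_up is an illustrative choice.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=10e-6, max_lr=10e-5,
    step_size_up=200, cycle_momentum=False)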
The training process spanned 2000 epochs, with the model exhibiting the highest validation accuracy designated as the final trained classifier. All aspects of the training procedure, including data preprocessing and model optimization, were implemented using the PyTorch v2.0.1 framework.
DL performance metrics were reported in terms of accuracy (the proportion of true results, both true positives and true negatives, among the total number of cases examined), sensitivity (the proportion of ASD instances correctly identified by the model as true positives), and specificity (the proportion of TD instances correctly identified by the model as true negatives). Additional DL methods were tested; for more details, refer to the Supplementary Material.
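Concretely, these metrics follow directly from the confusion-matrix counts (a sketch, with ASD treated as the positive class):

def report_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, sensitivity, and specificity with ASD as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate on ASD cries
        "specificity": tn / (tn + fp),  # true negative rate on TD cries
    }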
Discussion
The current study delved into the potential of DL techniques to differentiate between the cries of children diagnosed with ASD and those of TD children, within an age range of 18 to 54 months. First, we aimed to objectively identify characteristic differences in cry patterns between the two groups through the analysis of audio features, including attributes such as jitter, shimmer, and HNR. The second aim was to establish the feasibility of leveraging an AI-based automatic system to accurately identify these characteristic differences in the spectrograms of cry patterns.
Our study revealed notable differences between the ASD and TD groups in various frequency-based cry features. Specifically, the ASD group exhibited increased levels of jitter and shimmer coupled with a reduced HNR when contrasted with the TD group. Our results are consistent with previous research (Santos et al., 2013), which also identified increased levels of jitter and shimmer, along with reduced HNR, as distinguishing acoustic features in children with ASD, and integrated them into an ML model to classify the ASD and TD groups. That study further emphasizes that these vocal quality differences, linked to breathiness, hoarseness, and roughness, can serve as early biomarkers for ASD or even other disorders or pathologies (Meghashree & Nataraja, 2019; Santos et al., 2013; Teixeira & Fernandes, 2015).
Consequently, these findings indicate distinctive cry characteristics associated with children with ASD, pointing to the potential value of these features in effectively discerning between the two groups during early-stage ASD screening and assessment.
Regarding the automatic classification of ASD and TD cries, previous research (Khozaei et al., 2020; Motlagh et al., 2013) with the same dataset used ML pattern recognition algorithms to distinguish between ASD and TD cry patterns within the age range of 2 to 3 years. They extracted various audio features, such as temporal, energy, harmonic, perceptual, and spectral features. Their SVM model showed an average accuracy of 89.3% across both genders (Khozaei et al., 2020). Notwithstanding, our study pioneers the use of DL for the classification of ASD and TD cries without gender differentiation, predicting whether a cry belongs to an autistic or a neurotypical child with a precision of 90.28%, even with a very small dataset.
To translate this approach into clinical practice, the proposed tool could be integrated as a complementary aid in pediatric evaluations or as an at-home screening resource for caregivers. Its non-invasive nature and ease of use would allow for continuous remote monitoring, potentially enhancing early detection and facilitating timely interventions. However, successful implementation would require clinician training and addressing ethical considerations. Further research should focus on real-world testing and integration strategies to maximize clinical utility and acceptance.
Limitations
While our findings highlight the potential of using cry acoustic features and DL for early autism identification, the study's generalizability is limited by the sample size, demographic variation, and the specific age range of participants (18 to 54 months). Future research should include larger, more diverse populations across different age groups and cultural backgrounds to validate the model's performance in varying developmental stages and contexts. Additionally, cry characteristics may evolve as children grow, potentially influencing model accuracy, making it essential to explore age-specific models. Another limitation is that the autism diagnosis in this study was conducted using the GARS-2 parental questionnaire and a DSM-5-based parental interview by child clinical psychologists, with independent confirmation by a child psychiatrist. However, the widely used ADOS tool was not administered, as it lacks an official Farsi translation and is therefore less commonly used in Iran. Expanding the dataset and refining diagnostic methodologies will be crucial for improving the robustness and applicability of this approach in broader populations.