Introduction
Autism Spectrum Disorder (ASD) is a common neurodevelopmental condition that impacts social functioning, emotions, and behaviors, with lifelong implications (American Psychiatric Association,
2013). The global prevalence of ASD in children is estimated to be 1% (Zeidan et al.,
2022), and evidence points to a growing trend in autism diagnoses (Talantseva et al.,
2023). According to 2020 surveillance conducted by the CDC, the prevalence of ASD at 8 years of age was reported to be 2.76%, or approximately 1 in every 36 children (Maenner et al.,
2023). Factors contributing to this rise may include heightened awareness, wider availability of diagnostic tools, improved access to healthcare services, and a broader diagnostic definition of autism (Zeidan et al.,
2022). The diverse nature of ASD, along with varying developmental stages, gender differences, comorbidities, and intellectual functioning levels, can make diagnosis challenging (Daniels & Mandell,
2014). Early diagnosis is essential for achieving effective treatment outcomes (Corsello,
2005; Estes et al.,
2015). However, in a nationally representative sample of children with ASD in the United States, the average time between referral and diagnosis was found to be approximately 2.7 years (Zuckerman et al.,
2015). This delay in diagnosis significantly hinders early intervention during critical developmental periods, leading to substantial setbacks (Zwaigenbaum et al.,
2015).
In Türkiye, the healthcare system operates on a tiered structure where general practitioners or pediatricians typically refer suspected ASD cases to child and adolescent psychiatrists. This referral process is integrated into the Ministry of Health’s Autism Screening Program, which employs the Modified Checklist for Autism in Toddlers (M-CHAT) to identify children requiring further evaluation (Dursun et al.,
2022). Parents also have the option to directly schedule appointments with child and adolescent psychiatrists, bypassing the referral process. However, the limited number of specialists relative to the population (OECD,
2025) and socio-demographic disadvantages lead to delays in evaluation and diagnosis (Yaylaci & Guller,
2023). In addition, policies aimed at reducing wait times result in shorter evaluation sessions in clinics, typically averaging 20 min (Republic of Türkiye Ministry of Health, 2025). This time constraint limits the feasibility of employing comprehensive diagnostic guidelines and tools for ASD that require extended assessment durations. Consequently, clinicians primarily rely on widely accepted frameworks, such as the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) and the International Classification of Diseases, 11th Revision (ICD-11), to guide the diagnostic process in clinical practice, with tools like the Schedule for Affective Disorders and Schizophrenia for School-Aged Children: Present and Lifetime Version (K-SADS-PL DSM-5) being utilized predominantly in research settings.
The current gold standard approach to autism diagnosis is the tiered application of semi-structured methods, such as ADOS-2 (Lord et al.,
2012) and ADI-R (Rutter et al.,
2003), supplemented by clinical interviews based on DSM-5. However, the practicality and ease of use of these instruments are significantly limited in Türkiye by many factors including high costs, time-consuming procedures, licensing restrictions, and extensive training requirements. Moreover, these costly diagnostic tools are not necessary for every case, especially in clinical settings. Parent-based observation tools for autism assessment and screening, such as Modified Checklist for Autism in Toddlers (M-CHAT) (Kondolot et al.,
2016; Robins et al.,
2001), Autism Behavior Checklist (ABC) (Krug et al.,
1980; Ozdemir et al.,
2013), Autism Spectrum Screening Questionnaire (ASSQ) (Ehlers et al.,
1999; Kose et al.,
2015) and Social Communication Questionnaire (SCQ) (Avcil et al.,
2015; Berument et al.,
1999), are valid and reliable tools for the Turkish population, though self-reports are known to have limitations, such as over-reporting as well as under-reporting (Johnson et al.,
2009). These screening tools are frequently utilized at all levels of care for both clinical evaluation and research purposes; however, by their nature, they are not designed to serve as diagnostic tools.
Recently, interactive screening tools like the Rapid Interactive Screening Test for Autism (RITA) (Choueiri & Wagner,
2015) have been validated in Turkish samples as second-step screening instruments to better identify high-risk populations (Kadak et al.,
2024). However, the RITA is designed for use by non-experts, rather than clinicians, to facilitate triage. The Childhood Autism Rating Scale (CARS) is widely used by specialists to aid ASD diagnosis (Gassaloğlu et al.,
2016). Initially developed to distinguish between developmental delay, such as intellectual disabilities, and ASD, CARS is also used for research purposes. Clinicians employ CARS to evaluate various areas in addition to specific ASD-related items (Schopler et al.,
1980). However, compared to the Autism Mental Status Exam (AMSE), which is seamlessly integrated into clinical observation and requires approximately 5–10 min for scoring, the CARS demands significantly more time to be used effectively, typically ranging from 20 to 30 min depending on the clinician’s level of expertise (Kılınç et al.,
2019). The shorter administration time may be attributed to AMSE’s fewer items, simplified Likert scale, reliance on rapid observational assessment, and more practical scoring system compared to CARS. In addition to these challenges regarding ASD diagnosis, the competence and experience of healthcare professionals other than child and adolescent psychiatrists in Türkiye are also a subject of concern. Although this is debated (Tamur & Şen Celasin, 2022), some studies have proposed that primary care physicians often lack the expertise and face significant challenges in effectively managing cases with ASD (Alpdoğan et al.,
2023).
The AMSE was developed by Grodberg et al. (
2012) through collaborative efforts involving child psychiatrists, behavioral pediatricians, and child neurologists as a diagnostic assessment tool based on observational cues for the core symptoms of ASD, including social interaction, communication skills, and behavioral patterns (Grodberg et al.,
2012). Unlike parent-report screening tools such as the M-CHAT, ABC, ASSQ or SCQ, which rely on caregiver observations, the AMSE is a clinician-administered tool that examines both direct behavioral symptoms and caregiver reports, providing a structured, real-time assessment of ASD symptoms. Additionally, the AMSE helps standardize the clinical observation process, ensuring a more consistent and objective evaluation across different clinical settings. The AMSE is a brief, easy-to-use, and free observational tool that does not add to clinicians’ workload, supporting ASD diagnosis. In addition to its original development studies in the USA (Grodberg et al.,
2012,
2014,
2016), the validity of AMSE has been demonstrated in several countries, including France (Pagnier & Chaste,
2022), Norway (Øien et al.,
2018,
2020), Sweden (Cederlund,
2019), Chile (Irarrázaval et al.,
2023), China (Yang et al.,
2023), and Brazil (Galdino et al.,
2020). AMSE has also been successful in achieving diagnostic differentiation with high specificity and sensitivity for developmental delay (Betz et al.,
2019), ADHD (Øien et al.,
2020), and anxiety comorbidities (Arnold et al.,
2016) both in individuals with and without ASD. The psychometric properties of AMSE have been examined in different languages and populations; however, no studies have previously investigated its psychometric properties in Türkiye. This study is the first to assess its validity, reliability, and cut-off values in the Turkish population.
In Türkiye, there is a lack of diagnostic observation tools that standardize ASD examinations in clinical practice while being easy to apply, low-cost, and time-efficient. Globally, there is a need for quick, practical observation tools that integrate feedback without increasing the clinician’s workload in the diagnosis of ASD. This study aims to examine the psychometric properties of AMSE, including its validity, reliability, interrater reliability, and cut-off scores, in Turkish children with suspected ASD.
Discussion
AMSE is a valuable diagnostic aid that can incorporate caregiver reports along with observation-based measurements without adding an extra burden to clinical practice (Grodberg et al.,
2012). The results of the present study demonstrated that the AMSE can effectively differentiate ASD symptoms with high sensitivity and specificity, particularly when utilizing a cut-off score of 4. It showed excellent inter-rater reliability and temporal stability. This study is the first to examine the psychometric properties of the AMSE using a Turkish sample, establishing it as a valid and reliable diagnostic observation tool.
When the results of multidisciplinary evaluations based on the DSM-5 criteria were considered as the comparison standard for diagnosis, the optimal cut-off value was found to be 4, with a sensitivity of 84.16% and specificity of 97.03%. With a cut-off of 5, the sensitivity decreased to 77.23%, but the specificity increased to 99.5%. Similarly, Galdino et al. (
2020) found that a cut-off value of 4 was optimal for a Brazilian sample of 260 children, reporting a sensitivity of 91% and specificity of 98% (Galdino et al.,
2020). In addition to the numbers being relatively close to each other, notable methodological similarities include the diagnostic processes being similar and the control groups consisting of clinical samples. However, the age range in the Brazilian sample also included adolescents older than those in our population (
M = 9.1;
SD = ± 3.4). Although the majority of previous studies reported a cut-off value of 5 (Betz et al.,
2019; Grodberg et al.,
2012,
2014; Øien et al.,
2020; Pagnier & Chaste,
2022), there were also studies that reported cut-off values of 6 (Grodberg et al.,
2016; Irarrázaval et al.,
2023; Yang et al.,
2023) and 7 (Cederlund,
2019). The variation in cut-off scores across different studies does not appear to be solely attributable to the age characteristics of the sample. For example, Grodberg et al. (
2016) reported an optimal cut-off value of 6 in a population with a mean age of 41.1 months (± 12.5 months) which is comparable to the mean age in our study of 35.5 months (± 19.04 months) (Grodberg et al.,
2016). Similarly, a study conducted with a preschool sample in Sweden identified an optimal cut-off value of 7 (Cederlund,
2019). Pagnier and Chaste (
2022) argued that the optimal cut-off value varies across different age and sex groups, with these parameters significantly influencing cut-off scores, whereas adaptive behaviors showed no such effect. Their study revealed age-related variations in optimal cut-off values, with a cut-off of 6 (sensitivity: 0.92, specificity: 0.67) for participants aged 0 to 36 months, a cut-off of 5 (sensitivity: 0.98, specificity: 0.62) for those aged 3 to 6 years, and a cut-off of 6 (sensitivity: 0.86, specificity: 0.67) for participants aged 6 years and older, indicating that the optimal cut-off does not change linearly with age. They further suggested that a specific range between scores of 3 and 9 might be more suitable for in-depth examination and guidance. The authors proposed that cut-off scores could support expert clinicians during clinical interviews and facilitate the direct referral of cases within certain ranges to tertiary healthcare centers by primary care health professionals, thereby reducing the need for additional testing and preventing diagnostic delays (Pagnier & Chaste,
2022). Irarrázaval et al. (
2023) argued that the diagnostic capacity of AMSE is more effective in younger individuals and those with lower language abilities, despite the cut-off value remaining consistent across different verbal skills (Irarrázaval et al.,
2023). Although definitive conclusions cannot be drawn from the current data, discrepancies in specificity, sensitivity, and optimal cut-off across studies may be attributed to variations in sex, sample size, sample characteristics (including comorbidities), diagnostic instruments and scales used, as well as the level of expertise among practitioners. This study consists of a sample of participants identified as suspected ASD cases, the majority of whom were referred due to neurodevelopmental complaints such as speech delay. While this aligns with the AMSE’s intended target population, it should be noted that the psychometric properties of the AMSE may have been influenced by these parameters, which were outside the scope of this study. Since this study was conducted with a sample consisting of individuals suspected of having ASD, caution is warranted when attempting to generalize the findings to the entire population. Collectively, these findings indicate solid sensitivity and specificity values in the Turkish sample. Well-designed future studies with rigorous control of confounding factors are needed to clarify variations in the results.
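To make the cut-off comparison above concrete, the following is a minimal sketch of how sensitivity and specificity can be computed for candidate cut-offs on a total score. It assumes that a score at or above the cut-off counts as a positive screen and that the "optimal" cut-off maximizes Youden's J (sensitivity + specificity - 1); the source does not state which criterion was used, so that choice is an assumption. The function names and the toy scores are hypothetical and are not study data.

```python
def sens_spec(scores, has_asd, cutoff):
    """Sensitivity and specificity when score >= cutoff counts as positive."""
    tp = sum(1 for s, d in zip(scores, has_asd) if d and s >= cutoff)
    fn = sum(1 for s, d in zip(scores, has_asd) if d and s < cutoff)
    tn = sum(1 for s, d in zip(scores, has_asd) if not d and s < cutoff)
    fp = sum(1 for s, d in zip(scores, has_asd) if not d and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(scores, has_asd, candidates):
    """Pick the candidate cut-off with the largest Youden's J (assumed criterion)."""
    def youden(cutoff):
        sens, spec = sens_spec(scores, has_asd, cutoff)
        return sens + spec - 1.0
    return max(candidates, key=youden)


# Toy, made-up total scores (NOT study data): 8 ASD and 8 non-ASD cases.
toy_scores = [5, 6, 4, 7, 3, 8, 4, 9, 0, 1, 2, 3, 1, 0, 2, 3]
toy_has_asd = [True] * 8 + [False] * 8

toy_best = best_cutoff(toy_scores, toy_has_asd, candidates=range(1, 10))
toy_sens, toy_spec = sens_spec(toy_scores, toy_has_asd, toy_best)
```

Raising the cut-off by one point in this toy set lowers sensitivity while specificity stays at its ceiling, which is the same trade-off described for cut-offs of 4 and 5 in the present sample.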
The internal consistency of the AMSE was very good (Cronbach’s alpha = 0.8). This value is one of the highest among the studies conducted, as previous studies have mostly reported values within the good (Galdino et al.,
2020; Grodberg et al.,
2012) and acceptable ranges (Arnold et al.,
2016; Irarrázaval et al.,
2023; Pagnier & Chaste,
2022; Yang et al.,
2023). When the Language Pragmatics item was removed, this value increased to 0.83. Upon examining item-rest correlations, the lowest value was for Language Pragmatics, while the highest was for Interest in Others, in accordance with previous results (Arnold et al.,
2016; Galdino et al.,
2020; Pagnier & Chaste,
2022). AMSE instructs the observer to skip item 5 if the language item is scored 0 (
https://autismmentalstatusexam.com/index). Irarrázaval et al. (
2023) reported that they removed item 5 from the internal consistency analysis because it could not be scored by more than half of the participants with ASD (Irarrázaval et al.,
2023). Assessing the pragmatic use of language is often not possible, both in young children and in many cases where ASD severely affects language development. The majority of children with ASD scored 0 points on this item, leading to low variability. Additionally, this item includes both feedback and observation-based components. This may explain why it contributed minimally to the internal consistency of the scale. Overall, the results indicated that the internal consistency of the AMSE has been confirmed to varying degrees in different translations and populations.
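The effect of dropping a single item on internal consistency can be illustrated with a short sketch of Cronbach's alpha and "alpha if item deleted". It uses the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), with sample variances. The function names and the three-item toy matrix are hypothetical and do not reproduce the study's data or software.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-subject item-score lists."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)


def alpha_if_deleted(rows):
    """Alpha recomputed after removing each item in turn."""
    k = len(rows[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(row) if j != drop] for row in rows])
        for drop in range(k)
    ]


# Toy, made-up scores (NOT study data): items 1 and 2 agree perfectly,
# item 3 is only loosely related to them.
toy_items = [
    [0, 0, 0],
    [1, 1, 1],
    [2, 2, 0],
    [1, 1, 1],
    [0, 0, 0],
    [2, 2, 1],
]

toy_alpha = cronbach_alpha(toy_items)
toy_alpha_dropped = alpha_if_deleted(toy_items)
```

In this toy matrix, removing the weakly related third item raises alpha, which mirrors the pattern reported above for the Language Pragmatics item.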
In this study, the structural validity of the AMSE was examined through exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). To the best of our knowledge, no previous research has investigated the factorial structure of the AMSE. The results of both EFA and CFA support a two-factor model, which aligns with the theoretical foundations of the AMSE as conceptualized within the DSM framework. This structure reinforces the distinction between social-communicative and behavioral-sensory aspects of ASD symptomatology. Empirical validation of the factor structure confirms the robustness and psychometric soundness of the instrument in assessing the two core domains of ASD.
To assess the temporal stability of the scale, a test-retest procedure was conducted, and repeated measurements showed excellent correlation (
ICC = 0.959). This result indicates that the scale is a reliable instrument for distinguishing between the ASD and non-ASD groups over time. When the ICC is 0.9 or higher, a minimum of 6 cases meets the statistical criteria with 90% power and a significance level of
p = 0.05 (Bujang & Baharum,
2017). Evaluating the test-retest results with a sample of 61 participants, including 20 ASD cases, provides strong evidence for the longitudinal stability of the scale, demonstrating its robust reliability over time. To the best of our knowledge, previous studies have not evaluated its test-retest reliability, making this study the first to do so. However, to generalize these results to different samples and populations, replication of the results is warranted.
Additionally, the inter-rater reliability of this tool was excellent (
ICC = 0.997). In previous studies examining the psychometric properties of the AMSE, inter-rater assessment was also found to be good to excellent (Arnold et al.,
2016; Grodberg et al.,
2012; Yang et al.,
2023). This indicates the success of the test in standardizing observable behaviors and enhancing the validity of the results across different practitioners in diagnosis and scientific research. These results may stem from the fact that studies are generally conducted by expert teams in ASD and the guidelines consist of familiar clinical observations. It is important to replicate these findings, especially among primary healthcare professionals, such as psychologists, child development specialists, and family physicians, to validate the results. Therefore, a careful interpretation is warranted.
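As an illustration of how an intraclass correlation coefficient summarizes agreement between raters, the sketch below computes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1), from the usual ANOVA mean squares. The source does not state which ICC form was used, so this choice is an assumption, and the function name and toy ratings are hypothetical rather than study data.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per subject, each holding k rater scores.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Toy ratings (NOT study data): 6 subjects scored by 4 raters who differ
# systematically in how high they score.
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
toy_icc = icc_2_1(toy_ratings)

# Two raters in perfect agreement give the maximum value.
perfect_icc = icc_2_1([[1, 1], [3, 3], [5, 5]])
```

Because this form penalizes systematic differences between raters, the toy ratings yield a low coefficient even though the raters rank subjects similarly; values near 1, as reported above, require raters to assign nearly identical scores.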
In the validation results of this study, AMSE showed an excellent correlation with CARS (
r = 0.94;
p < 0.001) in the overall sample. Previous studies have found significant and strong relationships between CARS scores and AMSE scores to varying degrees, such as 0.44 (with CARS-2) (Arnold et al.,
2016), 0.70 (with CARS) (Pagnier & Chaste,
2022), 0.74 (with CARS) (Yang et al.,
2023) and 0.91 (with CARS-BR) (Galdino et al.,
2020). The results of this study are similar to those of a study conducted in Brazil, which may be due to methodological similarities and sample size (Galdino et al.,
2020). The differences between the study findings may be due to the use of different versions of the CARS and the presence or absence of cases accompanied by developmental delay. ASD exhibits a clinically diverse and heterogeneous phenotype influenced by multiple factors, including age, coexisting conditions such as language impairments and developmental delay, as well as the severity of the ASD itself. Consequently, the AMSE may demonstrate varying psychometric properties across populations characterized by these differing variables. Further studies should be planned by homogenizing groups to achieve the most efficient psychometric measurement tools.
In this study, the correlation between AMSE and CARS was also analyzed separately for ASD and non-ASD groups. The correlation between AMSE and CARS scores was found to be strong in the overall sample (r = 0.94); however, when analyzed separately, the correlation remained high in the ASD group (r = 0.88) but was weaker in the non-ASD group (r = 0.45). This discrepancy may stem from several statistical factors, including between-group differences and variance restrictions. A key explanation for this pattern is the large between-group effect size which likely inflated the overall correlation by amplifying the contrast between ASD and non-ASD participants. Additionally, within the non-ASD group, the restricted range of CARS scores may have contributed to reduced observable variability, consequently weakening the correlation—a phenomenon known as range restriction. It is important to note that the lower correlation within the non-ASD group does not necessarily imply a lack of validity but rather reflects the statistical effects of variance differences and range restriction.
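The statistical argument above can be checked with a small simulation. The sketch below generates purely synthetic scores under assumed parameters: both groups share the same linear relation between the two measures, but the hypothetical non-ASD group has a low mean and narrow spread on the first measure, while the hypothetical ASD group has a higher mean and wider spread. None of the means, spreads, slopes, or sample sizes come from the study; they are chosen only to show how pooling well-separated groups inflates the overall correlation and how a restricted range lowers the within-group one.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_group(rng, n, mean_x, sd_x):
    """Synthetic (x, y) pairs: y = 20 + x + noise, same relation in every group."""
    x = [rng.gauss(mean_x, sd_x) for _ in range(n)]
    y = [20.0 + xi + rng.gauss(0.0, 2.0) for xi in x]
    return x, y


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Assumed, illustrative parameters (NOT estimates from the study).
non_x, non_y = simulate_group(rng, 2000, mean_x=1.0, sd_x=1.0)  # narrow range
asd_x, asd_y = simulate_group(rng, 2000, mean_x=6.0, sd_x=2.5)  # wide range

r_non = pearson(non_x, non_y)
r_asd = pearson(asd_x, asd_y)
r_all = pearson(non_x + asd_x, non_y + asd_y)
```

Under these assumptions the within-group correlation is clearly lower in the narrow-range group than in the wide-range group or in the pooled sample, even though the underlying relation between the two measures is identical everywhere.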
This study was conducted with children referred to our clinic for further assessment owing to the risk of ASD. In the ASD group, the AMSE scores were found to be significantly higher after controlling for age (
p < 0.001;
Cohen’s d = 1.40). Similar results were found in studies that included a control group (Cederlund,
2019; Øien et al.,
2018; Pagnier & Chaste,
2022). Betz et al. (
2019) showed that AMSE could distinguish between children with developmental delay and those with ASD, with significantly higher AMSE scores in children with ASD (Betz et al.,
2019). Although it may not be possible to assess the items’ sub-validity and differentiation due to the evaluation of total AMSE scores, the consistent results contribute to the psychometric strength of the tool. These findings demonstrate that AMSE’s diagnostic differentiation remains robust and independent of age effects.
In terms of clinical implications, the strong psychometric properties of the AMSE make it a valuable tool for diagnostic assessments conducted by child and adolescent mental health specialists in Türkiye, particularly due to its practicality and efficiency in short examination times. While it effectively supports diagnostic decisions and exclusions in many cases, its use as a screening tool by non-specialists remains unsupported by evidence, despite suggestions from some authors (Galdino et al.,
2020; Pagnier & Chaste,
2022). It is important to note that while the AMSE score may suggest an autism diagnosis, it is not sufficient on its own, regardless of case complexity, to establish a definitive diagnosis of autism. The tool was specifically developed for professionals with ASD expertise, and not for primary care providers. In consideration of the objectives of AMSE’s development and the studies conducted, its most appropriate application context appears to be assisting in ASD diagnosis in suspected cases.
The AMSE offers a valuable contribution to ASD assessment by bridging the gap between comprehensive diagnostic tools and parent-report screening measures. It provides a clinician-administered, rapid, and observation-based evaluation of ASD symptoms, integrating both direct behavioral observations and caregiver reports. Unlike self-report tools, the AMSE minimizes reporting bias and facilitates real-time assessment, enhancing diagnostic accuracy. Its structured approach helps standardize clinical observations, promoting consistency while remaining an accessible, practical, and efficient tool for ASD assessment. We recommend that future research focus on evaluating its reliability across healthcare professionals with varying levels of expertise using standardized methodologies, particularly assessing its feasibility for general practitioners and determining the extent of training required for its effective implementation. In addition, further research is needed on the AMSE’s item-level functioning, particularly for items such as Language Pragmatics, to explore item-level discrimination and the ability to detect varying ASD severity levels. We recommend conducting studies with larger and more diverse samples and utilizing advanced analytical methods, such as item response theory analyses, including test information curves, to enhance measurement precision. Finally, the cost-effectiveness of AMSE in non-expert settings remains a concern and should be further investigated. While the AMSE’s online training platform may offer a potential solution, its impact on clinical accuracy and utility has yet to be systematically evaluated.
This study has several strengths, including a relatively large sample size, assessment of validity over time, verification of structural validity through further exploration, and evaluation of psychometric properties in children within the common diagnostic age range for suspected ASD. However, this study has some notable limitations. The primary limitation is that semi-structured diagnostic tools, such as the ADOS-2 and ADI-R, were not utilized, although the diagnoses were made by a multidisciplinary team. Additionally, the cases were categorized solely into two groups, ASD and non-ASD, and the study did not include an assessment of comorbidities in ASD cases or further diagnostic evaluations in non-ASD cases. This constitutes a limitation in evaluating how the AMSE is affected by factors such as language, intelligence, and social-emotional development. Another limitation is that the test-retest process was conducted by the same clinician who administered the initial diagnostic interview, with a three-week interval between assessments. This may have introduced recall bias, as the clinician’s prior knowledge of the initial evaluation findings could have influenced the AMSE scoring during the second assessment, particularly for parent-reported behaviors.