Modality-Order Matters! A Novel Hierarchical Feature Fusion Method for CoSAm: A Code-Switched Autism Corpus

Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder that presents a spectrum of difficulties in social interaction and communication, along with repetitive behaviors expressed across different situations. Its increasing prevalence underscores the importance of ASD as a major public health concern and the need for comprehensive research to advance our understanding of the disorder and of methods for its early detection. This study introduces a novel hierarchical feature fusion method aimed at enhancing the early detection of ASD in children through the analysis of code-switched speech (English and Hindi). The method integrates acoustic, paralinguistic, and linguistic information using Transformer Encoders. This fusion strategy is designed to improve classification robustness and accuracy, both of which are crucial for early and precise ASD identification. The methodology involves collecting a code-switched speech corpus, CoSAm, from children diagnosed with ASD and a matched control group. The dataset comprises 61 voice recordings, 30 from children diagnosed with ASD and 31 from neurotypical children, aged between 3 and 13 years, for a total of 159.75 minutes of audio. The feature analysis focuses on MFCCs and an extensive set of statistical attributes to capture the variability and complexity of speech patterns. The best model performance, an accuracy of 98.75%, is achieved with a hierarchical fusion that first combines acoustic and linguistic features and then adds paralinguistic features.
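
The fusion order described above (acoustic and linguistic first, paralinguistic second) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class names `FusionStage` and `HierarchicalFusion`, the feature dimensions (40, 768, 88), the model width, the mean pooling, and the two-class output head are all assumptions chosen for illustration only.

```python
import torch
import torch.nn as nn


class FusionStage(nn.Module):
    """Fuse two feature vectors by treating them as a 2-token sequence.

    Hypothetical building block: one Transformer encoder layer followed
    by mean pooling over the two tokens.
    """

    def __init__(self, d_model: int = 64, nhead: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        tokens = torch.stack([a, b], dim=1)       # (batch, 2, d_model)
        return self.encoder(tokens).mean(dim=1)   # (batch, d_model)


class HierarchicalFusion(nn.Module):
    """Stage 1 fuses acoustic + linguistic; stage 2 adds paralinguistic."""

    def __init__(self, acoustic_dim: int, linguistic_dim: int,
                 paralinguistic_dim: int, d_model: int = 64):
        super().__init__()
        # Project each modality to a shared width (assumed design choice).
        self.proj_acoustic = nn.Linear(acoustic_dim, d_model)
        self.proj_linguistic = nn.Linear(linguistic_dim, d_model)
        self.proj_paralinguistic = nn.Linear(paralinguistic_dim, d_model)
        self.stage1 = FusionStage(d_model)
        self.stage2 = FusionStage(d_model)
        self.classifier = nn.Linear(d_model, 2)   # ASD vs. neurotypical

    def forward(self, acoustic, linguistic, paralinguistic):
        first = self.stage1(self.proj_acoustic(acoustic),
                            self.proj_linguistic(linguistic))
        second = self.stage2(first, self.proj_paralinguistic(paralinguistic))
        return self.classifier(second)            # (batch, 2) logits


# Illustrative dimensions only; random tensors stand in for real features.
model = HierarchicalFusion(acoustic_dim=40, linguistic_dim=768,
                           paralinguistic_dim=88)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(4, 40), torch.randn(4, 768), torch.randn(4, 88))
```

Changing which two modalities enter `stage1` changes the fusion order, which is the variable the title refers to; comparing such orderings on held-out data would be one way to probe whether order matters for a given feature set.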
