Multi-View Speech Representation Learning for Parkinson's Disease Detection Using Context-guided Cross-modal Attention

Parkinson's disease (PD) is a progressive neurodegenerative disorder that frequently causes speech impairments associated with hypokinetic dysarthria. As speech production relies on the precise coordination of complex neuromuscular mechanisms, speech analysis has emerged as a promising non-invasive and cost-effective biomarker for early PD detection. Recent deep learning approaches have shown encouraging results; however, most existing methods rely on a single speech representation, potentially overlooking complementary pathological information encoded across different feature spaces. In this work, we propose a multi-branch deep learning framework for automatic PD detection from speech. Each recording is segmented into 5-second chunks and represented using three complementary modalities: Log-Mel spectrograms, MFCCs, and HuBERT embeddings extracted from raw waveforms. The spectrograms are processed using a pre-trained ResNet-18 encoder, MFCC sequences are modeled through a BiLSTM network, and raw speech is encoded using a pre-trained HuBERT model. To effectively integrate these heterogeneous representations, we introduce a context-guided cross-modal attention mechanism that dynamically weights temporal HuBERT embeddings according to the global acoustic context derived from the spectrogram and MFCC branches. Experiments conducted on the publicly available Spanish PC-GITA corpus under strict speaker-independent 5-fold cross-validation demonstrate the effectiveness of the proposed approach. The proposed architecture achieves an accuracy of 91.51%, an F1-score of 91.24%, and an AUC of 95.97%. Furthermore, ablation studies confirm the contribution of both the proposed context-guided cross-modal attention mechanism and the integration of complementary speech representations. These findings highlight the potential of heterogeneous speech modeling for robust and clinically reliable PD detection.
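The abstract does not give the exact form of the context-guided cross-modal attention, so the following is only a minimal NumPy sketch of one plausible reading: the global vectors from the spectrogram and MFCC branches are concatenated into a context, projected to a query, and used to score each HuBERT frame with scaled dot-product attention. The function name, the projection matrices `w_q` and `w_k`, the random stand-in features, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def context_guided_attention(hubert_seq, spec_vec, mfcc_vec, w_q, w_k):
    """Pool temporal HuBERT frames using weights driven by a global context.

    hubert_seq: (T, D) frame-level HuBERT embeddings for one chunk
    spec_vec:   (Ds,)  global vector from the spectrogram branch
    mfcc_vec:   (Dm,)  global vector from the MFCC branch
    w_q:        (d_k, Ds + Dm) query projection (hypothetical parameter)
    w_k:        (D, d_k)       key projection (hypothetical parameter)
    """
    # Global acoustic context from the two auxiliary branches.
    context = np.concatenate([spec_vec, mfcc_vec])
    # The context acts as a single query over all HuBERT frames.
    query = w_q @ context                              # (d_k,)
    keys = hubert_seq @ w_k                            # (T, d_k)
    # Scaled dot-product score for each frame, normalized over time.
    scores = keys @ query / np.sqrt(query.shape[0])    # (T,)
    weights = softmax(scores)                          # (T,), sums to 1
    # Context-weighted temporal pooling of the HuBERT embeddings.
    pooled = weights @ hubert_seq                      # (D,)
    return pooled, weights


# Illustrative sizes only (assumptions): T frames for one 5-second chunk,
# D-dimensional HuBERT frames, Ds and Dm for the two global branch vectors.
rng = np.random.default_rng(0)
T, D, Ds, Dm, d_k = 249, 768, 512, 256, 64

hubert_seq = rng.standard_normal((T, D))
spec_vec = rng.standard_normal(Ds)
mfcc_vec = rng.standard_normal(Dm)
w_q = rng.standard_normal((d_k, Ds + Dm)) * 0.01
w_k = rng.standard_normal((D, d_k)) * 0.01

pooled, weights = context_guided_attention(hubert_seq, spec_vec, mfcc_vec, w_q, w_k)

# One way to fuse the branches: concatenate the pooled HuBERT vector with
# both global vectors before a classifier (also an assumption).
fused = np.concatenate([pooled, spec_vec, mfcc_vec])
```

In this sketch the attention weights depend only on the spectrogram and MFCC context, so the same HuBERT sequence would be pooled differently under a different acoustic context, which is the behavior the abstract describes as dynamic weighting.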
