Large audio and language models have recently demonstrated zero-shot reasoning capabilities across various domains. However, it remains unclear how the form of the audio input, whether handcrafted acoustic features extracted from speech or the raw waveform itself, affects Parkinson's disease (PD) detection performance across languages. In this study, we systematically compare two input modalities for zero-shot PD detection: (i) handcrafted acoustic features extracted from speech recordings and analyzed by a general-purpose large language model (LLM), and (ii) direct waveform input analyzed by audio-capable models. Experiments on PD speech datasets in four languages show that performance varies across input modalities, speech tasks, and languages. Handcrafted acoustic features provide more stable performance in a low-resource language such as Bengali, whereas audio input yields dataset-dependent gains. These findings highlight the impact of input modality on zero-shot PD detection from speech.