Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.