SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.
