NeuroVoz: a Castilian Spanish corpus of parkinsonian speech

Janaína Mendes-Laureano, Jorge A. Gómez-García, Alejandro Guerrero-López, Elisa Luque-Buzo, Julián D. Arias-Londoño, Francisco J. Grandas-Pérez, Juan I. Godino-Llorente

From arXiv, preprint version

Progress in diagnosing Parkinson's Disease (PD) through speech analysis is hindered by a notable lack of publicly available datasets in diverse languages, which limits the reproducibility and further exploration of existing research. In response to this gap, we introduce a comprehensive corpus from 108 native Castilian Spanish speakers, comprising 55 healthy controls and 53 individuals diagnosed with PD, all of whom were under pharmacological treatment and recorded in their medication-optimized state. This dataset features a wide array of speech tasks, including sustained phonation of the five Spanish vowels, diadochokinetic tests, 16 listen-and-repeat utterances, and free monologues. The dataset emphasizes accuracy and reliability through specialist manual transcriptions of the listen-and-repeat tasks and uses Whisper for automated transcription of the monologues, making it the most complete public corpus of Parkinsonian speech and the first in Castilian Spanish. NeuroVoz comprises 2,903 audio recordings, averaging $26.88 \pm 3.35$ recordings per participant, offering a substantial resource for the scientific exploration of PD's impact on speech. This dataset has already underpinned several studies, achieving a benchmark accuracy of 89% in PD speech pattern identification, which indicates marked speech alterations attributable to PD. Despite these advances, the broader challenge of conducting a language-agnostic, cross-corpora analysis of Parkinsonian speech patterns remains an open area for future research. This contribution not only fills a critical void in PD speech analysis resources but also sets a new standard for the global research community in leveraging speech as a diagnostic tool for neurodegenerative diseases.
