Speech motion anomaly detection via cross-modal translation of 4D motion fields from tagged MRI

Understanding the relationship between tongue motion patterns during speech and the resulting speech acoustics (i.e., the articulatory-acoustic relation) is of great importance in assessing speech quality and developing innovative treatment and rehabilitative strategies. This is especially important when evaluating and detecting abnormal articulatory features in patients with speech-related disorders. In this work, we aim to develop a framework for detecting speech motion anomalies in conjunction with their corresponding speech acoustics. This is achieved with a deep cross-modal translator, trained on data from healthy individuals only, that bridges the gap between 4D motion fields obtained from tagged MRI and 2D spectrograms derived from speech acoustic data. The trained translator is then used as an anomaly detector by measuring the quality of the spectrograms it reconstructs for healthy individuals and for patients. In particular, the cross-modal translator is expected to generalize poorly to patient data, which contain unseen out-of-distribution patterns, and therefore to reconstruct patient spectrograms less accurately than those of healthy individuals. A one-class SVM is then used to distinguish the spectrograms of healthy individuals from those of patients. To validate our framework, we collected a total of 39 pairs of tagged MRI data and speech waveforms, from 36 healthy individuals and 3 tongue cancer patients. We used both 3D convolutional and transformer-based deep translation models, training them on the healthy training set and then applying them to both the healthy and patient testing sets. Our framework demonstrates a capability to detect abnormal patient data, thereby illustrating its potential in enhancing the understanding of the articulatory-acoustic relation for both healthy individuals and patients.
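
The detection step described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the trained translator is replaced by a hypothetical stand-in (`fake_translator_output`) that adds Gaussian noise to the reference spectrogram, the spectrograms are synthetic random arrays, and the choice of per-sample error features (mean squared and mean absolute error) fed to the one-class SVM is an assumption made purely for illustration. Only the overall idea, fitting a one-class SVM on healthy data and flagging poorly reconstructed samples, follows the abstract.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)


def fake_translator_output(target, noise_std):
    # Hypothetical stand-in for a trained motion-to-spectrogram translator:
    # returns the reference spectrogram corrupted by Gaussian noise.
    return target + rng.normal(0.0, noise_std, size=target.shape)


def reconstruction_features(target, predicted):
    # Per-sample error features (an assumed choice, not taken from the paper).
    diff = predicted - target
    mse = np.mean(diff ** 2, axis=(1, 2))
    mae = np.mean(np.abs(diff), axis=(1, 2))
    return np.stack([mse, mae], axis=1)


# Synthetic "spectrograms" with shape (n_samples, freq_bins, time_frames).
healthy_train = rng.random((30, 16, 16))
healthy_test = rng.random((6, 16, 16))
patient_test = rng.random((3, 16, 16))

# Assumption: reconstructions are accurate for healthy data (small noise)
# and degraded for out-of-distribution patient data (large noise).
train_feats = reconstruction_features(
    healthy_train, fake_translator_output(healthy_train, 0.05))
healthy_feats = reconstruction_features(
    healthy_test, fake_translator_output(healthy_test, 0.05))
patient_feats = reconstruction_features(
    patient_test, fake_translator_output(patient_test, 0.30))

# Fit the one-class SVM on healthy features only; +1 = inlier, -1 = anomaly.
detector = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.1),
)
detector.fit(train_feats)

healthy_pred = detector.predict(healthy_feats)
patient_pred = detector.predict(patient_feats)
print("healthy test predictions:", healthy_pred)
print("patient test predictions:", patient_pred)
```

In this toy setting the patient features lie far outside the region occupied by the healthy training features, so the detector labels them as anomalies. With real data the separation depends on how much worse the translator actually performs on patient motion fields.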
