Real-time Magnetic Resonance Imaging (rtMRI) visualizes vocal tract action, offering a comprehensive window into speech articulation. However, its signals are high-dimensional and noisy, hindering interpretation. We investigate compact representations of spatiotemporal articulatory dynamics for phoneme recognition from midsagittal vocal tract rtMRI videos. We compare three feature types: (1) raw video, (2) optical flow, and (3) six linguistically relevant regions of interest (ROIs) capturing articulator movements. We evaluate models trained independently on each representation, as well as multi-feature combinations. Results show that multi-feature models consistently outperform single-feature baselines, with the lowest phoneme error rate (PER) of 0.34 obtained by combining ROI and raw video features. Temporal fidelity experiments demonstrate that the models rely on fine-grained articulatory dynamics, while ROI ablation studies reveal strong contributions from the tongue and lips. Our findings highlight how rtMRI-derived features offer both accuracy and interpretability, and establish strategies for leveraging articulatory data in speech processing.