We propose ARTI-6, a compact six-dimensional articulatory speech encoding framework derived from real-time MRI data that captures crucial vocal tract regions, including the velum, tongue root, and larynx. ARTI-6 consists of three components: (1) a six-dimensional articulatory feature set representing key regions of the vocal tract; (2) an articulatory inversion model, which leverages speech foundation models to predict articulatory features from speech acoustics and achieves a prediction correlation of 0.87; and (3) an articulatory synthesis model, which reconstructs intelligible speech directly from articulatory features, showing that even a low-dimensional representation can generate natural-sounding speech. Together, these components make ARTI-6 an interpretable, computationally efficient, and physiologically grounded framework for advancing articulatory inversion, synthesis, and broader speech technology applications. The source code and speech samples are publicly available.
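To make the reported prediction correlation concrete, the following is a minimal sketch of how such a score could be computed for six-dimensional articulatory trajectories. The abstract does not state the exact protocol, so the choice of Pearson correlation, the averaging over dimensions, and the names `pearson` and `mean_correlation` are assumptions for illustration, not the authors' evaluation code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length trajectories."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("trajectories must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / math.sqrt(var_x * var_y)


def mean_correlation(predicted, reference):
    """Average per-dimension correlation (assumed protocol).

    Both arguments are lists of frames; each frame holds one value per
    articulatory dimension (six for a feature set like ARTI-6).
    Returns (mean score, list of per-dimension scores).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same number of frames")
    # Transpose from frame-major to dimension-major layout.
    pred_dims = list(zip(*predicted))
    ref_dims = list(zip(*reference))
    scores = [pearson(p, r) for p, r in zip(pred_dims, ref_dims)]
    return sum(scores) / len(scores), scores


# Synthetic data only: 10 frames, 6 dimensions, not real rtMRI features.
reference = [[float(t * (d + 1)) + (t % 3) for d in range(6)] for t in range(10)]
# A prediction that is an affine function of the reference correlates perfectly.
predicted = [[2.0 * v + 1.0 for v in frame] for frame in reference]
mean_score, per_dim = mean_correlation(predicted, reference)
```

Averaging per-dimension scores keeps each articulatory region equally weighted; pooling all dimensions into a single correlation would be a different (also plausible) convention.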