Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix Factorization via Plastic Transformer

The tongue's intricate 3D structure, comprising localized functional units, plays a crucial role in the production of speech. When measured using tagged MRI, these functional units exhibit cohesive displacements and derived quantities that facilitate the complex process of speech production. Non-negative matrix factorization-based approaches have been shown to estimate the functional units from motion features, yielding a set of building blocks and a corresponding weighting map. Investigating the link between weighting maps and speech acoustics can offer significant insights into the intricate process of speech production. To this end, we use two-dimensional spectrograms as a proxy representation and develop an end-to-end deep learning framework for translating weighting maps to their corresponding audio waveforms. Our proposed plastic light transformer (PLT) framework is based on directional product relative position bias and single-level spatial pyramid pooling, enabling flexible mapping of variable-size weighting maps to fixed-size spectrograms without input information loss or dimension expansion. Additionally, our PLT framework efficiently models the global correlation of wide matrix inputs. To improve the realism of the generated spectrograms with relatively limited training samples, we apply pair-wise utterance consistency with a Maximum Mean Discrepancy constraint and adversarial training. Experimental results on a dataset of 29 subjects speaking two utterances demonstrate that our framework can synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.
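Two ingredients named in the abstract can be illustrated with a minimal sketch: pooling a variable-size matrix into a fixed grid of bins (the idea behind single-level spatial pyramid pooling) and a Maximum Mean Discrepancy (MMD) estimate between two sets of feature vectors. This is not the authors' implementation. The function names, the choice of max pooling, the Gaussian kernel, and the bandwidth `sigma` are all assumptions made for illustration, and the sketch uses only the Python standard library.

```python
import math


def fixed_grid_max_pool(matrix, out_rows, out_cols):
    """Max-pool a 2D list of any size into an out_rows x out_cols grid.

    Hypothetical helper: bin edges use floor/ceil of proportional positions,
    so every input cell falls into at least one bin (bins may overlap).
    """
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for i in range(out_rows):
        r0 = (i * rows) // out_rows
        r1 = ((i + 1) * rows + out_rows - 1) // out_rows  # integer ceil
        pooled_row = []
        for j in range(out_cols):
            c0 = (j * cols) // out_cols
            c1 = ((j + 1) * cols + out_cols - 1) // out_cols  # integer ceil
            pooled_row.append(
                max(matrix[r][c] for r in range(r0, r1) for c in range(c0, c1))
            )
        pooled.append(pooled_row)
    return pooled


def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between two sets of vectors."""

    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, sigma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Two "weighting maps" of different widths are pooled to the same 1 x 2 shape.
narrow_map = [[1, 2, 3, 4], [5, 6, 7, 8]]
wide_map = [[1, 2, 3, 4, 5, 6], [0, 0, 0, 0, 0, 9]]
pooled_narrow = fixed_grid_max_pool(narrow_map, 1, 2)
pooled_wide = fixed_grid_max_pool(wide_map, 1, 2)

# MMD is zero for identical sets and grows as the sets move apart.
same = mmd_squared([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]])
different = mmd_squared([[0.0, 0.0], [1.0, 1.0]], [[5.0, 5.0], [6.0, 6.0]])
```

In a training setting, a term like `mmd_squared` could act as a penalty that pulls the feature distributions of paired utterances together, while the fixed-grid pooling lets inputs of different sizes feed a layer that expects a constant shape. How the paper combines these pieces with the transformer and the adversarial loss is not specified in the abstract.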
