Thanks to recent deep learning algorithms, silent speech interfaces (SSIs) can now synthesize intelligible speech from articulatory movement data under certain conditions. However, the resulting models are rather speaker-specific, which makes switching quickly between users troublesome. Even for the same speaker, these models perform poorly across sessions, i.e., after the recording equipment has been dismounted and re-mounted. To aid quick speaker and session adaptation of SSI models based on ultrasound tongue imaging, we extend our deep networks with a spatial transformer network (STN) module capable of performing an affine transformation on the input images. Although the STN part takes up only about 10% of the network, our experiments show that adapting just the STN module can reduce MSE by 88% on average, compared to retraining the whole network. The improvement is even larger (around 92%) when adapting the network to different recording sessions from the same speaker.
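To make the affine transformation performed by an STN concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the common STN convention of normalized coordinates in [-1, 1] with bilinear sampling and zero padding; the function name `affine_warp` and the parameter `theta` (a 2x3 affine matrix) are illustrative choices, and a real model would use a deep learning framework rather than nested lists.

```python
import math


def affine_warp(image, theta):
    """Warp a 2D image (list of rows) with a 2x3 affine matrix `theta`.

    Assumed convention: pixel coordinates are normalized to [-1, 1].
    For every output pixel (x_t, y_t), the source location is
    (x_s, y_s) = theta @ (x_t, y_t, 1), and the output value is read
    there by bilinear interpolation. Locations outside the image are
    treated as zeros. The image must be at least 2x2.
    """
    h, w = len(image), len(image[0])

    def pixel(r, c):
        # Zero padding outside the image bounds.
        return image[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Normalized coordinates of the output (target) pixel.
            x_t = 2.0 * j / (w - 1) - 1.0
            y_t = 2.0 * i / (h - 1) - 1.0
            # Apply the affine map to find the source location.
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            # Convert back to (fractional) pixel indices.
            c = (x_s + 1.0) * (w - 1) / 2.0
            r = (y_s + 1.0) * (h - 1) / 2.0
            c0, r0 = math.floor(c), math.floor(r)
            dc, dr = c - c0, r - r0
            # Bilinear interpolation over the four neighbouring pixels.
            out[i][j] = (
                (1 - dr) * (1 - dc) * pixel(r0, c0)
                + (1 - dr) * dc * pixel(r0, c0 + 1)
                + dr * (1 - dc) * pixel(r0 + 1, c0)
                + dr * dc * pixel(r0 + 1, c0 + 1)
            )
    return out


# Identity parameters leave the image unchanged.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# A horizontal offset of 1.0 in normalized units; on a 3-pixel-wide
# image this reads each output pixel from one column to its right.
SHIFT_X = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

frame = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
same = affine_warp(frame, IDENTITY)
shifted = affine_warp(frame, SHIFT_X)
```

In an STN, a small sub-network predicts the six entries of `theta` from the input image, and the warped image is passed on to the rest of the model. Adapting only the STN, as described above, would then mean updating just the part that produces `theta` while the remaining layers stay fixed.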