Thanks to the latest deep learning algorithms, silent speech interfaces (SSIs) can now synthesize intelligible speech from articulatory movement data under certain conditions. However, the resulting models are rather speaker-specific, which makes switching quickly between users difficult. Even for the same speaker, these models perform poorly across sessions, i.e., after the recording equipment has been dismounted and re-mounted. To aid quick speaker and session adaptation of ultrasound tongue imaging-based SSI models, we extend our deep networks with a spatial transformer network (STN) module, which can perform an affine transformation on the input images. Although the STN part takes up only about 10% of the network, our experiments show that adapting just the STN module may reduce the MSE by 88% on average, compared to retraining the whole network. The improvement is even larger (around 92%) when adapting the network to different recording sessions from the same speaker.
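To make the affine transformation step concrete, the following is a minimal sketch in plain Python of the resampling an STN applies to an input image. It is an illustration under stated assumptions, not the authors' implementation: the function name `affine_warp`, the normalized [-1, 1] coordinate convention, bilinear interpolation, and zero padding outside the image are all choices made here for the example. In an actual STN the six entries of `theta` would be predicted by a small localization sub-network; here they are passed in by hand.

```python
import math

# Identity transform: the warped image equals the input.
IDENTITY = ((1.0, 0.0, 0.0),
            (0.0, 1.0, 0.0))


def _pixel(image, x, y):
    """Return image[y][x], or 0.0 if (x, y) lies outside the image."""
    h, w = len(image), len(image[0])
    if 0 <= y < h and 0 <= x < w:
        return image[y][x]
    return 0.0


def _bilinear(image, px, py):
    """Bilinearly interpolate the image at real-valued pixel position (px, py)."""
    x0, y0 = math.floor(px), math.floor(py)
    fx, fy = px - x0, py - y0
    return ((1 - fx) * (1 - fy) * _pixel(image, x0, y0)
            + fx * (1 - fy) * _pixel(image, x0 + 1, y0)
            + (1 - fx) * fy * _pixel(image, x0, y0 + 1)
            + fx * fy * _pixel(image, x0 + 1, y0 + 1))


def _normalize(index, size):
    """Map a pixel index in [0, size - 1] to a coordinate in [-1, 1]."""
    return 2.0 * index / (size - 1) - 1.0 if size > 1 else 0.0


def affine_warp(image, theta):
    """Warp a 2-D image (list of rows) with a 2x3 affine matrix `theta`.

    For every output pixel, theta maps its normalized coordinates to a
    source location in the input, which is then sampled bilinearly.
    """
    h, w = len(image), len(image[0])
    (a, b, tx), (c, d, ty) = theta
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        yn = _normalize(i, h)
        for j in range(w):
            xn = _normalize(j, w)
            # Source location in normalized coordinates.
            xs = a * xn + b * yn + tx
            ys = c * xn + d * yn + ty
            # Convert back to pixel coordinates before sampling.
            px = (xs + 1.0) * (w - 1) / 2.0
            py = (ys + 1.0) * (h - 1) / 2.0
            out[i][j] = _bilinear(image, px, py)
    return out
```

With `IDENTITY` the image passes through unchanged, while a nonzero `tx` or `ty` shifts the sampling grid, which is the kind of correction that could compensate for a slightly different probe position between recording sessions. Because the whole transform is described by only six numbers, it is plausible that adapting the module that predicts them is much cheaper than retraining the full network.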