Thanks to the latest deep learning algorithms, silent speech interfaces (SSI) are now able to synthesize intelligible speech from articulatory movement data under certain conditions. However, the resulting models are rather speaker-specific, which makes switching quickly between users troublesome. Even for the same speaker, these models perform poorly across sessions, i.e., after the recording equipment has been dismounted and re-mounted. To aid quick speaker and session adaptation of ultrasound tongue imaging-based SSI models, we extend our deep networks with a spatial transformer network (STN) module, which is capable of performing an affine transformation on the input images. Although the STN part takes up only about 10\% of the network, our experiments show that adapting just the STN module might reduce the mean squared error (MSE) by 88\% on average, compared to retraining the whole network. The improvement is even larger (around 92\%) when adapting the network to different recording sessions from the same speaker.
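The affine warp that an STN module applies to its input images can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`affine_grid`, `bilinear_sample`, `spatial_transform`), the normalized $[-1, 1]$ coordinate convention, and the zero padding outside the image are all assumptions made for illustration. In an actual STN, the six entries of `theta` would be predicted by a small localization sub-network; here they are simply passed in by hand.

```python
import math


def affine_grid(theta, height, width):
    """For every output pixel, compute the source coordinate to sample from.

    theta is a 2x3 affine matrix [[a, b, tx], [c, d, ty]] acting on
    normalized coordinates, where -1 and +1 are the image borders
    (an assumed convention for this sketch).
    """
    grid = []
    for i in range(height):
        y = -1.0 + 2.0 * i / (height - 1) if height > 1 else 0.0
        row = []
        for j in range(width):
            x = -1.0 + 2.0 * j / (width - 1) if width > 1 else 0.0
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Sample the image at the (possibly fractional) grid coordinates."""
    height, width = len(image), len(image[0])

    def pixel(r, c):
        # Zero padding: positions outside the image contribute nothing.
        if 0 <= r < height and 0 <= c < width:
            return image[r][c]
        return 0.0

    out = []
    for grid_row in grid:
        out_row = []
        for xs, ys in grid_row:
            # Convert normalized coordinates back to pixel indices.
            col = (xs + 1.0) * (width - 1) / 2.0
            row = (ys + 1.0) * (height - 1) / 2.0
            c0, r0 = math.floor(col), math.floor(row)
            wc, wr = col - c0, row - r0
            # Weighted average of the four neighbouring pixels.
            value = ((1 - wr) * (1 - wc) * pixel(r0, c0)
                     + (1 - wr) * wc * pixel(r0, c0 + 1)
                     + wr * (1 - wc) * pixel(r0 + 1, c0)
                     + wr * wc * pixel(r0 + 1, c0 + 1))
            out_row.append(value)
        out.append(out_row)
    return out


def spatial_transform(image, theta):
    """Apply the affine warp defined by theta to a 2D image (list of rows)."""
    grid = affine_grid(theta, len(image), len(image[0]))
    return bilinear_sample(image, grid)


if __name__ == "__main__":
    toy_image = [[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 9.0]]
    identity = [[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]]
    # A horizontal offset of 1.0 in normalized units is one pixel at width 3.
    shift = [[1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
    print(spatial_transform(toy_image, identity))
    print(spatial_transform(toy_image, shift))
```

Because the whole warp is controlled by only six numbers, a misalignment between speakers or recording sessions that is roughly a shift, rotation, or scaling of the ultrasound image could in principle be compensated by updating only the part of the model that produces `theta`, which is the kind of adaptation the abstract describes.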