Most organisms, including humans, function by coordinating and integrating sensory signals with motor actions to survive and accomplish desired tasks. Learning these complex sensorimotor mappings proceeds simultaneously and often in an unsupervised or semi-supervised fashion. In this work, we explore an autoencoder architecture (MirrorNet) inspired by this sensorimotor learning paradigm to control an articulatory synthesizer with minimal exposure to ground-truth articulatory data. The articulatory synthesizer takes as input a set of six vocal tract variables (TVs) and source features (voicing indicators and pitch) and can synthesize continuous speech for unseen speakers. We show that the MirrorNet, once initialized with about 30 minutes of articulatory data and further trained in an unsupervised fashion (the "learning phase"), can learn meaningful articulatory representations with accuracy comparable to that of articulatory speech-inversion systems trained in a fully supervised fashion.