We present a fully on-device streaming Speech2Speech conversion model that normalizes a given input speech directly to synthesized output speech. Deploying such a model on mobile devices poses significant challenges in terms of memory footprint and computation requirements. We propose a streaming approach that achieves an acceptable delay with minimal loss in speech conversion quality compared to a state-of-the-art non-streaming reference. Our method first runs the encoder in streaming mode, in real time, while the speaker is speaking. Then, as soon as the speaker stops speaking, we run the spectrogram decoder in streaming mode alongside a streaming vocoder to generate output speech. To achieve an acceptable delay-quality trade-off, we propose a novel hybrid approach for look-ahead in the encoder, which combines a look-ahead feature stacker with look-ahead self-attention. We show that our streaming approach is almost 2x faster than real time on the Pixel4 CPU.
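The hybrid look-ahead idea can be illustrated with a minimal sketch. It is not the model's actual implementation: the function names, the zero-padding at the end of the utterance, the absence of subsampling, and all numeric values below are assumptions made for illustration only. The sketch shows (i) a feature stacker that concatenates each frame with a few future frames, (ii) a self-attention mask that lets each position see a bounded number of future positions, and (iii) how the two sources of look-ahead add up to the total number of future frames the encoder must wait for.

```python
from typing import List

Frame = List[float]


def stack_with_lookahead(frames: List[Frame], right: int) -> List[Frame]:
    """Concatenate each frame with its next `right` frames.

    Assumption: positions past the end of the input are zero-padded.
    """
    if not frames:
        return []
    pad = [0.0] * len(frames[0])
    stacked = []
    for t in range(len(frames)):
        window: Frame = []
        for k in range(right + 1):
            src = frames[t + k] if t + k < len(frames) else pad
            window.extend(src)
        stacked.append(window)
    return stacked


def lookahead_attention_mask(n: int, left: int, right: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[i - left <= j <= i + right for j in range(n)] for i in range(n)]


def total_lookahead(stack_right: int, attn_right: int, num_attn_layers: int) -> int:
    """Future input frames needed before the output for a frame can be emitted.

    Assumption: no subsampling, so each attention layer with `attn_right`
    future positions extends the look-ahead by `attn_right` input frames.
    """
    return stack_right + attn_right * num_attn_layers


# Hypothetical toy input: three one-dimensional frames.
frames = [[1.0], [2.0], [3.0]]
stacked = stack_with_lookahead(frames, right=1)
mask = lookahead_attention_mask(4, left=1, right=1)
delay_frames = total_lookahead(stack_right=2, attn_right=1, num_attn_layers=3)
```

In this sketch the stacker contributes its look-ahead once, whereas the attention look-ahead accumulates across layers, which is one reason splitting the look-ahead budget between the two components can affect the delay-quality trade-off.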