We present a fully on-device, streaming speech-to-speech conversion model that normalizes input speech directly into synthesized output speech (a.k.a. Parrotron). Deploying such an end-to-end model locally on mobile devices poses significant challenges in terms of memory footprint and computation requirements. In this paper, we present a streaming-based approach that achieves an acceptable delay with minimal loss in speech conversion quality compared to a reference state-of-the-art non-streaming approach. Our method first streams the encoder in real time while the speaker is speaking. Then, as soon as the speaker stops speaking, we run the spectrogram decoder in streaming mode alongside a streaming vocoder to generate output speech in real time. To achieve an acceptable delay-quality trade-off, we propose a novel hybrid approach to look-ahead in the encoder, which combines a look-ahead feature stacker with look-ahead self-attention. We also compare the model under int4 quantization-aware training and int8 post-training quantization, and show that our streaming approach is 2x faster than real time on the Pixel 4 CPU.
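To make the hybrid look-ahead idea concrete, the following is a minimal NumPy sketch, not the implementation described in the paper. It assumes a stacker that concatenates each frame with a fixed number of future frames, and a single self-attention layer whose mask limits each query to a bounded window of past and future frames. The function names, window sizes, and the projection-free attention are all hypothetical and chosen only for illustration.

```python
import numpy as np


def stack_with_lookahead(frames: np.ndarray, right: int) -> np.ndarray:
    """Concatenate each frame with its next `right` frames (zero-padded at the end)."""
    num_frames, dim = frames.shape
    padding = np.zeros((right, dim), dtype=frames.dtype)
    padded = np.concatenate([frames, padding], axis=0)
    # Slice i holds, for every time step t, the frame at t + i.
    return np.concatenate([padded[i:i + num_frames] for i in range(right + 1)], axis=1)


def lookahead_attention_mask(num_frames: int, left: int, right: int) -> np.ndarray:
    """mask[t, s] is True when query t may attend to key s, i.e. t - left <= s <= t + right."""
    positions = np.arange(num_frames)
    offsets = positions[None, :] - positions[:, None]  # offsets[t, s] = s - t
    return (offsets >= -left) & (offsets <= right)


def masked_self_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head dot-product self-attention without learned projections (illustration only)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = np.where(mask, scores, -np.inf)              # block positions outside the window
    scores = scores - scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x


# Hypothetical sizes: 6 frames of 2 features, stacker look-ahead 2, attention look-ahead 1.
frames = np.arange(12, dtype=float).reshape(6, 2)
stacked = stack_with_lookahead(frames, right=2)
mask = lookahead_attention_mask(num_frames=6, left=1, right=1)
out = masked_self_attention(stacked, mask)
```

In this sketch the two look-aheads add up: with a stacker look-ahead of 2 frames and an attention look-ahead of 1 frame, the output at time step t depends on input frames up to t + 3 and on nothing later. This bounded total right context is what keeps the encoder delay fixed in a streaming setting; how the budget is best split between the stacker and the attention layers is the delay-quality trade-off the abstract refers to.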