Streaming models are an essential component of real-time speech enhancement tools. The streaming regime constrains speech enhancement models to use only a very small amount of future context. As a result, low-latency streaming is generally considered a challenging setting, and the constraint significantly degrades model quality. However, the sequential nature of streaming generation offers a natural opportunity for autoregression, that is, using previous predictions when making the current one. The conventional method for training autoregressive models is teacher forcing, but its primary drawback is the training-inference mismatch: during training the model is conditioned on ground-truth past outputs, whereas at inference it must rely on its own predictions, which can lead to a substantial degradation in quality. In this study, we propose a straightforward yet effective alternative technique for training autoregressive low-latency speech enhancement models. We demonstrate that the proposed approach yields consistent improvements across diverse architectures and training scenarios.
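To make the training-inference mismatch concrete, the following toy sketch contrasts a teacher-forced pass with a free-running (inference-style) pass. It is only an illustration of the general idea, not the technique proposed in this study. The names `toy_model`, `run_teacher_forced`, and `run_free_running`, the scalar "frames", and the fixed mixing weights are all assumptions made for brevity.

```python
from typing import Callable, List, Sequence

# Hypothetical one-step "enhancer": a real model would be a neural network
# operating on spectral or waveform frames, not on single floats.
def toy_model(noisy: float, prev: float) -> float:
    """Mix the current noisy frame with the previous clean estimate."""
    return 0.5 * noisy + 0.5 * prev


def run_teacher_forced(
    noisy: Sequence[float],
    clean: Sequence[float],
    step: Callable[[float, float], float] = toy_model,
) -> List[float]:
    """Training-style pass: feed back the previous ground-truth clean frame."""
    outputs: List[float] = []
    prev = 0.0  # assumed initial state before the first frame
    for x, y in zip(noisy, clean):
        outputs.append(step(x, prev))
        prev = y  # ground truth, which is not available at inference time
    return outputs


def run_free_running(
    noisy: Sequence[float],
    step: Callable[[float, float], float] = toy_model,
) -> List[float]:
    """Inference-style pass: feed back the model's own previous prediction."""
    outputs: List[float] = []
    prev = 0.0
    for x in noisy:
        y_hat = step(x, prev)
        outputs.append(y_hat)
        prev = y_hat  # any error in y_hat propagates to later frames
    return outputs


if __name__ == "__main__":
    clean = [1.0, 1.0, 1.0, 1.0]
    noisy = [1.5, 0.5, 1.5, 0.5]
    print("teacher-forced:", run_teacher_forced(noisy, clean))
    print("free-running:  ", run_free_running(noisy))
```

The two passes agree on the first frame and diverge afterwards, because the conditioning input differs: a model trained only on the first kind of pass never sees the kind of inputs it receives in the second.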