To ensure the reliable operation of speech systems across diverse environments, noise addition methods have emerged as the prevailing solution. However, existing methods offer limited coverage of real-world noisy scenes and depend on pre-existing scene-based information and noise. This paper presents prompt-based Dynamic Generative Scene-based Noise Addition (DGSNA), a novel noise addition methodology that integrates Dynamic Generation of Scene-based Information (DGSI) with Scene-based Noise Addition for Speech (SNAS). This integration facilitates automated scene-based noise addition by transforming clean speech into various noise environments, thereby providing a more comprehensive and realistic simulation of diverse noise conditions. Experimental results demonstrate that DGSNA significantly enhances the robustness of speech recognition and keyword spotting models across various noise conditions, achieving a relative improvement of up to 11.21%. Furthermore, DGSNA can be effectively integrated with other noise addition methods to enhance performance. Our implementation and demonstrations are available at https://dgsna.github.io.
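The abstract describes transforming clean speech into various noise environments. While DGSNA's full pipeline (DGSI + SNAS) is not detailed here, the core mixing step in any scene-based noise addition scheme is combining clean speech with scene noise at a controlled signal-to-noise ratio. The sketch below illustrates that step only; the function name `mix_at_snr` and its interface are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Illustrative sketch: mix scene noise into clean speech at a target SNR (dB).

    This is NOT the DGSNA implementation, just the standard SNR-controlled
    mixing step that scene-based noise addition methods build on.
    """
    # Tile or truncate the noise so it matches the clean signal's length.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]

    # Scale the noise so that 10 * log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

In a DGSNA-style pipeline, the scene noise fed into such a step would itself be generated from prompt-derived scene information rather than drawn from a fixed pre-recorded corpus, which is the coverage limitation of existing methods that the abstract highlights.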