Spoken Language Understanding (SLU) is a critical component of voice assistants; it converts speech into semantic parses for task execution. Previous work has explored end-to-end models with Deliberation to improve the quality and robustness of SLU; however, these models have remained autoregressive, resulting in higher latency. In this work, we introduce PRoDeliberation, a novel method that leverages a Connectionist Temporal Classification (CTC)-based decoding strategy and a denoising objective to train robust non-autoregressive deliberation models. We show that PRoDeliberation achieves the latency reduction of parallel decoding (a 2-10x improvement over autoregressive models) while retaining the ability of autoregressive deliberation systems to correct Automatic Speech Recognition (ASR) mistranscriptions. We further show that the design of the denoising training allows PRoDeliberation to overcome the limitations of small ASR devices, and we analyze the necessity of each component of the system.
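As a rough illustration of why CTC-based decoding can run in parallel, the sketch below applies the standard CTC collapse rule (merge consecutive repeats, then drop blanks) to a sequence of per-position argmax token ids. It is not the PRoDeliberation implementation: the function name `ctc_greedy_collapse`, the blank id `0`, and the example token ids are assumptions made for illustration only.

```python
from itertools import groupby

BLANK = 0  # assumed id of the CTC blank symbol (hypothetical choice)


def ctc_greedy_collapse(position_ids, blank=BLANK):
    """Apply the standard CTC collapse rule to per-position argmax ids.

    Every position is predicted independently, so the argmax over all
    positions can be taken in one parallel pass; only this cheap
    post-processing step runs over the sequence.
    """
    # Step 1: merge runs of identical consecutive ids into one id.
    merged = [token for token, _ in groupby(position_ids)]
    # Step 2: remove blanks, which only act as separators or padding.
    return [token for token in merged if token != blank]


# Hypothetical argmax output over 8 positions.
print(ctc_greedy_collapse([0, 3, 3, 0, 3, 5, 5, 0]))  # prints [3, 3, 5]
```

The blank between the two runs of `3` is what lets a repeated token survive the collapse, which is why the blank symbol is needed in the output vocabulary.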