Automatic speech recognition (ASR) systems have become increasingly widespread in recent years. However, their text output often requires post-processing before it can be used in practice. To address this issue, we draw inspiration from the multifaceted capabilities of LLMs and Whisper and integrate multiple ASR-related text processing tasks into the ASR model itself. This integration shortens the multi-stage pipeline and prevents cascading errors from propagating between stages, so the model generates post-processed text directly. In this study, we focus on Contextual ASR and multiple ASR post-processing tasks. To this end, we introduce the CPPF model, a versatile and highly effective alternative to multi-stage ASR processing. CPPF integrates these tasks without any significant loss in recognition performance.