Automatic speech recognition (ASR) systems have become increasingly widespread in recent years. However, their text output usually requires post-processing before it can be used in practice. To address this issue, we draw inspiration from the multifaceted capabilities of LLMs and Whisper and integrate multiple ASR-related text processing tasks into the ASR model itself. This integration shortens the multi-stage pipeline and prevents errors from cascading between stages, so the model generates post-processed text directly. In this study, we focus on two kinds of ASR-related tasks: contextual ASR and multiple ASR post-processing tasks. To this end, we introduce the CPPF model, a versatile and effective alternative to multi-stage ASR processing. CPPF integrates these tasks without any significant loss in recognition performance.