End-to-end automatic speech recognition (ASR) models and large language models, such as Whisper and GPT-2, have recently been scaled to use vast amounts of training data. Even so, infrequent content words that occur in a particular task may still be recognized poorly, and contextual biasing is a possible remedy. This paper investigates the effectiveness of neural contextual biasing for Whisper combined with GPT-2. Specifically, it proposes integrating an adapted tree-constrained pointer generator (TCPGen) component into Whisper, together with a dedicated training scheme, to dynamically adjust the final output without modifying any Whisper model parameters. Experiments across three datasets show a considerable reduction in errors on biasing words with a biasing list of 1000 words. Contextual biasing was more effective when applied to domain-specific data, and it can boost the performance of Whisper and GPT-2 without losing their generality.
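To make the idea of adjusting the final output while leaving the base model untouched more concrete, the following is a minimal sketch of tree-constrained biasing, not the paper's implementation. It assumes character-level tokens instead of Whisper's wordpieces, and it takes the pointer scores `ptr_scores` and the generation probability `p_gen` as plain inputs, whereas in a TCPGen-style component they would be predicted by trained layers. All function and variable names are hypothetical.

```python
from typing import Dict, List


def build_prefix_tree(words: List[str]) -> dict:
    """Store the biasing words in a character-level prefix tree (trie)."""
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["<end>"] = {}  # marks a complete biasing word
    return root


def valid_next_tokens(tree: dict, prefix: str) -> List[str]:
    """Return the tokens that extend `prefix` along some biasing word."""
    node = tree
    for ch in prefix:
        if ch not in node:
            return []  # the prefix has left the tree, so nothing is biased
        node = node[ch]
    return sorted(tok for tok in node if tok != "<end>")


def bias_distribution(
    p_model: Dict[str, float],
    ptr_scores: Dict[str, float],
    valid: List[str],
    p_gen: float,
) -> Dict[str, float]:
    """Interpolate the frozen model's distribution with a pointer
    distribution that is restricted to the valid tree continuations.

    Assumption: `ptr_scores` are non-negative scores and `p_gen` lies in
    [0, 1]; here both are supplied by the caller rather than learned.
    """
    total = sum(ptr_scores.get(tok, 0.0) for tok in valid)
    if total == 0.0:
        return dict(p_model)  # no valid continuation: keep the model output
    p_ptr = {tok: ptr_scores.get(tok, 0.0) / total for tok in valid}
    return {
        tok: (1.0 - p_gen) * p + p_gen * p_ptr.get(tok, 0.0)
        for tok, p in p_model.items()
    }


# Usage: bias the next-token distribution after the partial word "tu".
tree = build_prefix_tree(["turner", "tuba"])
valid = valid_next_tokens(tree, "tu")
p_model = {"b": 0.1, "r": 0.2, "n": 0.7}  # toy output of the frozen model
biased = bias_distribution(p_model, {"b": 1.0, "r": 3.0}, valid, p_gen=0.5)
```

In this toy case the probability mass shifts toward the continuations `b` and `r` that lie on the tree, while `p_model` itself is never changed, which mirrors the idea of biasing without modifying the base model's parameters.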