Prompts are crucial to large language models because they provide contextual information such as the topic or logical relationships. Inspired by this, we propose PromptASR, a framework that integrates prompts into end-to-end automatic speech recognition (E2E ASR) systems to achieve contextualized ASR with a controllable transcription style. Specifically, a dedicated text encoder encodes the text prompts, and the resulting encodings are injected into the speech encoder through cross-attention between the features of the two modalities. When the ground-truth text of preceding utterances is used as the content prompt, the proposed system achieves relative word error rate reductions of 21.9% and 6.8% on a book-reading dataset and an in-house dataset, respectively, compared to a baseline ASR system. The system can also take word-level biasing lists as prompts to improve recognition accuracy on rare words. An additional style prompt can be given to the text encoder to guide the ASR system to output different styles of transcription. The code is available at icefall.
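The injection step described above, in which speech features attend to prompt encodings, can be illustrated with a minimal single-head cross-attention sketch in plain Python. This is not the PromptASR or icefall implementation: the function names (`matmul`, `softmax`, `cross_attend`), the random projection matrices, the toy dimensions, and the residual addition are all assumptions made for illustration, and the actual system may use multi-head attention, normalization, and a different placement inside the speech encoder.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(speech, prompt, w_q, w_k, w_v):
    """Inject prompt encodings into speech features with one attention head.

    speech: T frames x d, prompt: N tokens x d, w_q/w_k/w_v: d x d projections.
    Returns (fused speech features, T x d; attention weights, T x N).
    """
    d = len(w_q[0])
    queries = matmul(speech, w_q)  # queries come from the speech frames
    keys = matmul(prompt, w_k)     # keys and values come from the prompt tokens
    values = matmul(prompt, w_v)
    keys_t = [list(col) for col in zip(*keys)]  # transpose to d x N
    scores = matmul(queries, keys_t)            # T x N similarity scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    context = matmul(weights, values)           # T x d prompt summary per frame
    # Residual addition is an assumption here: it keeps the acoustic
    # information and adds the prompt context on top of it.
    fused = [[s + c for s, c in zip(s_row, c_row)]
             for s_row, c_row in zip(speech, context)]
    return fused, weights


# Toy inputs (hypothetical sizes): 6 speech frames, 3 prompt tokens, dimension 4.
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


d_model = 4
speech_feats = rand_matrix(6, d_model)
prompt_feats = rand_matrix(3, d_model)
w_q, w_k, w_v = (rand_matrix(d_model, d_model) for _ in range(3))

fused, weights = cross_attend(speech_feats, prompt_feats, w_q, w_k, w_v)
print(len(fused), len(fused[0]))      # one fused vector per speech frame
print(len(weights), len(weights[0]))  # one weight per (frame, prompt token) pair
```

Each speech frame receives its own weighting over the prompt tokens, so the output keeps the length of the speech sequence regardless of how long the prompt is. That property is what allows content prompts, biasing lists, and style prompts of different lengths to be passed through the same interface.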