Prompts are crucial to large language models because they provide context such as the topic or logical relationships. Inspired by this, we propose PromptASR, a framework that integrates prompts into end-to-end automatic speech recognition (E2E ASR) systems to achieve contextualized ASR with a controllable transcription style. Specifically, a dedicated text encoder encodes the text prompts, and the resulting encodings are injected into the speech encoder through cross-attention between the features of the two modalities. When the ground-truth text of preceding utterances is used as the content prompt, the proposed system achieves relative word error rate reductions of 21.9% and 6.8% on a book reading dataset and an in-house dataset, respectively, compared to a baseline ASR system. The system can also take word-level biasing lists as prompts to improve recognition accuracy on rare words. An additional style prompt can be given to the text encoder to guide the ASR system to output different styles of transcription. The code is available in icefall.
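The injection step described in the abstract, where speech features cross-attend to the text prompt encodings, can be illustrated with a minimal sketch. This is not the PromptASR or icefall implementation: the single attention head, the residual connection, the random projection matrices, and the names `cross_attend`, `matvec`, and `random_matrix` are all assumptions made for illustration. A real system would use a tensor library and learned weights; plain Python is used here only to keep the example self-contained.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cross_attend(speech_frames, prompt_encodings, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention.

    Each speech frame forms a query; the prompt encodings supply keys and
    values. The attended prompt context is added back to the frame
    (residual connection, an assumption of this sketch).
    """
    dim = len(w_q)  # square projections: dim x dim
    keys = [matvec(w_k, p) for p in prompt_encodings]
    values = [matvec(w_v, p) for p in prompt_encodings]
    fused, weights = [], []
    for frame in speech_frames:
        query = matvec(w_q, frame)
        # Scaled dot-product score against every prompt token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        attn = softmax(scores)
        # Weighted sum of the prompt values.
        context = [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]
        fused.append([f + c for f, c in zip(frame, context)])
        weights.append(attn)
    return fused, weights


rng = random.Random(0)
DIM = 4


def random_matrix(rows, cols):
    """Random stand-in for learned weights or encoder outputs."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


speech = random_matrix(6, DIM)  # 6 acoustic frames from the speech encoder
prompt = random_matrix(3, DIM)  # 3 prompt token encodings from the text encoder
w_q, w_k, w_v = (random_matrix(DIM, DIM) for _ in range(3))

fused, weights = cross_attend(speech, prompt, w_q, w_k, w_v)
print(len(fused), len(fused[0]))  # 6 4
```

The output keeps the shape of the speech features, so under these assumptions the fused frames can be passed to the next speech encoder layer unchanged. Each row of `weights` sums to one and shows how strongly a given frame attends to each prompt token.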