Conventional end-to-end Automatic Speech Recognition (ASR) models focus primarily on exact transcription and lack the flexibility needed for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks -- ranging from transcript manipulation to summarization -- without relying on predefined command sets. Remarkably, our model, trained from scratch on LibriSpeech, interprets and executes simple instructions without requiring LLMs or pre-trained speech modules. It also offers selective transcription based on instructions such as "transcribe first half and then turn off listening," providing an additional layer of privacy and safety compared to existing LLMs. Our findings highlight the significant potential of instruction-following training to advance speech foundation models.