Recent advancements in large language models (LLMs) have revolutionized various domains, bringing significant progress and new opportunities. Despite progress in speech-related tasks, LLMs have not been sufficiently explored in multi-talker scenarios. In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target-talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and spoken keyword. Our approach utilizes WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. These representations are then fed into an LLM fine-tuned using LoRA, equipping it with speech comprehension and transcription capabilities. Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail-party scenarios, highlighting the potential of LLMs to handle speech-related tasks based on user instructions in such complex settings. The code, model, and samples are available at https://github.com/cuhealthybrains/MT-LLM.