Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech understanding capabilities. However, most speech LLMs are trained on single-channel, single-talker data, which makes it challenging to apply them directly to multi-talker and multi-channel speech understanding tasks. In this work, we present a comprehensive investigation of how to enable directional multi-talker speech understanding capabilities for LLMs, specifically in the smart glasses use case. We propose two novel approaches to integrate directivity into LLMs: (1) a cascaded system that leverages a source separation front-end module, and (2) an end-to-end system that utilizes serialized output training. Both approaches utilize a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner. Experimental results demonstrate the efficacy of our proposed methods in endowing LLMs with directional speech understanding capabilities, achieving strong performance in both speech recognition and speech translation tasks.