Recent advances in large language models (LLMs) have demonstrated human-expert-level capabilities, driving significant interest in their potential for achieving artificial general intelligence (AGI). In particular, there is growing momentum in adapting LLMs to other modalities, including vision, video, and speech, through the development of multimodal LLMs (MLLMs). However, existing speech-language model (SLM) research has largely overlooked cost-effective strategies for adapting LLMs to the speech domain. In this paper, we propose FastSLM, a lightweight yet effective SLM designed for understanding and reasoning over long-form speech. To address the challenge of aligning high-frame-rate speech features with LLMs, we introduce the Hierarchical Frame Querying Transformer (HFQ-Former), which compresses frame-level speech features while capturing both local and global context. Furthermore, we present a novel three-stage training strategy that enhances generalization across a wide range of speech-related tasks. Experimental results show that FastSLM achieves performance competitive with state-of-the-art models while requiring significantly fewer FLOPs and parameters and representing speech with only 1.67 tokens per second. The source code and model checkpoints are available at https://huggingface.co/okestro-ai-lab/FastSLM.
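The abstract reports a token rate of 1.67 tokens per second but does not spell out HFQ-Former's internals. As a rough illustration only, the following is a minimal PyTorch sketch of one plausible query-based, two-level (local-window then global) compression scheme; the module names, window size, query counts, and frame rate below are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of hierarchical query-based frame compression.
# NOTE: this is NOT the HFQ-Former from the paper; all sizes and names
# here are hypothetical choices made only to demonstrate the idea.
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """Compress a frame sequence into a fixed set of learned query tokens."""
    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim) -> (batch, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(frames.size(0), -1, -1)
        out, _ = self.attn(q, frames, frames)  # queries cross-attend to frames
        return self.norm(out)

class HierarchicalCompressor(nn.Module):
    """Two stages: compress local windows, then summarize them globally."""
    def __init__(self, dim: int, window: int, local_q: int, global_q: int):
        super().__init__()
        self.window = window
        self.local = QueryCompressor(dim, local_q)     # local context
        self.global_ = QueryCompressor(dim, global_q)  # global context

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        b, t, d = frames.shape
        # Pad so the frame sequence splits into whole windows.
        pad = (-t) % self.window
        if pad:
            frames = torch.cat([frames, frames.new_zeros(b, pad, d)], dim=1)
        # Compress each local window independently.
        win = frames.view(b * (frames.size(1) // self.window), self.window, d)
        local = self.local(win).reshape(b, -1, d)
        # Summarize the concatenated local tokens into a global token set.
        return self.global_(local)

# Example: 30 s of (assumed) 50 Hz encoder features = 1500 frames,
# reduced to 50 tokens, i.e. ~1.67 tokens per second.
x = torch.randn(1, 1500, 768)
model = HierarchicalCompressor(dim=768, window=150, local_q=5, global_q=50)
print(model(x).shape)  # torch.Size([1, 50, 768])
```

With these illustrative sizes, 30 seconds of 50 Hz features (1500 frames) reduce to 50 tokens, which works out to roughly 1.67 tokens per second, matching the rate reported in the abstract.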