The recent surge in open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most MLLM frameworks treat vision as the primary input modality and offer limited in-depth support for the speech, audio, and music modalities. This gap hinders the development of audio-language models and forces researchers to spend considerable effort on code development and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework for training customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. It also includes detailed training and inference recipes for mainstream tasks, together with high-performance checkpoints for tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes reach or approach state-of-the-art performance, and several of the associated techniques have been published as academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually advancing audio-based MLLMs through this open-source framework, and we call on the community to contribute to LLM-based speech, audio, and music processing.
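To make the modular encoder-projector-LLM design concrete, the following is a minimal, hypothetical sketch of how such a composition might look. All class names, method signatures, and the frame-stacking projector are illustrative assumptions for exposition, not the actual SLAM-LLM API.

```python
# Hypothetical sketch of the modular encoder-projector-LLM composition
# described in the abstract. Names and signatures are illustrative
# assumptions, not the actual SLAM-LLM API.
import torch
import torch.nn as nn


class LinearProjector(nn.Module):
    """Maps encoder features into the LLM embedding space.

    Stacks neighboring frames (one common design for LLM-based ASR) to
    shorten the audio sequence before projecting to the LLM dimension.
    """

    def __init__(self, encoder_dim: int, llm_dim: int, downsample: int = 5):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Linear(encoder_dim * downsample, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, encoder_dim)
        b, t, d = feats.shape
        t = t - t % self.downsample  # drop trailing frames that do not fill a stack
        feats = feats[:, :t, :].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(feats)


class SpeechLLM(nn.Module):
    """Composable wrapper: any audio encoder + projector + LLM."""

    def __init__(self, encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.encoder, self.projector, self.llm = encoder, projector, llm
        # A typical recipe freezes the encoder and LLM and trains only the
        # projector, optionally adding PEFT adapters (e.g., LoRA) to the LLM.
        for p in self.encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, audio: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        speech_embeds = self.projector(self.encoder(audio))
        # Prepend projected speech tokens to the text prompt embeddings;
        # the LLM is assumed to accept input embeddings directly.
        inputs = torch.cat([speech_embeds, text_embeds], dim=1)
        return self.llm(inputs)
```

In this pattern, swapping the encoder, projector, or LLM, or attaching a parameter-efficient fine-tuning plugin, changes only the components passed to the wrapper, which is the kind of modularity the framework's configuration is described as exposing.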