Multi-modal large language models have attracted significant interest recently. However, most existing work focuses on vision-language models, which show strong capabilities in following vision-and-language instructions. We argue that speech is also an important modality through which humans interact with the world, so a general-purpose assistant should be able to follow multi-modal speech-and-language instructions. In this work, we propose the Large Language and Speech Model (LLaSM), an end-to-end trained large multi-modal speech-language model with cross-modal conversational abilities that can follow speech-and-language instructions. Our early experiments suggest that LLaSM offers a more convenient and natural way for humans to interact with artificial intelligence. We also release LLaSM-Audio-Instructions, a large speech instruction-following dataset. Code and demo are available at https://github.com/LinkSoul-AI/LLaSM and https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions dataset is available at https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.