Multi-modal large language models have garnered significant interest recently. However, most existing work focuses on vision-language models, which show strong capabilities in following vision-and-language instructions. We argue that speech is also an important modality through which humans interact with the world. Hence, it is crucial for a general-purpose assistant to be able to follow multi-modal speech-and-language instructions. In this work, we propose the Large Language and Speech Model (LLaSM), an end-to-end trained large multi-modal speech-language model with cross-modal conversational abilities, capable of following speech-and-language instructions. Our early experiments suggest that LLaSM offers a more convenient and natural way for humans to interact with artificial intelligence. We also release a large speech instruction-following dataset, LLaSM-Audio-Instructions. Code and demo are available at https://github.com/LinkSoul-AI/LLaSM and https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions dataset is available at https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.