In this paper, we introduce LLaVA-$\phi$ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small language model Phi-2 to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even smaller language models, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-grounded perception. Beyond its strong performance on multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated levels of understanding and interaction while maintaining greater resource efficiency. The project is available at https://github.com/zhuyiche/llava-phi.