In this paper, we introduce LLaVA-$\phi$ (LLaVA-Phi), an efficient multi-modal assistant that uses the recent small language model Phi-2 to support multi-modal dialogue. LLaVA-Phi marks a notable advance among compact multi-modal models. It demonstrates that even a small language model, with as few as 2.7B parameters, can engage effectively in intricate dialogues that integrate textual and visual elements, provided it is trained on high-quality corpora. Our model delivers commendable performance on publicly available benchmarks covering visual comprehension, reasoning, and knowledge-based perception. Beyond its performance on multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and in systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated understanding and interaction while remaining more resource-efficient. The project is available at https://github.com/zhuyiche/llava-phi.