In this paper, we introduce LLaVA-$\phi$ (LLaVA-Phi), an efficient multi-modal assistant that uses the recently introduced small language model Phi-2 to support multi-modal dialogue. LLaVA-Phi marks a notable advance among compact multi-modal models. It demonstrates that even small language models, with as few as 2.7B parameters, can engage effectively in intricate dialogues that integrate textual and visual elements, provided they are trained on high-quality corpora. Our model delivers commendable performance on publicly available benchmarks covering visual comprehension, reasoning, and knowledge-based perception. Beyond its performance on multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and in systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated levels of understanding and interaction while maintaining greater resource efficiency. The project is available at https://github.com/zhuyiche/llava-phi.