LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and activates the relevant tools based on user inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions of these skills. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction session, which significantly improves tool-use performance and enables new scenarios.
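The skill-repository idea described above can be illustrated with a minimal Python sketch. Everything in it is a hypothetical illustration rather than the LLaVA-Plus implementation: the names `SkillRepository`, `Session`, and `build_demo_repo` are invented here, the skills are stub functions instead of pre-trained models, and simple keyword matching stands in for the tool selection that LLaVA-Plus learns from instruction-following data. The sketch only shows the overall shape: a registry of tools, selection driven by the user input, and an image that stays attached to the session across turns.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical skill signature: (image reference, user text) -> text result.
Skill = Callable[[str, str], str]


@dataclass
class SkillRepository:
    """Toy registry of tools; real skills would wrap pre-trained models."""
    skills: Dict[str, Skill] = field(default_factory=dict)
    triggers: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, name: str, trigger_words: List[str], fn: Skill) -> None:
        self.skills[name] = fn
        self.triggers[name] = [w.lower() for w in trigger_words]

    def select(self, user_input: str) -> List[str]:
        # Assumption: keyword matching replaces the learned tool selection.
        text = user_input.lower()
        return [name for name, words in self.triggers.items()
                if any(w in text for w in words)]


@dataclass
class Session:
    """Holds one image for the whole interaction, plus the reply history."""
    image: str
    repo: SkillRepository
    history: List[str] = field(default_factory=list)

    def ask(self, user_input: str) -> str:
        selected = self.repo.select(user_input)
        # Every selected skill receives the same session image.
        outputs = [f"{name}: {self.repo.skills[name](self.image, user_input)}"
                   for name in selected]
        reply = "; ".join(outputs) if outputs else "no tool needed"
        self.history.append(reply)
        return reply


def build_demo_repo() -> SkillRepository:
    repo = SkillRepository()
    # Stub skills that only echo the image reference they were given.
    repo.register("detect", ["find", "detect"],
                  lambda image, text: f"boxes for {image}")
    repo.register("caption", ["describe", "caption"],
                  lambda image, text: f"caption for {image}")
    return repo


if __name__ == "__main__":
    session = Session("cat.jpg", build_demo_repo())
    print(session.ask("Please detect the cat"))
    print(session.ask("Describe and find objects"))
```

When a request matches several skills, the sketch runs each of them on the same image and joins the results, which is a loose analogue of the skill compositions mentioned above.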