We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.