This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most research on large vision-language models (VLMs) focuses on physical-level skills, our work is the first to examine their multimodal capability to leverage human values when making decisions about visually depicted situations. VIVA contains 1,062 images of diverse real-world situations, together with manually annotated decisions grounded in them. Given an image, a model should select the most appropriate action to address the situation and articulate the relevant human values and reasoning underlying its decision. Extensive experiments on VIVA reveal the limitations of current VLMs in using human values for multimodal decision-making. Further analyses indicate the potential benefits of incorporating action consequences and predicted human values.