Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh). For Multimodal Large Language Models (MLLMs), however, alignment with human values remains largely unexplored despite their commendable performance on perception and reasoning tasks, owing to the complexity of defining the hhh dimensions in the visual world and the difficulty of collecting relevant data that accurately mirrors real-world situations. To address this gap, we introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations. The Ch3Ef dataset contains 1002 human-annotated data samples, covering 12 domains and 46 tasks based on the hhh principle. We also present a unified evaluation strategy that supports assessment across various scenarios and from different perspectives. Based on the evaluation results, we summarize over 10 key findings that deepen the understanding of MLLM capabilities, limitations, and the dynamic relationships between evaluation levels, guiding future advancements in the field.