Vision-Language Models (VLMs) continue to struggle to make morally salient judgments in multimodal and socially ambiguous contexts. Prior work typically relies on binary or pairwise supervision, which often fails to capture the continuous and pluralistic nature of human moral reasoning. We present MM-SCALE (Multimodal Moral Scale), a large-scale dataset for aligning VLMs with human moral preferences through 5-point scalar ratings and explicit modality grounding. Each image-scenario pair is annotated by human raters with moral acceptability scores and grounded reasoning labels, using an interface we tailored for data collection, enabling listwise preference optimization over ranked scenario sets. By moving from discrete to scalar supervision, our framework provides richer alignment signals and finer calibration of multimodal moral reasoning. Experiments show that VLMs fine-tuned on MM-SCALE achieve higher ranking fidelity and more stable safety calibration than those trained with binary signals.
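The abstract does not specify which listwise objective is used, but one common choice for optimizing over ranked candidate sets is a Plackett-Luce (ListMLE-style) loss: candidates are ordered by their human scalar ratings, and the model is trained to assign scores that make that ordering likely. The sketch below is an illustrative assumption, not the paper's actual training objective; the function name and inputs are hypothetical.

```python
import math

def listmle_loss(scores, ratings):
    """Plackett-Luce listwise loss (illustrative sketch).

    `scores`  - model-assigned scalar scores, one per scenario.
    `ratings` - human 5-point acceptability ratings for the same scenarios.
    Orders candidates by human rating (descending), then returns the
    negative log-likelihood of that ranking under the model's scores.
    """
    # Sort model scores according to the human ranking (highest rating first).
    order = sorted(range(len(scores)), key=lambda i: -ratings[i])
    ordered = [scores[i] for i in order]
    loss = 0.0
    for k in range(len(ordered)):
        # log-sum-exp over the remaining candidates, shifted for stability.
        m = max(ordered[k:])
        lse = m + math.log(sum(math.exp(s - m) for s in ordered[k:]))
        loss += lse - ordered[k]
    return loss
```

A model whose scores agree with the human ordering incurs a lower loss than one whose scores invert it, which is the "richer alignment signal" scalar supervision makes available compared with a single binary label per scenario.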