Although recent large multimodal models (LMMs) show impressive progress on vision-language tasks, their alignment with human-centered (HC) principles such as fairness, ethics, inclusivity, empathy, and robustness is often overlooked. Existing LMM benchmarks are largely accuracy-centric. We present HumaniBench, a unified framework for characterizing HC alignment across realistic, socially grounded visual contexts. It contains 32,000 expert-verified image-question pairs drawn from real-world news imagery, each mapped to one or more HC principles through explicit metrics. A comparison of 15 state-of-the-art LMMs reveals consistent trade-offs: proprietary systems lead on ethics, reasoning, and empathy, while open-source models show superior visual grounding and resilience. All models show persistent gaps in fairness and multilingual inclusivity. Chain-of-thought prompting and test-time scaling yield gains of 8 to 12% on several HC dimensions. HumaniBench enables fine-grained analysis of alignment trade-offs that conventional multimodal benchmarks do not capture. https://vectorinstitute.github.io/humanibench/