Detecting bias in multimodal news requires models that reason over text--image pairs, not merely classify text. In response, we present ViLBias, a VQA-style benchmark and framework for detecting and reasoning about bias in multimodal news. The dataset comprises 40,945 text--image pairs from diverse outlets, each annotated with a bias label and a concise rationale via a two-stage LLM-as-annotator pipeline with hierarchical majority voting and human-in-the-loop validation. We evaluate Small Language Models (SLMs), Large Language Models (LLMs), and Vision--Language Models (VLMs) on closed-ended classification and open-ended reasoning (oVQA), and compare parameter-efficient tuning strategies. Results show that pairing images with text improves detection accuracy by 3--5\%, and that LLMs/VLMs capture subtle framing and text--image inconsistencies better than SLMs. Parameter-efficient methods (LoRA/QLoRA/Adapters) recover 97--99\% of full fine-tuning performance while updating $<5\%$ of parameters. For oVQA, reasoning accuracy spans 52--79\% and faithfulness 68--89\%, both improved by instruction tuning; closed-ended accuracy correlates strongly with reasoning accuracy ($r = 0.91$). ViLBias offers a scalable benchmark and strong baselines for multimodal bias detection and rationale quality.
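The abstract describes the annotation pipeline only at a high level. Below is a minimal sketch of how hierarchical majority voting over two annotator stages might work; the function names, the escalation rule, and the tie-breaking threshold are illustrative assumptions, not the paper's exact procedure:

```python
from collections import Counter

def aggregate_label(stage1_votes, stage2_votes, threshold=0.5):
    """Hypothetical two-stage aggregation: stage-1 LLM votes are tallied
    first; if no label clears the majority threshold, the stage-2 votes
    (e.g., from stronger models) decide. A remaining None would be routed
    to human-in-the-loop validation."""
    def majority(votes):
        (label, count), = Counter(votes).most_common(1)
        return label if count / len(votes) > threshold else None

    label = majority(stage1_votes)
    if label is None:  # stage 1 inconclusive -> escalate to stage 2
        label = majority(stage2_votes)
    return label

# Example: stage-1 annotators split evenly, so stage 2 breaks the tie.
print(aggregate_label(["biased", "unbiased"],
                      ["biased", "biased", "unbiased"]))  # -> "biased"
```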
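For the parameter-efficient tuning comparison, the sketch below shows one way to attach LoRA adapters to a text classifier with the Hugging Face `peft` library; the base model and hyperparameters here are assumptions for illustration, not the paper's configuration:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Assumed base model and label set (biased vs. unbiased); swap in the
# actual backbone and task head used in the experiments.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,   # assumed hyperparameters
    target_modules=["query", "value"],       # attention projections in RoBERTa
    task_type="SEQ_CLS")
model = get_peft_model(model, config)
model.print_trainable_parameters()           # reports the trainable fraction
```

The `print_trainable_parameters()` call reports the share of weights that are actually updated, which is the quantity behind the $<5\%$ figure cited above.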