This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from a knowledge store enriched through additional web collection. To identify key information and resolve inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced from these reports. Finally, a Verdict Prediction agent issues the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu-humane/VILLAIN.