We introduce ECHo (Event Causality Inference via Human-Centric Reasoning), a diagnostic dataset for event causality inference grounded in visio-linguistic social scenarios. ECHo draws on real-world, human-centric deductive information from a television crime drama. It requires Theory-of-Mind (ToM) ability to understand and reason about social interactions from multimodal information. Using ECHo, we propose a unified Chain-of-Thought (CoT) framework to assess the reasoning capability of current AI systems. Our ToM-enhanced CoT pipeline accommodates various large foundation models in both zero-shot and few-shot visio-linguistic reasoning. We use this framework to scrutinize recent large foundation models such as InstructGPT and MiniGPT-4 on three diagnostic human-centric tasks. Further analysis shows that ECHo is a challenging dataset that exposes imperfections and inconsistencies in their reasoning. Our data and code are publicly available at https://github.com/YuxiXie/ECHo.