In this work, we present a novel task: understanding unintentional human activities in videos. We formalize this problem as a reasoning task in a zero-shot scenario, where, given a video of an unintentional activity, we want to know why it transitioned from intentional to unintentional. We first evaluate the effectiveness of current state-of-the-art Large Multimodal Models on this reasoning task and observe that they suffer from hallucination. We further propose a novel prompting technique, termed Dream of Thoughts (DoT), which allows the model to navigate through hallucinated thoughts to achieve better reasoning. To evaluate performance on this task, we also introduce three specialized metrics designed to quantify the model's reasoning capability. We perform our experiments on two datasets, OOPs and UCF-Crimes, and our findings show that the DoT prompting technique outperforms standard prompting while minimizing hallucinations.