In this paper, we propose a new visual reasoning task called Visual Transformation Telling (VTT). Given a series of states (i.e., images), the task requires a machine to describe the transformation that occurred between every two adjacent states. Unlike most existing visual reasoning tasks, which focus on state reasoning, VTT emphasizes transformation reasoning. We collected 13,547 samples from two instructional video datasets, CrossTask and COIN, and extracted the desired states and transformation descriptions to build a VTT benchmark dataset. Drawing on life experience, humans can naturally reason from superficial state differences (e.g., wet ground) to transformation descriptions (e.g., raining), but modeling this process to bridge the semantic gap is challenging. We designed TTNet on top of existing visual storytelling models by enhancing the model's state-difference sensitivity and transformation-context awareness. TTNet significantly outperforms baseline models adapted from similar tasks, such as visual storytelling and dense video captioning, demonstrating the effectiveness of our transformation modeling. Through comprehensive diagnostic analyses, we found that TTNet has strong context utilization abilities, but even with state-of-the-art techniques such as CLIP, challenges in generalization remain and need further exploration.
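To make the task's input and output shape concrete, the following is a minimal sketch, not the TTNet model. It assumes a series of N+1 states should yield N transformation descriptions, one per adjacent pair. The names `adjacent_pairs`, `tell_transformations`, and `describe` are hypothetical and introduced only for illustration. Describing each pair independently is a simplification, since the abstract stresses that context across the whole series matters.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

State = TypeVar("State")  # stands in for an image; any type works in this sketch


def adjacent_pairs(states: Sequence[State]) -> List[Tuple[State, State]]:
    """Return (before, after) pairs for every two adjacent states."""
    return list(zip(states, states[1:]))


def tell_transformations(
    states: Sequence[State],
    describe: Callable[[State, State], str],
) -> List[str]:
    """Produce one description per adjacent pair of states.

    `describe` is a hypothetical placeholder for a captioning model.
    It sees only one pair at a time, which ignores the
    transformation context that a real model would use.
    """
    return [describe(before, after) for before, after in adjacent_pairs(states)]


if __name__ == "__main__":
    # Toy states written as text labels instead of images.
    toy_states = ["dry ground", "wet ground", "puddles"]
    toy_describe = lambda before, after: f"{before} -> {after}"
    for line in tell_transformations(toy_states, toy_describe):
        print(line)
```

With three toy states the sketch yields two descriptions, matching the "every two adjacent states" formulation of the task.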