Humans can reason about the future from a sparse collection of visual cues acquired over time. To emulate this ability, we introduce a novel task called Anticipation Captioning, which generates a caption for an unseen oracle image from a sparse, temporally ordered set of images. To tackle this new task, we propose a model called A-CAP, which incorporates commonsense knowledge into a pre-trained vision-language model, allowing it to anticipate the caption. In both qualitative and quantitative evaluations on a customized visual storytelling dataset, A-CAP outperforms other image captioning methods and establishes a strong baseline for anticipation captioning. We also address the challenges inherent in this task.