Interactions between humans are diverse and context-dependent, but previous work has treated them as categorical, disregarding the heavy tail of possible interactions. We propose a new paradigm of learning human-human interactions as free text from a single still image, allowing for flexibility in modeling the unlimited space of situations and relationships between people. To overcome the absence of data labelled specifically for this task, we apply knowledge distillation to synthetic caption data produced by a large language model without explicit supervision. We show that the pseudo-labels produced by this procedure can be used to train a captioning model to effectively understand human-human interactions in images, as measured by a variety of metrics that assess the textual and semantic faithfulness and factual groundedness of our predictions. We further show that our approach outperforms state-of-the-art (SOTA) image captioning and situation recognition models on this task. We will release our code and pseudo-labels along with Waldo and Wenda, a manually curated test set for still image human-human interaction understanding.