Interactions between humans are diverse and context-dependent, but previous works have treated them as categorical, disregarding the heavy tail of possible interactions. We propose a new paradigm of learning human-human interactions as free text from a single still image, allowing for flexibility in modeling the unlimited space of situations and relationships between people. To overcome the absence of data labelled specifically for this task, we use knowledge distillation applied to synthetic caption data produced by a large language model without explicit supervision. We show that the pseudo-labels produced by this procedure can be used to train a captioning model to effectively understand human-human interactions in images, as measured by a variety of metrics that assess the textual and semantic faithfulness and factual groundedness of our predictions. We further show that our approach outperforms state-of-the-art (SOTA) image captioning and situation recognition models on this task. We will release our code and pseudo-labels along with Waldo and Wenda, a manually curated test set for still image human-human interaction understanding.