Interactions between humans are diverse and context-dependent, but previous work has treated them as categorical, disregarding the heavy tail of possible interactions. We propose a new paradigm of learning human-human interactions as free text from a single still image, allowing for flexibility in modeling the unlimited space of situations and relationships between people. To overcome the absence of data labelled specifically for this task, we use knowledge distillation applied to synthetic caption data produced by a large language model without explicit supervision. We show that the pseudo-labels produced by this procedure can be used to train a captioning model to effectively understand human-human interactions in images, as measured by a variety of metrics that assess the textual and semantic faithfulness and the factual groundedness of our predictions. We further show that our approach outperforms state-of-the-art (SOTA) image captioning and situation recognition models on this task. We will release our code and pseudo-labels along with Waldo and Wenda, a manually curated test set for still image human-human interaction understanding.