Interactions between humans are diverse and context-dependent, but previous work has treated them as categorical, disregarding the heavy tail of possible interactions. We propose a new paradigm of learning human-human interactions as free text from a single still image, allowing for flexibility in modeling the unlimited space of situations and relationships between people. To overcome the absence of data labelled specifically for this task, we use knowledge distillation applied to synthetic caption data produced by a large language model without explicit supervision. We show that the pseudo-labels produced by this procedure can be used to train a captioning model to effectively understand human-human interactions in images, as measured by a variety of metrics that assess the textual and semantic faithfulness and factual groundedness of our predictions. We further show that our approach outperforms state-of-the-art (SOTA) image captioning and situation recognition models on this task. We will release our code and pseudo-labels along with Waldo and Wenda, a manually curated test set for still image human-human interaction understanding.