We propose Subject-Conditional Relation Detection (SCoRD): conditioned on an input subject, the goal is to predict all of its relations to other objects in a scene, along with the locations of those objects. Based on the Open Images dataset, we propose a challenging benchmark, OIv6-SCoRD, whose training and testing splits exhibit a distribution shift in the occurrence statistics of $\langle$subject, relation, object$\rangle$ triplets. To solve this problem, we propose an auto-regressive model that, given a subject, predicts its relations, objects, and object locations by casting this output as a sequence of tokens. First, we show that, on this benchmark, previous scene-graph prediction methods fail to produce as exhaustive an enumeration of relation-object pairs when conditioned on a subject. In particular, we obtain a recall@3 of 83.8% for our relation-object predictions, compared to the 49.75% obtained by a recent scene graph detector. Second, we show improved generalization on both relation-object and object-box predictions by leveraging, during training, relation-object pairs that are obtained automatically from textual captions and have no object-box annotations. In particular, for $\langle$subject, relation, object$\rangle$ triplets for which no object locations are available during training, we obtain a recall@3 of 33.80% for relation-object pairs and 26.75% for their box locations.
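As a rough illustration of the two ideas above, the following sketch shows one way a relation, an object, and its box could be written out as a token sequence, and one way a recall@k score over relation-object pairs could be computed. Everything here is an assumption made for illustration: the abstract does not specify the token vocabulary, the number of coordinate bins (`NUM_BINS`), the `<loc_i>` token format, or the exact recall@3 definition, and the helper names `quantize_box`, `serialize`, and `recall_at_k` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed number of discrete coordinate bins; not stated in the abstract.
NUM_BINS = 1000

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Pair = Tuple[str, str]                   # (relation, object)


def quantize_box(box: Box, width: float, height: float,
                 num_bins: int = NUM_BINS) -> List[int]:
    """Map a pixel-space box to integer bin indices in [0, num_bins - 1]."""
    x1, y1, x2, y2 = box

    def to_bin(value: float, size: float) -> int:
        # Normalize to [0, 1], clamp, then scale to the bin range.
        normalized = min(max(value / size, 0.0), 1.0)
        return int(round(normalized * (num_bins - 1)))

    return [to_bin(x1, width), to_bin(y1, height),
            to_bin(x2, width), to_bin(y2, height)]


def serialize(relation: str, obj: str, box: Box,
              width: float, height: float) -> List[str]:
    """Write one prediction as tokens: relation, object, four location tokens."""
    coords = quantize_box(box, width, height)
    return [relation, obj] + [f"<loc_{c}>" for c in coords]


def recall_at_k(ranked_predictions: Sequence[Pair],
                ground_truth: Sequence[Pair], k: int = 3) -> float:
    """Fraction of ground-truth pairs that appear among the top-k predictions.

    This is one common reading of recall@k; the paper may define it differently.
    """
    if not ground_truth:
        return 0.0
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for pair in ground_truth if pair in top_k)
    return hits / len(ground_truth)


if __name__ == "__main__":
    tokens = serialize("holds", "cup", (0, 0, 100, 25), width=100, height=100)
    print(tokens)

    predictions = [("holds", "cup"), ("wears", "hat"),
                   ("on", "chair"), ("near", "dog")]
    targets = [("holds", "cup"), ("near", "dog")]
    print(recall_at_k(predictions, targets, k=3))
```

Writing box coordinates as discrete tokens lets a single auto-regressive decoder emit relations, object names, and locations in one sequence. Under this scheme, a relation-object pair that has no box annotation would simply be a shorter sequence without the location tokens.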