We present a novel self-supervised approach to representation learning, targeted at the task of Visual Relationship Detection (VRD). Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR), a variation of MIM in which a fraction of the entities/objects within a scene are masked and subsequently reconstructed from the unmasked objects. The core idea is that, through object-level masked modeling, the network learns context-aware representations that capture the interactions of objects within a scene and are therefore highly predictive of visual object relationships. We extensively evaluate the learned representations, both qualitatively and quantitatively, in a few-shot setting and demonstrate the efficacy of MBBR for learning robust visual representations tailored to VRD. The proposed method surpasses state-of-the-art VRD methods in the Predicate Detection (PredDet) evaluation setting while using only a few annotated samples. We make our code available at https://github.com/deeplab-ai/SelfSupervisedVRD.
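The object-level masking idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`mask_objects`, `reconstruct_from_context`, `reconstruction_loss`), the toy two-dimensional object features, the mean-of-visible-objects predictor, and the squared-error loss are all assumptions made for illustration. In an actual system the predictor would be a learned network operating on per-box visual features.

```python
import random


def mask_objects(num_objects, mask_ratio, rng):
    """Choose which objects in a scene to mask (hypothetical helper).

    At least one object is masked and at least one stays visible,
    so there is always context to reconstruct from.
    """
    if num_objects < 2:
        raise ValueError("need at least two objects so one can stay visible")
    num_masked = max(1, round(mask_ratio * num_objects))
    num_masked = min(num_masked, num_objects - 1)
    return sorted(rng.sample(range(num_objects), num_masked))


def reconstruct_from_context(features, masked):
    """Placeholder predictor: the mean feature of the visible objects.

    Assumption for illustration only; a learned model would go here.
    """
    masked_set = set(masked)
    visible = [f for i, f in enumerate(features) if i not in masked_set]
    dim = len(features[0])
    mean = [sum(f[d] for f in visible) / len(visible) for d in range(dim)]
    return {i: list(mean) for i in masked}


def reconstruction_loss(features, predictions):
    """Mean squared error computed over the masked objects only."""
    total, count = 0.0, 0
    for i, pred in predictions.items():
        for p, t in zip(pred, features[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy scene: four objects, each with a 2-d feature vector (made-up values).
rng = random.Random(0)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
masked = mask_objects(len(features), mask_ratio=0.5, rng=rng)
predictions = reconstruct_from_context(features, masked)
loss = reconstruction_loss(features, predictions)
```

The loss is taken only over the masked objects, mirroring the masked-modeling setup in which the visible objects serve purely as context.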