RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, scaling RLIPv1 is challenging, hindered by the slow convergence of the RLIPv1 architecture and the limited availability of existing scene graph data. In this paper, we propose RLIPv2, a fast-converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a specially designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuned, few-shot, and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1% of the data, and 45.09 mAP with 100% of the data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.
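
To make the pseudo-labelling data flow concrete, the following is a minimal sketch of how caption-derived relation texts could be matched to region pairs. It is not the Relation Tagger itself: the abstract does not specify how the tagger works internally, and this sketch substitutes plain label matching as a stand-in. The names `Region`, `RelationText`, and `candidate_assignments` are hypothetical and do not come from the RLIPv2 codebase; the sketch also assumes the captions have already been parsed into (subject, predicate, object) triplets.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Region:
    """A detected object (hypothetical structure, not the RLIPv2 data format)."""
    region_id: int
    label: str


@dataclass(frozen=True)
class RelationText:
    """A (subject, predicate, object) triplet assumed to be parsed from a caption."""
    subject: str
    predicate: str
    obj: str


def candidate_assignments(regions, relation_texts):
    """Pair each relation text with every ordered region pair whose labels match.

    This is plain label matching, used only as a stand-in for illustration;
    it cannot tell which of several matching pairs the caption refers to.
    """
    assignments = []
    for subj, obj in permutations(regions, 2):  # ordered pairs, no self-pairs
        for rel in relation_texts:
            if subj.label == rel.subject and obj.label == rel.obj:
                assignments.append((subj.region_id, rel.predicate, obj.region_id))
    return assignments


# Toy image: two people and one horse, with one caption-derived relation.
regions = [Region(0, "person"), Region(1, "horse"), Region(2, "person")]
relation_texts = [RelationText("person", "ride", "horse")]
pseudo_labels = candidate_assignments(regions, relation_texts)
print(pseudo_labels)
```

In this toy case both people match the subject label, so label matching alone yields two candidate pairs for a single caption phrase. Resolving that kind of ambiguity, deciding which region pair a relation text actually describes, is the assignment step that the abstract attributes to the Relation Tagger.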
