Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships from images for visual understanding. Although recent works have made steady progress on SGG, they still suffer from the long-tail distribution problem: tail predicates are more costly to train and harder to distinguish because they have far less annotated data than frequent predicates. Existing re-balancing strategies try to handle this via prior rules but remain confined to pre-defined conditions, which do not scale across models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, in which a visually-prompted language model learns to generate diverse fine-grained predicates in a low-resource way. CaCao can be applied in a plug-and-play fashion and automatically strengthens existing SGG models against the long-tail problem. Building on CaCao, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), which lets models generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, Epic achieves competitive performance on open-world predicate prediction.