Panoptic Scene Graph Generation (PSG) aims to generate a comprehensive graph-structured representation of a scene based on panoptic segmentation masks. Despite remarkable progress in PSG, almost all existing methods neglect shape-aware features, which capture the contours and boundaries of objects. To bridge this gap, we propose a model-agnostic Curricular shApe-aware FEature (CAFE) learning strategy for PSG. Specifically, we incorporate shape-aware features (i.e., mask features and boundary features) into PSG, moving beyond sole reliance on bounding-box (bbox) features. Furthermore, drawing inspiration from human cognition, we integrate shape-aware features in an easy-to-hard manner: we categorize the predicates into three groups by cognitive learning difficulty and correspondingly divide training into three stages. Each stage uses a specialized relation classifier to distinguish specific groups of predicates. As the learning difficulty of the predicates increases, these classifiers are equipped with features of increasing complexity. We also use knowledge distillation to retain knowledge acquired in earlier stages. Because it is model-agnostic, CAFE can be incorporated into any PSG model. Extensive experiments and ablations on two PSG tasks, under both robust and zero-shot settings, attest to the superiority and robustness of CAFE, which outperforms existing state-of-the-art methods by a large margin.
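To make the three-stage curriculum with knowledge distillation more concrete, the following is a minimal sketch in plain Python, not the CAFE implementation. Everything named here is an assumption for illustration: the `STAGES` table, the assignment of predicates to groups, the choice that each stage's label space accumulates the earlier groups, the KL-divergence form of the distillation term, and the `temperature` and `kd_weight` parameters. The abstract states only that there are three predicate groups, three stages, features of increasing complexity, and knowledge distillation.

```python
import math

# Hypothetical curriculum: the predicate groupings below are illustrative only.
# Each later stage adds a harder predicate group and a richer feature set.
STAGES = [
    {"predicates": ["on", "beside"], "features": ["bbox"]},
    {"predicates": ["holding", "riding"], "features": ["bbox", "mask"]},
    {"predicates": ["kissing", "feeding"], "features": ["bbox", "mask", "boundary"]},
]


def label_space(stage_index):
    """Predicates seen up to and including a stage (assumed to accumulate)."""
    return [p for stage in STAGES[: stage_index + 1] for p in stage["predicates"]]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Standard classification loss for one relation sample."""
    return -math.log(softmax(logits)[target_index])


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def stage_loss(student_logits, target_index, teacher_logits=None, kd_weight=0.5):
    """Classification loss, plus a distillation term when an earlier-stage
    classifier (the teacher) is available.

    The teacher covers only the predicates of earlier stages, so the student's
    logits are truncated to that prefix before the two are compared.
    """
    loss = cross_entropy(student_logits, target_index)
    if teacher_logits is not None:
        n_old = len(teacher_logits)
        loss += kd_weight * distillation_loss(student_logits[:n_old], teacher_logits)
    return loss
```

In this sketch, the first stage would call `stage_loss` without a teacher, while later stages would pass the logits of the frozen classifier from the preceding stage, so that predictions on the easier predicates are penalized for drifting.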