Panoptic Narrative Detection (PND) and Panoptic Narrative Segmentation (PNS) are two challenging tasks that require identifying and locating multiple targets in an image according to a long narrative description. In this paper, we propose NICE, a unified and effective framework that jointly learns these two panoptic narrative recognition tasks. Existing visual grounding methods use a two-branch paradigm, but applying it directly to PND and PNS can cause prediction conflicts because of the tasks' intrinsic many-to-many alignment property. To address this, we introduce two cascading modules based on the barycenter of the mask: Coordinate Guided Aggregation (CGA), responsible for segmentation, and Barycenter Driven Localization (BDL), responsible for detection. By linking PNS and PND in series with the barycenter of the segmentation as the anchor, our approach naturally aligns the two tasks and allows them to complement each other for improved performance. Specifically, CGA provides the barycenter as a reference for detection, reducing BDL's reliance on a large number of candidate boxes. In turn, BDL leverages its excellent properties to distinguish different instances, which improves the segmentation performance of CGA. Extensive experiments demonstrate that NICE surpasses all existing methods by a large margin, improving over the state of the art by 4.1% on PND and 2.9% on PNS. These results validate the effectiveness of our proposed collaborative learning strategy. The project is publicly available at https://github.com/Mr-Neko/NICE.
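To make the notion of a mask barycenter concrete, the following is a minimal sketch of how the barycenter of a binary segmentation mask could be computed as the mean coordinate of its foreground pixels. It is an illustration only and is not taken from the NICE code base; the function name `mask_barycenter`, the list-of-lists mask representation, and the (x, y) ordering are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a binary mask as rows of 0/1 values.
Mask = List[List[int]]


def mask_barycenter(mask: Mask) -> Optional[Tuple[float, float]]:
    """Return the (x, y) barycenter of the nonzero pixels, or None if the mask is empty."""
    count = 0
    sum_x = 0
    sum_y = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:  # foreground pixel
                count += 1
                sum_x += x
                sum_y += y
    if count == 0:
        # An empty mask has no barycenter; the caller must handle this case.
        return None
    return sum_x / count, sum_y / count


example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
print(mask_barycenter(example_mask))  # (1.5, 1.5)
```

In a cascade of the kind described above, a point like this could serve as the anchor from which a detection module localizes the corresponding box; how NICE actually predicts and uses the barycenter is specified in the paper and repository, not by this sketch.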