Panoptic Narrative Detection (PND) and Panoptic Narrative Segmentation (PNS) are two challenging tasks that require identifying and locating multiple targets in an image according to a long narrative description. In this paper, we propose NICE, a unified and effective framework that jointly learns these two panoptic narrative recognition tasks. Existing visual grounding methods use a two-branch paradigm, but applying it directly to PND and PNS can cause prediction conflicts because of the tasks' intrinsic many-to-many alignment property. To address this, we introduce two cascading modules built on the barycenter of the mask: Coordinate Guided Aggregation (CGA), responsible for segmentation, and Barycenter Driven Localization (BDL), responsible for detection. By linking PNS and PND in series with the barycenter of the segmentation as the anchor, our approach naturally aligns the two tasks and lets them complement each other. Specifically, CGA provides the barycenter as a reference for detection, reducing BDL's reliance on a large number of candidate boxes. In turn, BDL exploits the barycenter's ability to distinguish different instances, which improves CGA's segmentation performance. Extensive experiments show that NICE surpasses all existing methods by a large margin, improving on the state of the art by 4.1% on PND and 2.9% on PNS. These results validate the effectiveness of the proposed collaborative learning strategy. The project is publicly available at https://github.com/Mr-Neko/NICE.
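To make the notion of a mask barycenter concrete, the following is a minimal NumPy sketch, not the CGA or BDL module itself. It assumes a binary mask stored as a 2-D array and shows how the barycenter (the mean pixel coordinate of the mask) and the tight enclosing box can be read off a segmentation result. The function names `mask_barycenter` and `mask_to_box` are hypothetical and do not come from the NICE repository.

```python
import numpy as np


def mask_barycenter(mask):
    """Return the (x, y) barycenter of a binary mask, or None if it is empty.

    Assumption: `mask` is a 2-D array whose nonzero entries mark the object.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    # The barycenter is the mean coordinate of all foreground pixels.
    return float(xs.mean()), float(ys.mean())


def mask_to_box(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around a binary mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Toy example: a 3x3 square of foreground pixels in a 5x5 image.
toy_mask = np.zeros((5, 5), dtype=np.uint8)
toy_mask[1:4, 2:5] = 1
center = mask_barycenter(toy_mask)
box = mask_to_box(toy_mask)
```

In this toy case the barycenter falls inside the box, which gives an intuition for why a single point derived from the segmentation can anchor detection in place of a large pool of candidate boxes.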