Instead of directly predicting pixel-level segmentation masks, this work formulates referring image segmentation as sequential polygon generation; the predicted polygons can later be converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input and autoregressively outputs a sequence of polygon vertices. For more accurate geometric localization, we propose a regression-based decoder that predicts precise floating-point coordinates directly, avoiding coordinate quantization error. In experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving a competitive 61.5% J&F on the Ref-DAVIS17 dataset.
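To make the polygon-to-mask conversion and the notion of coordinate quantization error concrete, the following is a minimal illustrative sketch in pure Python. It is not PolyFormer's implementation: the function names `polygon_to_mask` and `quantize_coord`, the even-odd rasterization rule, the pixel-center convention, and the bin-center quantization scheme are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel units, floating point


def polygon_to_mask(vertices: Sequence[Point], height: int, width: int) -> List[List[int]]:
    """Rasterize a closed polygon into a binary mask (hypothetical helper).

    Assumption: a pixel is foreground when its center (col + 0.5, row + 0.5)
    lies inside the polygon under the even-odd rule.
    """
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    for row in range(height):
        py = row + 0.5
        for col in range(width):
            px = col + 0.5
            inside = False
            for i in range(n):
                x1, y1 = vertices[i]
                x2, y2 = vertices[(i + 1) % n]
                # Count edges crossed by a horizontal ray cast to the right of (px, py).
                if (y1 > py) != (y2 > py):
                    x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                    if px < x_cross:
                        inside = not inside
            mask[row][col] = int(inside)
    return mask


def quantize_coord(value: float, bins: int, size: float) -> float:
    """Snap a coordinate to the center of one of `bins` equal-width bins over [0, size).

    Assumption: this stands in for a classification-style decoder that can only
    emit discrete coordinate tokens; the returned value differs from `value`
    by up to half a bin width.
    """
    step = size / bins
    index = min(int(value / step), bins - 1)
    return (index + 0.5) * step


if __name__ == "__main__":
    # An axis-aligned rectangle with floating-point vertices on a 6x6 grid.
    rectangle = [(1.0, 1.0), (5.0, 1.0), (5.0, 4.0), (1.0, 4.0)]
    for mask_row in polygon_to_mask(rectangle, height=6, width=6):
        print("".join(str(v) for v in mask_row))
    # Quantizing x = 2.6 with 4 bins over an 8-pixel extent loses precision.
    print(abs(quantize_coord(2.6, bins=4, size=8.0) - 2.6))
```

In this sketch, a regression-style output would keep a vertex coordinate such as 2.6 unchanged, whereas the quantized version is limited to bin centers; the gap between the two is the kind of quantization error the abstract refers to.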