A serious issue that harms zero-shot visual recognition is objective misalignment, i.e., the learning objective prioritizes improving the recognition accuracy of seen classes rather than unseen classes, while the latter is the true target to pursue. This issue is more pronounced in zero-shot image segmentation because the stronger (i.e., pixel-level) supervision widens the gap between seen and unseen classes. To mitigate this issue, we propose a novel architecture named AlignZeg, which comprehensively improves the segmentation pipeline, including proposal extraction, classification, and correction, to better fit the goal of zero-shot segmentation. (1) Mutually-Refined Proposal Extraction. AlignZeg harnesses a mutual interaction between mask queries and visual features, facilitating detailed class-agnostic mask proposal extraction. (2) Generalization-Enhanced Proposal Classification. AlignZeg introduces synthetic data and incorporates multiple background prototypes to allocate a more generalizable feature space. (3) Predictive Bias Correction. During the inference stage, AlignZeg uses a class indicator to identify potential unseen-class proposals, followed by a prediction post-process that corrects the prediction bias. Experiments demonstrate that AlignZeg markedly enhances zero-shot semantic segmentation, as shown by an average 3.8% increase in hIoU, primarily attributed to a 7.1% improvement in identifying unseen classes, and we further validate that the improvement comes from alleviating the objective misalignment issue.
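The Predictive Bias Correction step can be sketched as follows. This is a minimal illustration under assumed names (`correct_prediction_bias`, a boolean per-proposal indicator, a boolean seen-class mask, and a fixed `penalty`), not the paper's actual implementation: the idea is simply to suppress seen-class scores on proposals that a class indicator flags as likely belonging to unseen classes.

```python
import numpy as np

def correct_prediction_bias(logits, is_unseen_proposal, seen_mask, penalty=0.3):
    """Suppress seen-class scores on proposals flagged as likely unseen.

    logits:             (num_proposals, num_classes) classification scores.
    is_unseen_proposal: (num_proposals,) bool, output of a class indicator
                        (hypothetical; the paper's indicator may differ).
    seen_mask:          (num_classes,) bool, True for seen classes.
    penalty:            assumed fixed bias-correction margin.
    """
    corrected = logits.copy()
    # Subtract the penalty only at (flagged proposal, seen class) positions,
    # nudging flagged proposals toward unseen-class predictions.
    corrected[np.ix_(is_unseen_proposal, seen_mask)] -= penalty
    return corrected

# Example: proposal 0 is flagged as unseen, class 0 is a seen class.
logits = np.array([[0.9, 0.2],
                   [0.5, 0.6]])
flag = np.array([True, False])
seen = np.array([True, False])
out = correct_prediction_bias(logits, flag, seen, penalty=0.3)
```

Only the seen-class score of the flagged proposal is lowered (0.9 becomes 0.6); unflagged proposals and unseen-class scores are untouched, so confident seen-class predictions elsewhere are preserved.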