Recent advances in vision-language models (VLMs) have drawn substantial attention to open-vocabulary semantic and part segmentation (OSPS). However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, which leads to knowledge interference between class-level semantics and spatial context. This paper therefore proposes a simple yet effective parallel cost aggregation paradigm (PCA-Seg) that alleviates this interference and enables the model to capture richer vision-language alignment information from cost volumes. Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates the semantic and contextual streams. It incorporates a multi-expert parser that extracts complementary features from multiple perspectives, and a coefficient mapper that adaptively learns pixel-specific weights for each feature, so that the complementary knowledge is integrated into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, allowing the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that PCA-Seg achieves state-of-the-art OSPS performance, while each parallel block adds only 0.35M parameters.
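The two mechanisms named in the abstract, decoupling the semantic and contextual streams and fusing expert outputs with pixel-specific weights, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; everything in it is an assumption made for illustration. The decoupling step is a plain per-pixel Gram-Schmidt projection standing in for FOD, the experts are simple linear maps, the coefficient mapper is a linear layer followed by a softmax, and the two streams are merged by addition. The function names (`orthogonalize`, `fuse_experts`) and all shapes are hypothetical.

```python
import numpy as np


def orthogonalize(semantic, contextual, eps=1e-8):
    """Remove from `contextual` its per-pixel projection onto `semantic`.

    Assumption: a single Gram-Schmidt step stands in for the FOD strategy.
    Both inputs have shape (H, W, C).
    """
    dot = np.sum(semantic * contextual, axis=-1, keepdims=True)
    norm_sq = np.sum(semantic * semantic, axis=-1, keepdims=True)
    return contextual - (dot / (norm_sq + eps)) * semantic


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def fuse_experts(features, expert_weights, mapper_weight):
    """Fuse the outputs of several experts with pixel-specific weights.

    Assumptions: each expert is a linear map (C -> C), and the coefficient
    mapper is a linear layer (C -> E) followed by a softmax over experts.
    features: (H, W, C); expert_weights: (E, C, C); mapper_weight: (C, E).
    """
    # Apply every expert to every pixel: result has shape (H, W, E, C).
    expert_out = np.einsum("hwc,ecd->hwed", features, expert_weights)
    # One weight per pixel and per expert, summing to 1 over experts.
    coeffs = softmax(features @ mapper_weight, axis=-1)  # (H, W, E)
    # Weighted sum over the expert axis gives one embedding per pixel.
    fused = np.sum(coeffs[..., None] * expert_out, axis=2)  # (H, W, C)
    return fused, coeffs


# Toy usage with random data (hypothetical sizes).
rng = np.random.default_rng(0)
H, W, C, E = 4, 5, 8, 3
semantic = rng.normal(size=(H, W, C))
contextual = rng.normal(size=(H, W, C))

contextual_orth = orthogonalize(semantic, contextual)
combined = semantic + contextual_orth  # assumption: streams merged by addition
fused, coeffs = fuse_experts(
    combined,
    rng.normal(size=(E, C, C)),
    rng.normal(size=(C, E)),
)
```

After the projection step, the contextual features carry no component along the semantic features at any pixel, which is one simple way to read "mitigating redundancy" between the two streams. The softmax keeps the per-pixel expert weights non-negative and summing to one, so the fused embedding is a convex combination of the expert outputs.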