More than the Sum of Its Parts: Ensembling Backbone Networks for Few-Shot Segmentation

Semantic segmentation is a key prerequisite to robust image understanding for applications in \acrlong{ai} and Robotics. \acrlong{fss}, in particular, concerns the extension and optimization of traditional segmentation methods in challenging conditions where only a few training examples are available. A predominant approach in \acrlong{fss} is to rely on a single backbone for visual feature extraction, so the choice of backbone is a deciding factor in overall performance. In this work, we investigate whether fusing features from different backbones can improve the ability of \acrlong{fss} models to capture richer visual features. To answer this question, we propose and compare two ensembling techniques: Independent Voting and Feature Fusion. Among the available \acrlong{fss} methods, we implement the proposed ensembling techniques on PANet. The module that predicts segmentation masks from the backbone embeddings in PANet has no trainable parameters, which creates a controlled `in vitro' setting for isolating the impact of different ensembling strategies. By leveraging the complementary strengths of different backbones, our approach outperforms the original single-backbone PANet across standard benchmarks, even in challenging one-shot learning scenarios. Specifically, it achieves a performance improvement of +7.37\% on PASCAL-5\textsuperscript{i} and of +10.68\% on COCO-20\textsuperscript{i} in the top-performing scenario, where three backbones are combined. These results, together with a qualitative inspection of the predicted subject masks, suggest that relying on multiple backbones in PANet leads to a more comprehensive feature representation, thus facilitating the successful application of \acrlong{fss} methods in challenging, data-scarce environments.
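To make the contrast between the two ensembling strategies concrete, the following is a minimal NumPy sketch and not the implementation evaluated in this work. It assumes a PANet-style parameter-free head (masked average pooling of support features into prototypes, followed by cosine-similarity matching of query pixels), that all backbones produce feature maps of the same spatial size, and that Independent Voting averages per-backbone class probabilities while Feature Fusion concatenates features along the channel axis. All function names, the scale factor of 20, and the one-shot, single-class setup are illustrative assumptions.

```python
import numpy as np


def masked_average_pool(feats, mask):
    """Average the (C, H, W) features over the pixels selected by a binary (H, W) mask."""
    weights = mask / (mask.sum() + 1e-8)
    return (feats * weights).sum(axis=(1, 2))


def cosine_scores(query, prototypes, scale=20.0):
    """Softmax over scaled cosine similarities between query pixels and prototypes.

    query: (C, H, W); prototypes: (K, C). Returns (K, H, W) probabilities.
    """
    q = query / (np.linalg.norm(query, axis=0, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    logits = scale * np.einsum("kc,chw->khw", p, q)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=0, keepdims=True)


def predict(support, mask, query):
    """Parameter-free one-shot prediction: index 0 is background, 1 is foreground."""
    prototypes = np.stack([
        masked_average_pool(support, 1 - mask),  # background prototype
        masked_average_pool(support, mask),      # foreground prototype
    ])
    return cosine_scores(query, prototypes)


def independent_voting(supports, mask, queries):
    """Assumed variant: each backbone predicts on its own, probabilities are averaged."""
    probs = [predict(s, mask, q) for s, q in zip(supports, queries)]
    return np.mean(probs, axis=0).argmax(axis=0)


def feature_fusion(supports, mask, queries):
    """Assumed variant: features are concatenated channel-wise before matching."""
    fused_support = np.concatenate(supports, axis=0)
    fused_query = np.concatenate(queries, axis=0)
    return predict(fused_support, mask, fused_query).argmax(axis=0)
```

Here `supports` and `queries` are lists with one `(C_b, H, W)` feature map per backbone; the channel counts `C_b` may differ across backbones, but the spatial size must match the mask. Because neither function introduces trainable parameters, any difference between their outputs in this sketch comes only from how the backbone features are combined.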
