Camouflaged object detection (COD) aims to segment objects that blend visually into their surroundings, a task that is extremely complex and difficult in real-world scenarios. Apart from the high intrinsic similarity between camouflaged objects and their background, the objects are usually diverse in scale, fuzzy in appearance, and even severely occluded. To this end, we propose an effective unified collaborative pyramid network that mimics human behavior when observing vague images and videos, \textit{i.e.}, zooming in and out. Specifically, our approach employs the zooming strategy to learn discriminative mixed-scale semantics through multi-head scale integration and rich granularity perception units, which are designed to fully explore imperceptible clues between candidate objects and their background surroundings. The former's intrinsic multi-head aggregation provides more diverse visual patterns. The latter's routing mechanism can effectively propagate inter-frame differences in spatiotemporal scenarios and adaptively ignore static representations. Together, they provide a solid foundation for realizing a unified architecture for static and dynamic COD. Moreover, considering the uncertainty and ambiguity derived from indistinguishable textures, we construct a simple yet effective regularization, the uncertainty awareness loss, to encourage predictions with higher confidence in candidate regions. Our highly task-friendly framework consistently outperforms existing state-of-the-art methods on image and video COD benchmarks. The code will be available at \url{https://github.com/lartpang/ZoomNeXt}.
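The abstract does not state the functional form of the uncertainty awareness loss, so the following is only a minimal sketch of one way such a regularizer could encourage confident predictions. It assumes a per-pixel quadratic penalty, 1 - |2p - 1|^2, which is largest when the predicted foreground probability p is 0.5 and vanishes as p approaches 0 or 1. The function names, the penalty form, and the toy logit maps are all illustrative assumptions and are not taken from the released code.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def uncertainty_penalty(prob: float) -> float:
    """Assumed per-pixel penalty: 1.0 at prob = 0.5, 0.0 at prob = 0 or 1."""
    return 1.0 - abs(2.0 * prob - 1.0) ** 2


def uncertainty_awareness_loss(logits) -> float:
    """Mean penalty over a 2D map of raw logits (a list of rows)."""
    probs = [sigmoid(v) for row in logits for v in row]
    return sum(uncertainty_penalty(p) for p in probs) / len(probs)


# Hypothetical 2x2 logit maps: one ambiguous, one confident.
ambiguous_map = [[0.0, 0.1], [-0.1, 0.0]]
confident_map = [[6.0, -6.0], [5.0, -7.0]]

ambiguous_loss = uncertainty_awareness_loss(ambiguous_map)
confident_loss = uncertainty_awareness_loss(confident_map)
```

Under this assumed form, the ambiguous map receives a larger penalty than the confident one, so adding the term to a segmentation loss would push predictions away from 0.5. It does not use the ground truth, so it would act as a regularizer alongside a supervised loss rather than replace it.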