Leveraging spatiotemporal information in videos is critical for weakly supervised video object localization (WSVOL) tasks. However, state-of-the-art methods rely only on visual and motion cues and discard discriminative information, which makes them susceptible to inaccurate localizations. Recently, discriminative models have been explored for WSVOL tasks using a temporal class activation mapping (CAM) method. Although the results are promising, this approach assumes that objects move little from frame to frame, which degrades performance over relatively long-term dependencies. This paper proposes a novel CAM method for WSVOL that exploits spatiotemporal information in activation maps during training without constraining an object's position. Its training relies on co-localization, hence the name CoLo-CAM. Given a sequence of frames, localization is learned jointly from color cues extracted across the corresponding maps, under the assumption that an object has similar color in consecutive frames. CAM activations are constrained to respond similarly over pixels with similar colors, which achieves co-localization. This improves localization performance because joint learning creates direct communication among pixels across all image locations and over all frames, allowing localizations to be transferred, aggregated, and corrected. Co-localization is integrated into training by minimizing the color term of a conditional random field (CRF) loss over a sequence of frames/CAMs. Extensive experiments on two challenging YouTube-Objects datasets of unconstrained videos show the merits of our CoLo-CAM method and its robustness to long-term dependencies, leading to new state-of-the-art performance for the WSVOL task.
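To make the co-localization idea concrete, the following is a minimal sketch of a color-only CRF term computed jointly over all pixels of a frame sequence. The abstract does not give the exact formulation, so this sketch assumes a Gaussian color kernel and a Potts-style relaxation, as is common in dense CRF losses. The function name `color_crf_loss`, the bandwidth `sigma`, and the normalization by the number of pixels are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def color_crf_loss(frames, cams, sigma=0.1):
    """Color-only CRF term over a sequence of frames (illustrative sketch).

    frames: array of shape (T, H, W, 3), RGB values in [0, 1].
    cams:   array of shape (T, H, W), foreground probabilities in [0, 1].
    sigma:  assumed bandwidth of the Gaussian color kernel.
    """
    # Flatten all pixels of all frames into one set, so every pixel is
    # compared with every other pixel regardless of frame or position.
    colors = np.asarray(frames, dtype=np.float64).reshape(-1, 3)  # (N, 3)
    fg = np.asarray(cams, dtype=np.float64).reshape(-1)           # (N,)
    bg = 1.0 - fg

    # Pairwise color affinity: close to 1 for similar colors, close to 0 otherwise.
    diff = colors[:, None, :] - colors[None, :, :]
    affinity = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))  # (N, N)

    # Potts-style relaxation: penalize pairs with similar colors but
    # different foreground/background assignments.
    penalty = fg @ affinity @ bg + bg @ affinity @ fg
    return penalty / fg.size
```

The quadratic cost in the number of pixels makes this form practical only for small or subsampled maps. It does show the key property, though: no spatial or temporal distance enters the kernel, so activations are tied only by color and the object's position is left unconstrained.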