Video Camouflaged Object Detection (VCOD) is currently constrained by the scarcity of challenging benchmarks and the limited robustness of models against erratic motion dynamics. Existing methods often struggle with Motion-Induced Appearance Instability and Temporal Feature Misalignment caused by complex motion scenarios. To address the data bottleneck, we present YUV20K, a pixel-level annotated, complexity-driven VCOD benchmark. Comprising 24,295 annotated frames across 91 scenes and 47 species, it specifically targets challenging scenarios such as large-displacement motion and camera motion, along with four other scenario types. On the methodological front, we propose a novel framework featuring two key modules: Motion Feature Stabilization (MFS) and Trajectory-Aware Alignment (TAA). The MFS module utilizes frame-agnostic Semantic Basis Primitives to stabilize features, while the TAA module leverages trajectory-guided deformable sampling to ensure precise temporal alignment. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art competitors on existing datasets and establishes a new baseline on the challenging YUV20K. Notably, our framework exhibits superior cross-domain generalization and robustness when confronting complex spatiotemporal scenarios. Our code and dataset will be available at https://github.com/K1NSA/YUV20K.
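The abstract only names the MFS mechanism, so below is a minimal, hypothetical PyTorch sketch of one way frame-agnostic Semantic Basis Primitives could stabilize features: a dictionary of K learned basis vectors shared by every frame, onto which per-frame features are softly projected and then reconstructed. The class name MFSSketch, the residual connection, and all hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFSSketch(nn.Module):
    """Hypothetical sketch of Motion Feature Stabilization (MFS).

    Assumption: the frame-agnostic Semantic Basis Primitives are a
    learned dictionary of K vectors shared across all frames; each
    spatial feature is re-expressed as an attention-weighted sum of
    these primitives, damping appearance jitter between frames.
    """

    def __init__(self, channels: int = 256, num_primitives: int = 64):
        super().__init__()
        # Frame-agnostic basis: learned once, reused for every frame.
        self.primitives = nn.Parameter(torch.randn(num_primitives, channels))
        self.to_query = nn.Linear(channels, channels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, N, C) features of T consecutive frames,
        # with N = H * W flattened spatial locations.
        q = self.to_query(feats)                              # (B, T, N, C)
        logits = torch.einsum("btnc,kc->btnk", q, self.primitives)
        attn = F.softmax(logits / q.shape[-1] ** 0.5, dim=-1)  # (B, T, N, K)
        # Reconstruct each location from the shared primitives only.
        stabilized = torch.einsum("btnk,kc->btnc", attn, self.primitives)
        return feats + stabilized                             # residual output
```

Because the dictionary is shared across time, two frames whose raw features drift apart are pulled toward the same small set of semantic prototypes, which is one plausible reading of "stabilization".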
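For the TAA module, the following is a similarly hedged sketch of trajectory-guided deformable sampling, assuming per-pixel trajectories are supplied as flow-like offsets from the reference frame into a neighbor frame; a small convolutional head predicts residual deformable offsets, and neighbor features are gathered with bilinear grid sampling. TAASketch, the offset head, and the input format are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAASketch(nn.Module):
    """Hypothetical sketch of Trajectory-Aware Alignment (TAA).

    Assumption: trajectories arrive as (B, 2, H, W) pixel offsets
    mapping reference-frame locations into the neighbor frame; a
    residual offset refines each trajectory end point before
    bilinear sampling warps the neighbor features onto the reference.
    """

    def __init__(self, channels: int = 256):
        super().__init__()
        # Residual offsets refine the trajectory-predicted locations.
        self.offset_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)

    def forward(self, ref: torch.Tensor, nbr: torch.Tensor,
                traj: torch.Tensor) -> torch.Tensor:
        # ref, nbr: (B, C, H, W); traj: (B, 2, H, W) offsets in pixels.
        B, _, H, W = ref.shape
        residual = self.offset_head(ref)                   # (B, 2, H, W)
        # Base sampling grid in pixel coordinates.
        ys, xs = torch.meshgrid(
            torch.arange(H, device=ref.device, dtype=ref.dtype),
            torch.arange(W, device=ref.device, dtype=ref.dtype),
            indexing="ij")
        grid_x = xs + traj[:, 0] + residual[:, 0]          # (B, H, W)
        grid_y = ys + traj[:, 1] + residual[:, 1]
        # Normalize to [-1, 1] as required by grid_sample.
        grid = torch.stack(
            (2 * grid_x / (W - 1) - 1, 2 * grid_y / (H - 1) - 1), dim=-1)
        aligned = F.grid_sample(nbr, grid, align_corners=True)
        return aligned                                     # nbr warped onto ref
```

Letting the trajectory carry the large displacement and the learned residual handle only small corrections is a common design choice for alignment under large motion, which matches the large-displacement scenarios the benchmark emphasizes.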