Amodal perception, the ability to comprehend complete object structures from partial visibility, is a fundamental skill, even for infants. Its significance extends to applications like autonomous driving, where a clear understanding of heavily occluded objects is essential. However, modern detection and tracking algorithms often overlook this critical capability, perhaps due to the prevalence of modal annotations in most datasets. To address the scarcity of amodal data, we introduce the TAO-Amodal benchmark, featuring 880 diverse categories in thousands of video sequences. Our dataset includes amodal and modal bounding boxes for visible and occluded objects, including objects that are partially out-of-frame. To enhance amodal tracking with object permanence, we leverage a lightweight plug-in module, the amodal expander, which transforms standard, modal trackers into amodal ones through fine-tuning on a few hundred video sequences with data augmentation. On TAO-Amodal, we achieve improvements of 3.3% and 1.6% in the detection and tracking of occluded objects, respectively. When evaluated on people, our method produces a dramatic 2x improvement compared to state-of-the-art modal baselines.