Amodal perception, the ability to comprehend complete object structures from partial visibility, is a fundamental skill that even infants possess. Its significance extends to applications like autonomous driving, where a clear understanding of heavily occluded objects is essential. However, modern detection and tracking algorithms often overlook this critical capability, perhaps because most datasets provide only modal annotations. To address the scarcity of amodal data, we introduce the TAO-Amodal benchmark, featuring 880 diverse categories in thousands of video sequences. Our dataset includes amodal and modal bounding boxes for visible and occluded objects, including objects that are partially out-of-frame. To enhance amodal tracking with object permanence, we leverage a lightweight plug-in module, the amodal expander, which transforms standard modal trackers into amodal ones through fine-tuning on a few hundred video sequences with data augmentation. We achieve improvements of 3.3% and 1.6% on the detection and tracking, respectively, of occluded objects on TAO-Amodal. When evaluated on people, our method produces dramatic improvements of 2x compared to state-of-the-art modal baselines.
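To make the notion of a partially out-of-frame object concrete, the following minimal sketch shows how an amodal box that extends past the image border relates to its in-frame portion. It is an illustration only, not the TAO-Amodal annotation format or the amodal expander: the `(x1, y1, x2, y2)` pixel convention and the helper names `clip_to_frame`, `box_area`, and `in_frame_fraction` are assumptions made for this example, and the sketch covers only truncation by the frame border, not occlusion by other objects.

```python
from typing import Tuple

# Assumed convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def clip_to_frame(box: Box, width: float, height: float) -> Box:
    """Clip a box to the image bounds [0, width] x [0, height]."""
    x1, y1, x2, y2 = box
    return (
        max(0.0, min(x1, width)),
        max(0.0, min(y1, height)),
        max(0.0, min(x2, width)),
        max(0.0, min(y2, height)),
    )


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes have zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def in_frame_fraction(amodal_box: Box, width: float, height: float) -> float:
    """Fraction of the amodal box area that lies inside the frame."""
    total = box_area(amodal_box)
    if total == 0.0:
        return 0.0
    return box_area(clip_to_frame(amodal_box, width, height)) / total


# Hypothetical object whose full extent starts 50 px left of a 640x480 frame.
amodal = (-50.0, 100.0, 150.0, 300.0)
in_frame = clip_to_frame(amodal, 640.0, 480.0)
fraction = in_frame_fraction(amodal, 640.0, 480.0)
```

Here the amodal box keeps the full extent of the object, while `in_frame` covers only the part a frame-bounded box could describe. A modal annotation would additionally shrink to exclude regions hidden behind other objects, which this sketch does not model.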