We study a rarely explored task, Insubstantial Object Detection (IOD), which aims to localize objects with the following characteristics: (1) amorphous shape with indistinct boundaries; (2) similarity to the surroundings; (3) absence of color. These properties make insubstantial objects far more difficult to distinguish in a single static frame, so a joint representation of spatial and temporal information is crucial. We therefore construct the IOD-Video dataset, comprising 600 videos (141,017 frames) that cover various distances, sizes, visibility levels, and scenes captured in different spectral ranges. In addition, we develop a spatio-temporal aggregation framework for IOD, in which different backbones are deployed and a spatio-temporal aggregation loss (STAloss) is designed to leverage consistency along the time axis. Experiments on the IOD-Video dataset demonstrate that spatio-temporal aggregation can significantly improve IOD performance. We hope our work will attract further research into this valuable yet challenging task. The code will be available at: \url{https://github.com/CalayZhou/IOD-Video}.