Small Video Object Detection (SVOD) is a crucial subfield of modern computer vision, essential for the early discovery and detection of objects. However, existing SVOD datasets are scarce and suffer from issues such as insufficiently small objects, limited object categories, and a lack of scene diversity, leaving the corresponding methods with narrow application scenarios. To address this gap, we develop the XS-VID dataset, which comprises aerial data from various periods and scenes and annotates eight major object categories. To further evaluate existing methods on extremely small objects, XS-VID extensively collects three types of objects with small pixel areas: extremely small (\textit{es}, $0\sim12^2$), relatively small (\textit{rs}, $12^2\sim20^2$), and generally small (\textit{gs}, $20^2\sim32^2$). XS-VID offers unprecedented breadth and depth in covering and quantifying minuscule objects, significantly enriching scene and object diversity. Extensive validation on XS-VID and the publicly available VisDrone2019VID dataset shows that existing methods struggle with small object detection and significantly underperform general object detectors. Leveraging the strengths of previous methods and addressing their weaknesses, we propose YOLOFT, which enhances local feature associations and integrates temporal motion features, significantly improving the accuracy and stability of SVOD. Our datasets and benchmarks are available at \url{https://gjhhust.github.io/XS-VID/}.
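The three size bins above are defined by pixel area. A minimal sketch of how a box might be assigned to a bin, assuming the interval endpoints are inclusive on the upper side (the abstract does not state boundary conventions, and the helper name `size_category` is hypothetical):

```python
def size_category(width: float, height: float) -> str:
    """Assign a bounding box to one of XS-VID's small-object bins by pixel area.

    Thresholds follow the abstract: es (0 to 12^2), rs (12^2 to 20^2),
    gs (20^2 to 32^2); larger boxes fall outside the "small" bins.
    Upper-inclusive boundaries are an assumption for illustration.
    """
    area = width * height
    if area <= 12 ** 2:    # 0 ~ 144 px^2
        return "es"        # extremely small
    if area <= 20 ** 2:    # 144 ~ 400 px^2
        return "rs"        # relatively small
    if area <= 32 ** 2:    # 400 ~ 1024 px^2
        return "gs"        # generally small
    return "not-small"
```

For example, a 10x10 box (100 px^2) falls in \textit{es}, while a 25x25 box (625 px^2) falls in \textit{gs}.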