The past few years have seen increased interest in aerial image object detection due to its critical value to large-scale geoscientific research such as environmental studies, urban planning, and intelligence monitoring. However, the task is very challenging due to the bird's-eye-view perspective, complex backgrounds, large and varied image sizes, the diverse appearances of objects, and the scarcity of well-annotated datasets. Recent advances in computer vision have shown promise in tackling this challenge. Specifically, the Vision Transformer Detector (ViTDet) was proposed to extract multi-scale features for object detection. Empirical results show that ViTDet's simple design achieves good performance on natural scene images and can be easily embedded into any detector architecture. To date, ViTDet's potential benefit to challenging aerial image object detection has not been explored. Therefore, in our study, 25 experiments were carried out to evaluate the effectiveness of ViTDet for aerial image object detection on three well-known datasets: Airbus Aircraft, RarePlanes, and the Dataset of Object DeTection in Aerial images (DOTA). Our results show that ViTDet consistently outperforms its convolutional neural network counterparts on horizontal bounding box (HBB) object detection by a large margin (up to 17% in average precision) and that it achieves competitive performance on oriented bounding box (OBB) object detection. Our results also establish a baseline for future research.