Towards a High-Performance Object Detector: Insights from Drone Detection Using ViT and CNN-based Deep Learning Models

Accurate drone detection is in strong demand for drone collision avoidance, drone defense, and autonomous Unmanned Aerial Vehicle (UAV) self-landing. Following the recent emergence of the Vision Transformer (ViT), this paper reassesses the task using a UAV dataset of 1359 drone photos. We construct a range of CNN-based and ViT-based models and show that, for single-drone detection, a basic ViT can achieve performance 4.6 times more robust than our best CNN-based transfer learning models. For multi-drone detection, we implement the state-of-the-art You Only Look Once (YOLO v7, 200 epochs) and the experimental ViT-based You Only Look At One Sequence (YOLOS, 20 epochs), attaining mAP values of 98% and 96%, respectively. We find that ViT outperforms CNN after the same number of training epochs, but that it also requires more training data, more computational power, and sophisticated, performance-oriented designs to fully surpass cutting-edge CNN detectors. We summarize the distinct characteristics of ViT and CNN models to aid future researchers in developing more efficient deep learning models.

