Vision transformers (ViTs) and their variants have swept through visual learning leaderboards, offering state-of-the-art accuracy in tasks such as image classification, object detection, and semantic segmentation by attending to different parts of the visual input and capturing long-range spatial dependencies. However, these models are large and computationally expensive. For instance, the recently proposed ViT-B model has 86M parameters, which makes it impractical to deploy on resource-constrained devices. As a result, the use of such models in mobile and edge scenarios is limited. In this work, we take a step toward bringing vision transformers to the edge by applying popular model compression techniques such as distillation, pruning, and quantization. Our chosen application environment is a battery-powered, memory-constrained unmanned aerial vehicle (UAV) carrying a single-board computer on the scale of an NVIDIA Jetson Nano with 4GB of RAM. At the same time, the UAV requires accuracy close to that of state-of-the-art ViTs to ensure safe object avoidance in autonomous navigation or correct localization of humans in search-and-rescue. Inference latency must also be minimized to meet the application requirements. Hence, our target is to enable rapid inference of a vision transformer on an NVIDIA Jetson Nano (4GB) with minimal accuracy loss. This would allow ViTs to be deployed on resource-constrained devices, opening up new possibilities in applications such as surveillance and environmental monitoring. Our implementation is available at https://github.com/chensy7/efficient-vit.