We introduce YOGA, a lightweight deep-learning-based object detection model that can operate on low-end edge devices while still achieving competitive accuracy. The YOGA architecture consists of a two-phase feature learning pipeline with a cheap linear transformation, which learns feature maps using only half of the convolution filters required by conventional convolutional neural networks. In addition, it performs multi-scale feature fusion in its neck using an attention mechanism instead of the naive concatenation used by conventional detectors. YOGA is a flexible model that can be easily scaled up or down by several orders of magnitude to fit a broad range of hardware constraints. We evaluate YOGA on the COCO-val and COCO-testdev datasets against more than 10 other state-of-the-art object detectors. The results show that YOGA strikes the best trade-off between model size and accuracy (up to a 22% increase in AP and a 23-34% reduction in parameters and FLOPs), making it an ideal choice for deployment in the wild on low-end edge devices. This is further affirmed by our hardware implementation and evaluation on an NVIDIA Jetson Nano.
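To make the "half of the convolution filters" claim concrete, the following is a minimal parameter-counting sketch for a single layer. It is not the authors' implementation: the abstract does not specify the form of the cheap linear transformation, so the sketch assumes it is a depthwise `d x d` convolution applied to the maps produced in the first phase. The function names (`conv_params`, `two_phase_params`) and the layer sizes are hypothetical and chosen only for illustration.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def two_phase_params(c_in, c_out, k, d=3):
    """Weight count of a two-phase layer that outputs c_out feature maps.

    Phase 1: an ordinary convolution produces only half of the output maps.
    Phase 2 (assumption): a depthwise d x d convolution, one filter per
    phase-1 map, generates the remaining half cheaply.
    """
    half = c_out // 2
    primary = conv_params(c_in, half, k)  # half the usual filters
    cheap = half * d * d                  # depthwise: one d x d filter per map
    return primary + cheap


# Hypothetical layer: 128 input channels, 256 output channels, 3 x 3 kernels.
standard = conv_params(128, 256, 3)
two_phase = two_phase_params(128, 256, 3)
ratio = two_phase / standard

print(f"standard:  {standard} weights")
print(f"two-phase: {two_phase} weights")
print(f"ratio:     {ratio:.3f}")
```

Under these assumptions the two-phase layer needs roughly half the weights of the standard one, since the depthwise term is small next to the primary convolution. This per-layer figure is distinct from the 23-34% whole-model reduction reported above, which also depends on the parts of the network that the sketch does not model.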