With the increasing ubiquity of AR/VR devices, deploying deep learning models on edge devices has become a critical challenge. These devices require real-time inference, low power consumption, and minimal latency, so framework designers face a trade-off between efficiency and performance. We design a lightweight framework that adopts an encoder-decoder architecture and introduces several contributions aimed at improving both efficiency and accuracy. First, we apply sparse convolution on a ResNet-18 backbone to exploit the inherent sparsity of hand pose images, achieving a 42% end-to-end efficiency improvement. Second, we propose the SPLite decoder, a new architecture that raises the decoding frame rate by 3.1x on the Raspberry Pi 5 while maintaining comparable accuracy. Third, we apply quantization-aware training, which reduces memory usage while preserving accuracy (PA-MPJPE increases only marginally, from 9.0 mm to 9.1 mm on FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5 CPU (BCM2712 quad-core Arm A76 processor). We also evaluate our method on compound benchmark datasets, where it demonstrates accuracy comparable to state-of-the-art approaches while significantly improving computational efficiency.
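The accuracy figures above are reported as PA-MPJPE, the mean per-joint position error measured after the prediction has been aligned to the ground truth by a similarity transform (Procrustes alignment). As a rough illustration of how such a number is typically computed, the sketch below implements the commonly used definition with NumPy. It is not the authors' evaluation code: the function name `pa_mpjpe` is hypothetical, and the inputs are assumed to be a single hand given as `(J, 3)` arrays of joint positions in millimetres (for example, J = 21 for FreiHAND).

```python
import numpy as np


def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-joint position error (hypothetical helper).

    pred, gt: arrays of shape (J, 3) holding 3D joint positions in the same
    unit (assumed here to be millimetres). Returns the mean Euclidean
    distance after aligning `pred` to `gt` with the best-fitting scale,
    rotation, and translation.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Remove translation by centring both joint sets.
    mu_pred = pred.mean(axis=0)
    mu_gt = gt.mean(axis=0)
    p = pred - mu_pred
    g = gt - mu_gt

    # Optimal rotation from the SVD of the 3x3 cross-covariance matrix.
    h = p.T @ g
    u, s, vt = np.linalg.svd(h)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    correction = np.array([1.0, 1.0, d])
    rot = vt.T @ np.diag(correction) @ u.T

    # Optimal isotropic scale for the centred prediction.
    scale = (s * correction).sum() / (p ** 2).sum()

    # Apply the similarity transform, then average the per-joint errors.
    aligned = scale * (p @ rot.T) + mu_gt
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```

Because the alignment removes global scale, rotation, and translation, a prediction that differs from the ground truth only by such a transform yields an error close to zero; the metric therefore reflects the articulated shape of the hand rather than its absolute placement.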