Real-time object detection in AR/VR systems faces critical computational constraints, requiring sub-10\,ms latency within tight power budgets. Inspired by biological foveal vision, we propose a two-stage pipeline that combines differentiable weightless neural networks for ultra-efficient gaze estimation with attention-guided region-of-interest (ROI) object detection. Our approach performs gaze tracking through memory lookups rather than multiply-accumulate (MAC) computations, avoiding arithmetic-intensive operations and achieving an angular error of $8.32^{\circ}$ with only 393 MACs and 2.2 KiB of memory per frame. Gaze predictions guide selective object detection on attended regions, reducing computational burden by 40--50\% and energy consumption by 65\%. Deployed on the Arduino Nano 33 BLE, our system achieves 48.1\% mAP on COCO (51.8\% on attended objects) while maintaining sub-10\,ms latency, meeting stringent AR/VR requirements with a $177\times$ improvement in communication time. Compared to the global YOLOv12n baseline, which achieves 39.2\%, 63.4\%, and 83.1\% accuracy for small, medium, and large objects, respectively, the ROI-based method yields 51.3\%, 72.1\%, and 88.1\% under the same settings. This work shows that memory-centric architectures with explicit attention modeling offer better efficiency and accuracy for resource-constrained wearable platforms than uniform processing.
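To make the memory-lookup idea concrete, the following is a minimal Python sketch of inference in a lookup-table (RAM) node of a weightless network. It is an illustration under stated assumptions, not the authors' implementation: the names `LutNode`, `thermometer_encode`, and `layer_popcount`, the thermometer encoding, and the table contents are all hypothetical, and the differentiable training procedure is not shown.

```python
class LutNode:
    """One weightless node: selected input bits form an address into a table.

    Hypothetical illustration; in a trained network the table entries
    would be learned offline.
    """

    def __init__(self, input_indices, table):
        assert len(table) == 2 ** len(input_indices)
        self.input_indices = input_indices  # which input bits this node reads
        self.table = table                  # one stored bit per address

    def forward(self, bits):
        # Pack the selected bits into an integer address using only
        # shifts and ORs, then read the stored value. No multiplies.
        address = 0
        for i in self.input_indices:
            address = (address << 1) | bits[i]
        return self.table[address]


def thermometer_encode(value, thresholds):
    """Binarize a scalar into a unary code: one bit per threshold exceeded."""
    return [1 if value > t else 0 for t in thresholds]


def layer_popcount(bits, nodes):
    """Sum the binary outputs of a group of nodes (a simple readout)."""
    return sum(node.forward(bits) for node in nodes)


# Example: a 2-input node whose table stores XOR of input bits 0 and 2.
xor_node = LutNode([0, 2], [0, 1, 1, 0])
inverter = LutNode([1], [1, 0])
score = layer_popcount([1, 0, 0], [xor_node, inverter])
```

In this sketch each node costs one table read per inference, which is the sense in which lookups can stand in for multiply-accumulate operations. How the node outputs map to a gaze angle is an assumption here; the popcount readout is only one plausible choice.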
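The gaze-guided second stage can be sketched in the same spirit. The Python snippet below computes a crop window centred on a predicted gaze point, so that a detector would run only on the attended region. The function names, the normalized gaze convention, and the `roi_frac` parameter are hypothetical, and the value 0.5 is purely illustrative rather than a setting reported in this work.

```python
def gaze_roi(gaze_x, gaze_y, img_w, img_h, roi_frac=0.5):
    """Return (left, top, right, bottom) of a crop centred on the gaze point.

    gaze_x, gaze_y: gaze position normalized to [0, 1] (assumed convention).
    roi_frac: crop side length as a fraction of each image dimension
              (hypothetical parameter).
    """
    roi_w = int(img_w * roi_frac)
    roi_h = int(img_h * roi_frac)
    cx = int(gaze_x * img_w)
    cy = int(gaze_y * img_h)
    # Clamp so the window stays fully inside the image.
    left = min(max(cx - roi_w // 2, 0), img_w - roi_w)
    top = min(max(cy - roi_h // 2, 0), img_h - roi_h)
    return left, top, left + roi_w, top + roi_h


def pixel_fraction(box, img_w, img_h):
    """Fraction of image pixels covered by the crop box."""
    left, top, right, bottom = box
    return (right - left) * (bottom - top) / (img_w * img_h)


# Example: gaze at the image centre of a 640x480 frame.
box = gaze_roi(0.5, 0.5, 640, 480)
frac = pixel_fraction(box, 640, 480)
```

The detector's actual savings depend on its architecture and on how the crop is resized, so `pixel_fraction` should be read only as a rough proxy for the reduced workload, not as a reproduction of the reported 40--50\% figure.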