This paper demonstrates the feasibility of transformer-based split inference for real-time video object detection over dynamic 5G AI-RAN networks. We extend throughput-aware adaptive splitting from CNNs to a Swin Transformer backbone and show that practical split execution is achievable for transformer-based vision models without retraining. To address the large intermediate activations inherent to transformers, we introduce an efficient, accuracy-preserving activation compression pipeline that substantially reduces uplink payload. The complete system, comprising adaptive split selection, transformer inference, and compression, is implemented and validated end-to-end on a real-time detection workload, with distributed UPF (dUPF) integration further reducing user-plane latency and improving runtime stability. Extensive measurements on an NVIDIA Aerial-based AI-RAN testbed jointly account for inference and 5G communication energy, quantifying the latency-energy-privacy trade-offs in realistic deployments.
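To make the idea of throughput-aware split selection concrete, the following is a minimal sketch and not the system's actual policy. It assumes each candidate split point is described by an on-device compute time, a compressed activation payload size, and an edge-side compute time, and it picks the split with the lowest estimated end-to-end latency for the currently measured uplink throughput. All names (`SplitOption`, `select_split`, `OPTIONS`) and all numbers are hypothetical, chosen only for illustration. Energy and privacy, which the measurements also consider, are left out of this objective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitOption:
    """One candidate split point (all values are illustrative assumptions)."""
    name: str
    device_ms: float    # compute time on the device, before the split
    payload_bytes: int  # compressed activation size sent over the uplink
    server_ms: float    # compute time on the edge server, after the split


def estimated_latency_ms(opt: SplitOption, uplink_mbps: float) -> float:
    """Estimate end-to-end latency as device + transmission + server time."""
    tx_ms = opt.payload_bytes * 8 / (uplink_mbps * 1e6) * 1e3
    return opt.device_ms + tx_ms + opt.server_ms


def select_split(options: list[SplitOption], uplink_mbps: float) -> SplitOption:
    """Return the option with the lowest estimated latency at this throughput."""
    if uplink_mbps <= 0:
        raise ValueError("uplink throughput must be positive")
    return min(options, key=lambda o: estimated_latency_ms(o, uplink_mbps))


# Hypothetical profile: an early split ships a large activation but offloads
# most compute; a late split ships little but keeps most compute on-device.
OPTIONS = [
    SplitOption("early", device_ms=5.0, payload_bytes=1_000_000, server_ms=20.0),
    SplitOption("late", device_ms=40.0, payload_bytes=100_000, server_ms=8.0),
]

if __name__ == "__main__":
    for mbps in (400.0, 20.0):
        best = select_split(OPTIONS, mbps)
        print(f"{mbps:.0f} Mbps: {best.name}, "
              f"{estimated_latency_ms(best, mbps):.1f} ms")
```

With these made-up numbers the early split is preferred when the uplink is fast and the late split when it is slow, which is the behavior an adaptive scheme relies on as 5G throughput fluctuates.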