Vision-Language Models (VLMs) enable multimodal reasoning for robotic perception and interaction, but their deployment in real-world systems remains constrained by latency, limited onboard resources, and the privacy risks of cloud offloading. Edge intelligence within 6G, particularly Open RAN and Multi-access Edge Computing (MEC), offers a pathway to address these challenges by bringing computation closer to the data source. This work investigates the deployment of VLMs on Open RAN/MEC infrastructure using the Unitree G1 humanoid robot as an embodied testbed. We design a WebRTC-based pipeline that streams multimodal data to an edge node and evaluate LLaMA-3.2-11B-Vision-Instruct deployed at the edge versus in the cloud under real-time conditions. Our results show that edge deployment preserves near-cloud accuracy while reducing end-to-end latency by 5\%. We further evaluate Qwen2-VL-2B-Instruct, a compact model optimized for resource-constrained environments, which achieves sub-second responsiveness, cutting latency by more than half, though at the cost of reduced accuracy.