Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.