Planning-aligned Token Compression for Long-Context Autonomous Driving

Zhixuan Liang, Yuxiao Chen, Yurong You, Peter Karkus, Wenhao Ding, Boyi Li, Alexander Popov, Yan Wang, Maximilian Igl, Yiming Li, Danfei Xu, Nikolai Smolyanskiy, Boris Ivanovic, Ping Luo, Marco Pavone

From arXiv, 9 pages

Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding the extended temporal context needed for complex interactions. While approaches such as linear transformers and external memory aim to lighten the context, token compression is the most compatible with the architecture because it requires no backbone modifications. Yet existing compression methods rely on rule-based heuristics such as temporal decay, which are decoupled from planning and risk discarding decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on a conditional VQ-VAE that compresses extended context into bounded representations. Compression is conditioned on both the historical trajectory and a learned planning intent: the posterior encoder distills this intent from future trajectories during training, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, is fed to the policy for end-to-end optimization, so that planning retains decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and design behavioral metrics accordingly. Under comparable token budgets, we improve the success rate by more than 6% (to 68.3%), with consistent gains across metrics. Ablations validate the effectiveness of the planning-aligned coupling. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with a 3.3× speedup and a 2.7× memory reduction over uncompressed processing.
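To make the prior/posterior split concrete, here is a minimal sketch of the generic VQ-VAE quantization step, in which a continuous latent is snapped to its nearest codebook entry. It is not the authors' implementation. The function names, the 2-D latents, and the three-entry codebook are all assumptions chosen for illustration; the abstract specifies only that the posterior latent comes from future trajectories during training and that the prior predicts the intent from compressed observations.

```python
import math


def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector` (Euclidean)."""
    best_index = min(
        range(len(codebook)),
        key=lambda i: math.dist(vector, codebook[i]),
    )
    return best_index, codebook[best_index]


def select_intent(posterior_latent, prior_latent, codebook, training):
    """Pick the discrete intent code (hypothetical helper, not from the paper).

    Training: quantize the posterior latent (derived from the future trajectory).
    Inference: quantize the prior latent (predicted from compressed observations).
    """
    latent = posterior_latent if training else prior_latent
    return quantize(latent, codebook)


# Hypothetical 2-D codebook with three intent codes (values are made up).
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Made-up latents standing in for the posterior and prior encoder outputs.
train_index, _ = select_intent([0.9, 0.1], [0.1, 0.8], CODEBOOK, training=True)
infer_index, _ = select_intent([0.9, 0.1], [0.1, 0.8], CODEBOOK, training=False)
print(train_index, infer_index)  # prints "1 2"
```

In this toy setup the two latents land on different codes. A prior encoder that had learned to predict the intent well would, under these assumptions, tend to select the same code as the posterior.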
