In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources, which are limited on autonomous vehicles, due to its reliance on an LLM and the numerous visual tokens produced from sensor inputs. Many MLLM studies have explored reducing visual tokens, but the resulting methods often degrade end-task performance compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.
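To make the anchor-context merging idea concrete, the following is a minimal NumPy sketch, not the paper's actual implementation: token importance scores select anchors, each remaining context token is assigned to its most similar anchor by cosine similarity, and assigned tokens are averaged into that anchor. The function name, shapes, and the mean-merge rule are illustrative assumptions.

```python
import numpy as np

def reduce_tokens(tokens, scores, keep_ratio=0.25):
    """Illustrative anchor-context merging (assumed details).

    tokens: [N, D] visual token embeddings
    scores: [N] predicted importance scores
    Keeps the top-k tokens as anchors; each remaining (context)
    token is merged into its most similar anchor by averaging.
    """
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    order = np.argsort(scores)[::-1]            # sort by importance, descending
    anchor_idx, context_idx = order[:k], order[k:]
    anchors = tokens[anchor_idx].copy()
    # cosine similarity between context tokens and anchors
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = tokens[context_idx]
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    assign = (c @ a.T).argmax(axis=1)           # nearest anchor per context token
    counts = np.ones(k)                         # each anchor starts with itself
    for ci, ai in zip(context_idx, assign):
        anchors[ai] += tokens[ci]
        counts[ai] += 1
    return anchors / counts[:, None]            # mean of anchor + merged tokens

rng = np.random.default_rng(0)
toks = rng.standard_normal((16, 8))
reduced = reduce_tokens(toks, rng.random(16), keep_ratio=0.25)
print(reduced.shape)  # (4, 8): 16 tokens reduced to 4
```

With a 0.25 keep ratio, 16 tokens collapse to 4 merged anchors, so the LLM processes a quarter of the visual sequence; in SToRM the importance scores would come from the supervised predictor rather than being random.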