In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have made significant progress. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach demands substantial computation, since it relies on an LLM and processes numerous visual tokens from sensor inputs, while the compute available on autonomous vehicles is limited. Many MLLM studies have explored reducing visual tokens, but their methods often degrade end-task performance compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens and merges each context token into a relevant anchor, reducing redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, matching all-token performance while reducing computational cost by up to 30x and enabling real-time E2E driving on a standard GPU.
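The anchor-context merging idea described above can be illustrated with a minimal sketch. The abstract does not specify the scoring network, the similarity measure, or the merging rule, so the following is a hedged illustration only: it assumes importance scores are already available, assigns each context token to its most cosine-similar anchor, and merges by simple averaging. None of these choices should be read as SToRM's actual implementation.

```python
import numpy as np

def reduce_tokens(tokens, scores, num_anchors):
    """Illustrative anchor-context token reduction (assumed details).

    tokens: (N, D) visual token embeddings
    scores: (N,) importance scores, e.g. from a lightweight predictor
    num_anchors: number of tokens kept after reduction
    """
    # Highest-scoring tokens become anchors; the rest are context tokens.
    order = np.argsort(-scores)
    anchor_idx = np.sort(order[:num_anchors])
    context_idx = np.sort(order[num_anchors:])
    anchors = tokens[anchor_idx]
    context = tokens[context_idx]
    if context.shape[0] == 0:
        return anchors

    # Assumption: assign each context token to its most
    # cosine-similar anchor.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    assign = (c @ a.T).argmax(axis=1)

    # Assumption: merge by averaging each anchor with its
    # assigned context tokens, so no token is simply discarded.
    reduced = anchors.copy()
    for k in range(num_anchors):
        members = context[assign == k]
        if members.shape[0] > 0:
            reduced[k] = np.vstack([anchors[k : k + 1], members]).mean(axis=0)
    return reduced

rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 64)).astype(np.float32)  # e.g. one image's tokens
scores = rng.random(576)
out = reduce_tokens(tokens, scores, num_anchors=64)
print(out.shape)  # (64, 64): a 9x reduction in sequence length
```

Because every context token is folded into some anchor rather than dropped, this style of reduction trades sequence length for slight feature mixing, which is consistent with the paper's stated goal of minimizing information loss.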