Event-based multimodal large language models (MLLMs) enable robust perception in high-speed and low-light scenarios, addressing key limitations of frame-based MLLMs. However, current event-based MLLMs often adopt dense, image-like processing paradigms that ignore the spatiotemporal sparsity of event streams, incurring high computational cost. In this paper, we propose EventFlash, an efficient MLLM that exploits spatiotemporal token sparsification to reduce data redundancy and accelerate inference. Technically, we build EventMind, a large-scale, scene-diverse dataset of over 500k instruction sets that provides both short and long event-stream sequences to support our curriculum training strategy. We then present an adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens while retaining key temporal cues. Finally, a sparse density-guided attention module improves spatial token efficiency by selecting informative regions and suppressing empty or sparse areas. Experimental results show that EventFlash achieves a $12.4\times$ throughput improvement over the baseline (EventFlash-Zero) while maintaining comparable performance, and it supports long-range event-stream processing with up to 1,000 bins, far beyond the 5-bin limit of EventGPT. We believe EventFlash can serve as an efficient foundation model for event-based vision.
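To make the idea of adaptive temporal window aggregation concrete, the sketch below shows one plausible density-driven scheme, not the paper's actual module: contiguous event bins are greedily grouped so each window holds roughly an equal share of events (dense bursts get finer temporal resolution), and each window is summarized by a count-weighted mean of its bin tokens. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def adaptive_temporal_aggregate(bin_tokens, bin_counts, budget):
    """Compress T per-bin tokens into at most `budget` window tokens.

    Window boundaries are chosen greedily so each window holds roughly
    total_events / budget events; quiet stretches are merged coarsely
    while dense bursts keep finer temporal resolution.
    """
    T, D = bin_tokens.shape
    assert 1 <= budget <= T
    target = bin_counts.sum() / budget          # events per window
    windows, start, acc = [], 0, 0.0
    for t in range(T):
        acc += bin_counts[t]
        # close the window once it holds its share, keeping enough
        # remaining bins to form the windows still owed
        if (acc >= target and len(windows) < budget - 1
                and T - t - 1 >= budget - len(windows) - 1):
            windows.append((start, t + 1))
            start, acc = t + 1, 0.0
    windows.append((start, T))
    # one token per window: count-weighted mean of its bin tokens
    out = np.zeros((len(windows), D))
    for i, (s, e) in enumerate(windows):
        w = bin_counts[s:e].astype(float)
        w = w / w.sum() if w.sum() > 0 else np.full(e - s, 1.0 / (e - s))
        out[i] = w @ bin_tokens[s:e]
    return out, windows
```

With, say, 1,000 input bins and a small budget, this yields a fixed token count regardless of stream length, which is the property a long-range event MLLM needs.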
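The spatial side can be sketched in the same spirit. The toy implementation below is an assumed reading of "density-guided" token selection, not the paper's module: empty patches are always dropped, the densest patches up to a keep ratio survive, and plain scaled dot-product self-attention runs only on the survivors. The names `density_guided_attention` and `keep_ratio` are hypothetical.

```python
import numpy as np

def density_guided_attention(patch_tokens, patch_counts, keep_ratio=0.25):
    """Self-attention restricted to event-dense spatial patches.

    Patches with zero events (empty regions) are discarded outright;
    of the rest, at most ceil(keep_ratio * N) patches with the highest
    event counts are kept, and softmax attention runs on those only.
    """
    N, D = patch_tokens.shape
    k = max(1, int(np.ceil(keep_ratio * N)))
    nonzero = np.flatnonzero(patch_counts > 0)
    order = nonzero[np.argsort(-patch_counts[nonzero])]   # densest first
    keep = np.sort(order[:k])                 # indices of informative patches
    x = patch_tokens[keep]
    scores = x @ x.T / np.sqrt(D)             # scaled dot-product attention
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x, keep
```

Because attention cost is quadratic in token count, keeping only the dense quarter of patches cuts the attention FLOPs by roughly 16x on sparse scenes, which is consistent with the kind of throughput gain the abstract reports.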