E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

Robotic Vision-Language-Action (VLA) models generalize well for open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. We present E-VLA, an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. Instead of reconstructing images from events, E-VLA directly leverages motion and structural cues in event streams to preserve semantic perception and perception-action consistency under adverse conditions. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world, synchronized RGB-event-action manipulation dataset across diverse tasks and illumination settings. We also propose lightweight event integration strategies that remain compatible with pretrained VLA models, and we study event windowing and fusion for stable deployment. Experiments show that even a simple parameter-free fusion, i.e., overlaying accumulated event maps onto RGB images, substantially improves robustness in dark and blur-heavy scenes: on Pick-Place at 20 lux, success increases from 0% (image-only) to 60% with overlay fusion and to 90% with our event adapter; under severe motion blur (1000 ms exposure), Pick-Place improves from 0% to 20-25%, and Sorting from 5% to 32.5%. Overall, E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models, pointing toward robust embodied intelligence beyond conventional frame-based imaging. Code and dataset will be available at https://github.com/JJayzee/E-VLA.
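
The parameter-free overlay fusion mentioned above can be illustrated with a minimal sketch. It is not the released E-VLA implementation: the event layout (rows of x, y, timestamp, polarity), the function names `accumulate_events` and `overlay_events`, the fixed time window, and the red/blue color coding are all assumptions made for illustration.

```python
import numpy as np

def accumulate_events(events, height, width, t_start, t_end):
    """Sum signed event polarities per pixel over the window [t_start, t_end).

    `events` is assumed to be an (N, 4) array-like with columns
    x, y, timestamp, polarity (positive = brightness increase).
    """
    ev = np.asarray(events, dtype=np.float64).reshape(-1, 4)
    in_window = (ev[:, 2] >= t_start) & (ev[:, 2] < t_end)
    x = ev[in_window, 0].astype(np.intp)
    y = ev[in_window, 1].astype(np.intp)
    polarity = np.where(ev[in_window, 3] > 0, 1, -1)
    acc = np.zeros((height, width), dtype=np.int32)
    # np.add.at accumulates correctly when the same pixel fires several times.
    np.add.at(acc, (y, x), polarity)
    return acc

def overlay_events(rgb, acc):
    """Paint pixels with net positive events red and net negative events blue.

    The color coding is an illustrative assumption; the input image is
    not modified.
    """
    fused = rgb.copy()
    fused[acc > 0] = (255, 0, 0)
    fused[acc < 0] = (0, 0, 255)
    return fused
```

For a DAVIS346 sensor (346 x 260 pixels), one would call `accumulate_events(events, 260, 346, t_start, t_end)` and pass the result with the synchronized RGB frame to `overlay_events`. Because the fused output is still an ordinary three-channel image, it can be fed to an image-based policy without architectural changes, which is what makes this kind of fusion parameter-free.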
