Event cameras provide robust visual signals under fast motion and challenging illumination conditions thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it difficult to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to capture the temporal structure unique to events. Our approach outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks, including object recognition, segmentation, and depth estimation. Together, VFM-guided alignment and generative sequence modeling yield a semantically rich, temporally aware event model that generalizes robustly across domains.
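To make the first stage more concrete, the following is a minimal sketch of what a joint regression-contrastive alignment objective could look like. The abstract does not specify the exact form of the loss, so everything here is an assumption for illustration: the regression term is taken to be a mean squared error between L2-normalized features, the contrastive term is taken to be an InfoNCE loss over paired event and image features in a batch, and the names `alignment_loss`, `temperature`, and `reg_weight` are hypothetical. NumPy is used only to keep the example self-contained; it computes the loss value and does not perform any training.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def alignment_loss(event_feats, image_feats, temperature=0.07, reg_weight=1.0):
    """Hypothetical joint regression-contrastive loss (illustrative only).

    event_feats: (batch, dim) features from the trainable event encoder.
    image_feats: (batch, dim) features from the frozen VFM; row i is the
                 image counterpart of event row i.
    Returns (total, regression_term, contrastive_term).
    """
    e = l2_normalize(event_feats)
    v = l2_normalize(image_feats)  # frozen targets, treated as constants

    # Regression term (assumed form): squared distance between paired features.
    reg = np.mean(np.sum((e - v) ** 2, axis=-1))

    # Contrastive term (assumed form): InfoNCE, where each event feature
    # should match its own image feature against the others in the batch.
    logits = e @ v.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    con = -np.mean(np.diag(log_probs))

    return reg_weight * reg + con, reg, con


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_feats = rng.normal(size=(8, 16))
    event_feats = rng.normal(size=(8, 16))
    total, reg, con = alignment_loss(event_feats, image_feats)
    print(f"total={total:.3f} regression={reg:.3f} contrastive={con:.3f}")
```

In this sketch the regression term pulls each event feature toward its paired image feature, while the contrastive term additionally pushes it away from the image features of other samples in the batch. How the two terms are actually weighted and defined in GEP would need to be taken from the full paper.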