Event cameras asynchronously capture brightness changes with low latency, high temporal resolution, and high dynamic range. Applying deep learning methods to these sensors, for classification or other tasks, typically requires large labeled datasets. Because labeled event data is scarce compared to the bulk of labeled RGB imagery, progress in event-based vision has remained limited. To reduce the dependency on labeled event data, we introduce Masked Event Modeling (MEM), a self-supervised pretraining framework for events. Our method pretrains a neural network on unlabeled events, which can originate from any event camera recording. The pretrained model is then finetuned on a downstream task, which improves overall performance while requiring fewer labels. Our method outperforms the state of the art on N-ImageNet, N-Cars, and N-Caltech101, increasing the object classification accuracy on N-ImageNet by 7.96%. We demonstrate that Masked Event Modeling is superior to RGB-based pretraining on a real-world dataset.
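To make the masking idea concrete, the following is a minimal sketch of how an input for masked pretraining on events might be prepared. It is an illustration under stated assumptions, not the method's actual implementation: the abstract does not specify the event representation, the patch size, the mask ratio, or the reconstruction target. Here we assume events are accumulated into a two-channel per-polarity count image and that square patches are masked uniformly at random. The function names `events_to_histogram`, `random_patch_mask`, and `apply_mask` are hypothetical.

```python
import numpy as np


def events_to_histogram(x, y, p, height, width):
    """Accumulate events into a 2-channel count image, one channel per polarity.

    Assumption: x, y are integer pixel coordinates and p is 0 or 1.
    """
    hist = np.zeros((2, height, width), dtype=np.float32)
    # np.add.at handles repeated (p, y, x) indices correctly (unbuffered add).
    np.add.at(hist, (p, y, x), 1.0)
    return hist


def random_patch_mask(height, width, patch, mask_ratio, rng):
    """Return a boolean grid of shape (height // patch, width // patch).

    True marks a patch that is hidden from the network. The mask ratio is a
    hypothetical hyperparameter, not a value taken from the abstract.
    """
    grid_h, grid_w = height // patch, width // patch
    n_patches = grid_h * grid_w
    n_masked = int(round(mask_ratio * n_patches))
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    return flat.reshape(grid_h, grid_w)


def apply_mask(hist, mask, patch):
    """Zero out masked patches; assumes height and width are multiples of patch."""
    # Expand the patch-level mask to pixel resolution.
    pixel_mask = np.repeat(np.repeat(mask, patch, axis=0), patch, axis=1)
    masked = hist.copy()
    masked[:, pixel_mask] = 0.0
    return masked, pixel_mask


# Toy usage: four synthetic events on an 8x8 sensor, half of the 4x4 patches masked.
rng = np.random.default_rng(0)
x = np.array([0, 1, 1, 7])
y = np.array([0, 2, 2, 7])
p = np.array([0, 1, 1, 0])
hist = events_to_histogram(x, y, p, height=8, width=8)
mask = random_patch_mask(8, 8, patch=4, mask_ratio=0.5, rng=rng)
masked_hist, pixel_mask = apply_mask(hist, mask, patch=4)
```

In a pretraining loop of this kind, a network would receive `masked_hist` and be trained to predict some target for the pixels where `pixel_mask` is true. No labels are involved at this stage, which is what allows recordings from any event camera to be used.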