Deep neural networks for visual perception are highly susceptible to domain shift, which poses a critical challenge for real-world deployment under conditions that differ from the training data. To address this domain generalization challenge, we propose a cross-modal framework under the learning using privileged information (LUPI) paradigm for training a robust, single-modality RGB model. We leverage event cameras as a source of privileged information, available only during training. The two modalities exhibit complementary characteristics: the RGB stream is semantically dense but domain-dependent, whereas the event stream is sparse yet more domain-invariant. Direct feature alignment between them is therefore suboptimal, as it forces the RGB encoder to mimic the sparse event representation, thereby losing semantic detail. To overcome this, we introduce Privileged Event-based Predictive Regularization (PEPR), which reframes LUPI as a predictive problem in a shared latent space. Instead of enforcing direct cross-modal alignment, we train the RGB encoder with PEPR to predict event-based latent features, distilling robustness without sacrificing semantic richness. The resulting standalone RGB model consistently improves robustness to day-to-night and other domain shifts, outperforming alignment-based baselines across object detection and semantic segmentation.
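To make the predictive-regularization idea concrete, below is a minimal PyTorch-style sketch of a PEPR-like training signal, under our own assumptions: the abstract does not specify the architecture or objective, so the `PredictiveRegularizer` MLP head, the cosine prediction loss, the stop-gradient on the event branch, and all encoder and dimension choices here are illustrative placeholders, not the paper's actual method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveRegularizer(nn.Module):
    """Sketch of a PEPR-style head: maps RGB features into the shared
    latent space and predicts the event-based latent there (hypothetical)."""
    def __init__(self, rgb_dim: int, latent_dim: int, hidden_dim: int = 512):
        super().__init__()
        # Assumed architecture: a small MLP predictor on pooled RGB features.
        self.predictor = nn.Sequential(
            nn.Linear(rgb_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, rgb_feat: torch.Tensor) -> torch.Tensor:
        return self.predictor(rgb_feat)

def pepr_loss(rgb_feat, evt_latent, predictor):
    """Predict the privileged event latent from RGB features rather than
    aligning the two feature spaces directly. The event branch is detached,
    so gradients shape only the RGB side (assumed design choice)."""
    pred = predictor(rgb_feat)
    target = evt_latent.detach()  # privileged, training-only signal
    # Cosine-style prediction loss; the paper's exact objective may differ.
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()

# --- illustrative training step (all names and shapes are placeholders) ---
B, rgb_dim, latent_dim = 8, 256, 128
rgb_encoder = nn.Linear(3 * 32 * 32, rgb_dim)     # stand-in RGB encoder
evt_encoder = nn.Linear(2 * 32 * 32, latent_dim)  # stand-in event encoder
predictor = PredictiveRegularizer(rgb_dim, latent_dim)

rgb = torch.randn(B, 3 * 32 * 32)  # flattened RGB input (toy data)
evt = torch.randn(B, 2 * 32 * 32)  # flattened event-voxel input (toy data)

rgb_feat = rgb_encoder(rgb)
evt_latent = evt_encoder(evt)

loss_reg = pepr_loss(rgb_feat, evt_latent, predictor)
# total loss = task loss (detection / segmentation) + lambda * loss_reg
```

The contrast with an alignment-based baseline is the predictor head plus the asymmetric stop-gradient: a direct-alignment loss such as `F.mse_loss(rgb_feat, evt_latent)` would pull the RGB features toward the sparse event representation, whereas here only a learned prediction of the event latent is penalized, leaving the RGB feature space free to stay semantically dense. At test time the event encoder and predictor are dropped, and the standalone RGB model is deployed.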