Low-light video enhancement (LLVE) remains challenging because low illumination severely degrades the information available in the input. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxiliary modalities such as event streams and infrared images. However, these methods typically assume that the auxiliary modalities are available at inference time, which is often infeasible in real-world scenarios. To address this limitation, we propose AMNet, a unified multimodal framework for LLVE that supports flexible, modality-agnostic inference in which auxiliary modalities may be unavailable. To handle missing modalities, we introduce a Spatial-Spectral Dual-Gated Translator that learns the correspondence between auxiliary modalities and RGB inputs and produces implicit auxiliary representations that support robust enhancement. In addition, to facilitate the learning of this cross-modal correspondence, we conduct large-scale multimodal pretraining on an RGB-only dataset paired with synthetic auxiliary modalities. Extensive experiments demonstrate that AMNet can handle arbitrary combinations of modalities at inference time and achieves superior LLVE performance when auxiliary modalities are absent. Code and models are available on the project page.