Video instance segmentation (VIS) for low-light content remains highly challenging for both humans and machines due to noise, blur, and other adverse conditions. The lack of large-scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low-light videos and consequently perform poorly even after finetuning. In this paper, we introduce \textbf{ELVIS} (\textbf{E}nhance \textbf{L}ow-Light for \textbf{V}ideo \textbf{I}nstance \textbf{S}egmentation), a framework that enables domain adaptation of state-of-the-art VIS models to low-light scenarios. ELVIS comprises three components: an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile estimation network (VDP-Net), and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performance by up to \textbf{+3.7AP} on the synthetic low-light YouTube-VIS 2019 dataset and outperforms two-stage baselines by at least \textbf{+2.8AP} on real low-light videos. Code and dataset are available at: \href{https://joannelin168.github.io/research/ELVIS}{https://joannelin168.github.io/research/ELVIS}