YONA: You Only Need One Adjacent Reference-frame for Accurate and Fast Video Polyp Detection

Accurate polyp detection is essential for assisting clinical rectal cancer diagnoses. Colonoscopy videos contain richer information than still images, making them a valuable resource for deep learning methods. Considerable effort has gone into video polyp detection through multi-frame temporal/spatial aggregation. However, unlike common fixed-camera video, the moving camera in colonoscopy videos causes rapid jitter, which leads to unstable training for existing video detection models. Additionally, the concealed nature of some polyps and the complex background environment further hinder the performance of existing video detectors. In this paper, we propose the \textbf{YONA} (\textbf{Y}ou \textbf{O}nly \textbf{N}eed one \textbf{A}djacent Reference-frame) method, an efficient end-to-end training framework for video polyp detection. YONA fully exploits the information of one previous adjacent frame and conducts polyp detection on the current frame without multi-frame collaboration. Specifically, for the foreground, YONA adaptively aligns the current frame's channel activation patterns with those of its adjacent reference frame according to their foreground similarity. For the background, YONA conducts background dynamic alignment guided by the inter-frame difference to eliminate the invalid features produced by drastic spatial jitter. Moreover, YONA applies cross-frame contrastive learning during training, leveraging the ground-truth bounding boxes to improve the model's perception of polyps and background. Quantitative and qualitative experiments on three challenging public benchmarks demonstrate that our proposed YONA outperforms previous state-of-the-art competitors by a large margin in both accuracy and speed.
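To make the two alignment ideas above concrete, the following is a minimal pure-Python sketch. It is not the YONA implementation: the abstract does not give the formulas, so the cosine-similarity weighting, the sigmoid channel gates, and the difference-based suppression gate are all illustrative assumptions, as are the function names `align_foreground` and `align_background`. Feature maps are represented as a list of channels, each a flat list of activations.

```python
import math


def channel_descriptor(feat):
    """Global-average-pool each channel into one scalar (assumed descriptor)."""
    return [sum(ch) / len(ch) for ch in feat]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def align_foreground(cur, ref):
    """Hypothetical foreground step: reweight the current frame's channels
    with gates derived from the reference frame, scaled by how similar the
    two frames' channel descriptors are."""
    d_cur = channel_descriptor(cur)
    d_ref = channel_descriptor(ref)
    # Clamp to [0, 1]: dissimilar frames contribute no reference guidance.
    sim = min(1.0, max(0.0, cosine_similarity(d_cur, d_ref)))
    gates = [sigmoid(d) for d in d_ref]
    # sim = 0 leaves the current frame unchanged; sim = 1 applies the full gate.
    return [[v * ((1.0 - sim) + sim * g) for v in ch]
            for ch, g in zip(cur, gates)]


def align_background(cur, ref):
    """Hypothetical background step: suppress positions where the two frames
    differ strongly, treating large inter-frame differences as jitter."""
    out = []
    for ch_cur, ch_ref in zip(cur, ref):
        diff = [abs(c - r) for c, r in zip(ch_cur, ch_ref)]
        max_diff = max(diff) or 1.0  # avoid dividing by zero for identical frames
        out.append([c * (1.0 - d / max_diff) for c, d in zip(ch_cur, diff)])
    return out
```

In this sketch, only one reference frame is ever passed in, which mirrors the single-adjacent-frame design described above. A real detector would apply learned layers to backbone feature tensors rather than fixed gates on Python lists.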
