Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and finally uses an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.
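
The three-stage flow described in the abstract (screen, group into contexts, aggregate per context) can be illustrated with a toy sketch. This is not the DiCE implementation: the abstract does not specify how any stage works, so the per-frame abnormality scores, the temporal-gap grouping rule, the mean-score aggregation, and every name and threshold below (`screen_candidates`, `weave_contexts`, `converge_evidence`, `max_gap`, `min_clip_score`) are assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClipJudgment:
    frame_indices: List[int]  # candidate frames grouped into one context
    key_frame: int            # frame with the highest score in the context
    score: float              # aggregated clip-level score


def screen_candidates(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Stage 1 (assumed): keep frames whose abnormality score passes a threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def weave_contexts(candidates: Sequence[int], max_gap: int = 30) -> List[List[int]]:
    """Stage 2 (assumed): group candidates that lie within max_gap frames of each other."""
    contexts: List[List[int]] = []
    for idx in candidates:
        if contexts and idx - contexts[-1][-1] <= max_gap:
            contexts[-1].append(idx)   # same event: extend the current context
        else:
            contexts.append([idx])     # far from the last candidate: new context
    return contexts


def converge_evidence(context: Sequence[int], scores: Sequence[float]) -> ClipJudgment:
    """Stage 3 (assumed): average the scores in a context and pick its best frame."""
    mean_score = sum(scores[i] for i in context) / len(context)
    key_frame = max(context, key=lambda i: scores[i])
    return ClipJudgment(list(context), key_frame, mean_score)


def summarize(scores: Sequence[float], threshold: float = 0.5,
              max_gap: int = 30, min_clip_score: float = 0.6) -> List[ClipJudgment]:
    """Run the three stages and keep only contexts with strong aggregated evidence."""
    candidates = screen_candidates(scores, threshold)
    contexts = weave_contexts(candidates, max_gap)
    judgments = [converge_evidence(c, scores) for c in contexts]
    return [j for j in judgments if j.score >= min_clip_score]


# Toy input: 100 mostly normal frames with one two-frame event and one weak isolated hit.
toy_scores = [0.1] * 100
toy_scores[10], toy_scores[12], toy_scores[70] = 0.9, 0.8, 0.55
summary = summarize(toy_scores)
```

In this toy input, frames 10 and 12 fall into one context and survive aggregation, while the isolated weak hit at frame 70 passes screening but is dropped at the clip level. That mirrors the motivation stated in the abstract: single ambiguous frames are less reliable than evidence pooled over a context.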
