Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks often assume this evidence has already been localized through images, short clips, or pre-segmented videos, leaving the retrieval-before-reasoning problem under-tested. We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding. MedHorizon preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions that jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. Its evidence is extremely sparse, with only 0.166% evidence frames on average, requiring models to search noisy procedural streams before interpreting and aggregating findings. We evaluate representative general-domain, medical-domain, and long-video MLLMs. The best model reaches only 41.1% accuracy, showing that current systems remain far from robust full-procedure understanding. Further analysis yields four key findings: performance does not scale reliably with more frames; evidence retrieval and clinical interpretation remain the primary bottlenecks; these bottlenecks are rooted in weak procedural reasoning and attention drift under redundancy; and generic sampling methods only partially balance local detail with global coverage. MedHorizon provides a rigorous testbed for MLLMs that retrieve sparse evidence and reason over complete clinical workflows.
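To give a feel for what 0.166% evidence frames implies for frame sampling, the following is a rough back-of-the-envelope sketch, not part of the benchmark itself. It assumes each sampled frame is drawn independently and uniformly, so that it is an evidence frame with probability equal to the average evidence fraction. Real evidence is likely clustered in time, so the numbers are only indicative. The function names and the frame budgets are illustrative assumptions.

```python
import math

# Average evidence-frame fraction reported in the abstract (0.166%).
EVIDENCE_FRACTION = 0.166 / 100


def hit_probability(evidence_fraction: float, num_frames: int) -> float:
    """Probability that at least one of `num_frames` sampled frames is an
    evidence frame, ASSUMING independent uniform draws (a simplification)."""
    return 1.0 - (1.0 - evidence_fraction) ** num_frames


def frames_for_target(evidence_fraction: float, target: float) -> int:
    """Smallest number of independently sampled frames whose hit probability
    reaches `target` under the same independence assumption."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - evidence_fraction))


if __name__ == "__main__":
    # Hypothetical frame budgets, chosen only for illustration.
    for budget in (32, 128, 512):
        p = hit_probability(EVIDENCE_FRACTION, budget)
        print(f"{budget:4d} frames -> P(at least one evidence frame) = {p:.3f}")
    print("frames for a 50% chance:", frames_for_target(EVIDENCE_FRACTION, 0.5))
```

Under this simplified model, a small uniform frame budget rarely touches any evidence at all. Catching one evidence frame is also only a necessary condition for answering a multi-hop question, not a sufficient one.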