FCMBench-Video: Benchmarking Document Video Intelligence

Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. To obtain privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos paired with 11,322 expert-annotated question-answer instances (5,960 Chinese and 5,362 English), covering 28 document types across duration tiers from 20 s to 60 s. Evaluations on nine recent Video-MLLMs show that FCMBench-Video provides meaningful separation across systems and capabilities: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection provides a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating a benchmark that is neither saturated nor dominated by trivial cases. Together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and probing capability boundaries in authenticity-sensitive credit-domain applications.
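As a rough illustration of the composition step described above, the sketch below assembles atomic clips into one long-form timeline and records the temporal span each document occupies. It is a minimal sketch under stated assumptions, not the benchmark's actual pipeline: the names `AtomicClip`, `Segment`, and `compose_timeline`, the greedy selection rule, and the example document types are all hypothetical, and the degradation step is omitted.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicClip:
    """One reusable single-document recording (hypothetical schema)."""
    clip_id: str
    doc_type: str
    duration_s: float


@dataclass(frozen=True)
class Segment:
    """Span that one atomic clip occupies inside a composed video."""
    clip_id: str
    doc_type: str
    start_s: float
    end_s: float


def compose_timeline(clips, target_s, rng):
    """Greedily place shuffled clips back to back without exceeding target_s.

    Assumption: clips are concatenated with no gaps or transitions, so the
    recorded spans can serve as ground truth for temporal grounding.
    """
    order = list(clips)
    rng.shuffle(order)
    timeline, cursor = [], 0.0
    for clip in order:
        if cursor + clip.duration_s > target_s:
            continue  # clip does not fit in the remaining time; skip it
        end = cursor + clip.duration_s
        timeline.append(Segment(clip.clip_id, clip.doc_type, cursor, end))
        cursor = end
    return timeline


if __name__ == "__main__":
    # Illustrative clips only; document types and durations are made up.
    pool = [
        AtomicClip("a", "id_card", 8.0),
        AtomicClip("b", "bank_statement", 12.0),
        AtomicClip("c", "invoice", 10.0),
        AtomicClip("d", "contract", 15.0),
    ]
    for seg in compose_timeline(pool, target_s=30.0, rng=random.Random(0)):
        print(f"{seg.doc_type}: {seg.start_s:.0f}s to {seg.end_s:.0f}s")
```

Because every segment stores its start and end time, a question such as "when does the second document appear" can be answered directly from the composition record rather than from manual annotation of the final video.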
