Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. To obtain privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos paired with 11,322 expert-annotated question--answer instances, covering 28 document types across duration tiers from 20s to 60s, with 5,960 Chinese and 5,362 English instances. Evaluations on nine recent Video-MLLMs show that FCMBench-Video provides meaningful separation across systems and capabilities: Counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection provides a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating that the benchmark is neither saturated nor dominated by trivial cases. Together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and for probing capability boundaries in authenticity-sensitive credit-domain applications.
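The composition step of the atomic-acquisition workflow can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `AtomicClip` and `Span` schemas, the `compose_timeline` function, and the sampling policy are hypothetical and are not taken from the benchmark's actual pipeline. Controlled degradations are omitted. The sketch only shows how single-document clips could be laid end to end so that each document's temporal span is known by construction.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class AtomicClip:
    """One reusable single-document recording (hypothetical schema)."""
    clip_id: str
    doc_type: str
    duration_s: float


@dataclass(frozen=True)
class Span:
    """Interval during which one document is shown in the composed video."""
    clip_id: str
    doc_type: str
    start_s: float
    end_s: float


def compose_timeline(pool, target_s, rng):
    """Lay shuffled clips end to end until target_s seconds are filled.

    Clips are drawn without replacement. The final clip is trimmed so the
    timeline ends exactly at target_s; if the pool is too short, the
    timeline simply ends early. Returns the list of ground-truth spans.
    """
    spans = []
    cursor = 0.0
    for clip in rng.sample(pool, len(pool)):
        if cursor >= target_s:
            break
        end = min(cursor + clip.duration_s, target_s)
        spans.append(Span(clip.clip_id, clip.doc_type, cursor, end))
        cursor = end
    return spans


if __name__ == "__main__":
    # Illustrative pool: ten 8-second clips, composed into a 20-second video.
    pool = [AtomicClip(f"clip{i}", "id_card", 8.0) for i in range(10)]
    for span in compose_timeline(pool, 20.0, random.Random(0)):
        print(span)
```

Because the spans are produced by the composition itself, questions about when a document appears or how many documents are shown could be checked against them without a separate timestamp-annotation pass, under the assumptions stated above.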