FCMBench-Video: Benchmarking Document Video Intelligence

Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. To obtain privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos, paired with 11,322 expert-annotated question-answer instances (5,960 Chinese and 5,362 English) that cover 28 document types across duration tiers from 20s to 60s. Evaluations on nine recent Video-MLLMs show that FCMBench-Video provides meaningful separation across systems and capabilities: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection provides a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating a benchmark that is neither saturated nor dominated by trivial cases. Together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and for probing capability boundaries in authenticity-sensitive credit-domain applications.
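As a rough illustration of the composition step described above, the following sketch lays single-document clips end to end and records the temporal span each document occupies in the resulting long-form video. It is a minimal sketch under stated assumptions, not the benchmark's actual pipeline: the names `AtomicClip`, `Span`, `compose`, and `count_documents`, the document-type labels, and the trim-the-last-clip policy are all hypothetical, and the degradation step is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicClip:
    doc_type: str      # hypothetical label, e.g. "id_card"
    duration_s: float  # length of the recorded single-document clip


@dataclass(frozen=True)
class Span:
    doc_type: str
    start_s: float
    end_s: float


def compose(clips, target_s):
    """Place clips back to back until target_s is reached.

    The last clip that would overrun is trimmed so the timeline never
    exceeds target_s (an assumed policy, chosen for simplicity).
    """
    spans = []
    cursor = 0.0
    for clip in clips:
        if cursor >= target_s:
            break
        end = min(cursor + clip.duration_s, target_s)
        spans.append(Span(clip.doc_type, cursor, end))
        cursor = end
    return spans


def count_documents(spans):
    """Ground-truth answer for a counting question: distinct document types shown."""
    return len({span.doc_type for span in spans})


clips = [
    AtomicClip("id_card", 8.0),
    AtomicClip("bank_statement", 10.0),
    AtomicClip("payslip", 6.0),
]
timeline = compose(clips, target_s=20.0)
for span in timeline:
    print(f"{span.doc_type}: {span.start_s:.1f}s to {span.end_s:.1f}s")
print("documents shown:", count_documents(timeline))
```

Because the spans are fixed at composition time, temporal-grounding and counting labels of this kind could in principle be derived from the timeline itself rather than annotated by hand; whether FCMBench-Video does so is not stated in the abstract.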
