DataCross: A Unified Benchmark and Agent Framework for Cross-Modal Heterogeneous Data Analysis

In real-world data science and enterprise decision-making, critical information is often fragmented between directly queryable structured sources (e.g., SQL, CSV) and "zombie data" locked in unstructured visual documents (e.g., scanned reports, invoice images). Existing data analytics agents are predominantly limited to structured data and fail to activate and correlate this high-value visual information, leaving a significant gap between their capabilities and industrial needs. To bridge this gap, we introduce DataCross, a benchmark and collaborative agent framework for unified, insight-driven analysis across heterogeneous data modalities. The benchmark, DataCrossBench, comprises 200 end-to-end analysis tasks across finance, healthcare, and other domains. It is constructed via a human-in-the-loop reverse-synthesis pipeline that ensures realistic complexity, cross-source dependency, and verifiable ground truth. Tasks are grouped into three difficulty tiers to evaluate agents' capabilities in visual table extraction, cross-modal alignment, and multi-step joint reasoning. We also propose the DataCrossAgent framework, inspired by the "divide-and-conquer" workflow of human analysts. It employs specialized sub-agents, each an expert on a specific data source, coordinated through a structured workflow of Intra-source Deep Exploration, Key Source Identification, and Contextual Cross-pollination. A novel reReAct mechanism enables robust code generation and debugging for factual verification. Experimental results show that DataCrossAgent achieves a 29.7% improvement in factuality over GPT-4o and exhibits superior robustness on high-difficulty tasks, effectively activating fragmented "zombie data" for insightful, cross-modal analysis.
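The three-stage coordination described above can be illustrated with a toy sketch. This is not the DataCrossAgent implementation: the class `SourceAgent`, the functions `identify_key_source` and `cross_pollinate`, the overlap-based heuristic for choosing the key source, and the sample records are all hypothetical, chosen only to show how per-source exploration, key source selection, and cross-source joining might fit together. Real sub-agents would be LLM-driven, and the scanned-invoice rows here stand in for the output of a visual table extraction step.

```python
from dataclasses import dataclass, field


@dataclass
class SourceAgent:
    """Hypothetical sub-agent responsible for exactly one data source."""
    name: str
    rows: list  # records loaded from SQL/CSV, or extracted from a scanned table
    findings: list = field(default_factory=list)

    def explore(self) -> None:
        """Stage 1 (intra-source exploration): summarize this source in isolation."""
        columns = sorted({key for row in self.rows for key in row})
        self.findings.append(f"{self.name}: {len(self.rows)} rows, columns={columns}")

    def keys(self, column: str) -> set:
        """Distinct values of the join column present in this source."""
        return {row[column] for row in self.rows if column in row}


def identify_key_source(agents, column):
    """Stage 2 (assumed heuristic): pick the source sharing the most keys with the others."""
    def overlap(agent):
        return sum(
            len(agent.keys(column) & other.keys(column))
            for other in agents
            if other is not agent
        )
    return max(agents, key=overlap)


def cross_pollinate(key_agent, agents, column):
    """Stage 3: join every other source's rows against the key source's rows."""
    index = {row[column]: row for row in key_agent.rows if column in row}
    merged = []
    for agent in agents:
        if agent is key_agent:
            continue
        for row in agent.rows:
            match = index.get(row.get(column))
            if match is not None:
                merged.append({**match, **row})
    return merged


# Illustrative data only: one structured ledger, one table "extracted" from
# scanned invoices, and one shipping CSV.
ledger = SourceAgent("ledger", [
    {"invoice_id": "A1", "booked": 100},
    {"invoice_id": "A2", "booked": 250},
    {"invoice_id": "A3", "booked": 75},
])
invoices = SourceAgent("scanned_invoices", [
    {"invoice_id": "A1", "invoiced": 100},
    {"invoice_id": "A2", "invoiced": 205},
])
shipping = SourceAgent("shipping", [
    {"invoice_id": "A3", "carrier": "X"},
    {"invoice_id": "A4", "carrier": "Y"},
])

agents = [ledger, invoices, shipping]
for agent in agents:
    agent.explore()

key = identify_key_source(agents, "invoice_id")
merged = cross_pollinate(key, agents, "invoice_id")

# A simple factual check across modalities: booked vs. invoiced amounts.
mismatches = [
    row["invoice_id"]
    for row in merged
    if "invoiced" in row and row["booked"] != row["invoiced"]
]
print(key.name, mismatches)
```

In this toy setup the ledger is selected as the key source because it shares identifiers with both other sources, and the final check flags the one invoice whose scanned amount disagrees with the booked amount. The point of the sketch is the control flow, not the heuristic: a discrepancy of this kind is only detectable once the visual and structured sources are aligned on a common key.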
