Document Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained, open-source document agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Trained on the synthesized data, our models demonstrate their effectiveness on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Further analysis provides valuable insights into agentic tool design and synthetic data construction.