Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation. However, existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which comprises two components. TVIR-Bench is a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals. TVIR-Agent is a hierarchical multi-agent framework that serves as a strong baseline: it constructs outlines, retrieves images, generates charts with traceable sources, and composes reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.