Agentic Retrieval-Augmented Generation (RAG) empowers large language models to autonomously plan and retrieve information for complex problem-solving. However, the development of robust agents is hindered by the scarcity of high-quality training data that reflects the noise and complexity of real-world retrieval environments. Conventional manual annotation is unscalable and often fails to capture the dynamic reasoning strategies required to handle retrieval failures. To bridge this gap, we introduce RAGShaper, a novel data synthesis framework designed to automate the construction of RAG tasks and robust agent trajectories. RAGShaper incorporates an InfoCurator to build dense information trees enriched with adversarial distractors spanning Perception and Cognition levels. Furthermore, we propose a constrained navigation strategy that forces a teacher agent to confront these distractors, thereby eliciting trajectories that explicitly demonstrate error correction and noise rejection. Comprehensive experiments confirm that models trained on our synthesized corpus significantly outperform existing baselines, exhibiting superior robustness in noise-intensive and complex retrieval tasks.
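As an illustration only, the following minimal Python sketch shows one way an information tree with labeled distractors, and a constrained walk over it, might be represented. The abstract does not specify data structures or algorithms, so `InfoNode`, `constrained_trajectory`, the action labels, and the example facts are all hypothetical and are not RAGShaper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class InfoNode:
    """One retrieved piece of information in a (hypothetical) information tree."""
    text: str
    is_distractor: bool = False
    # Assumed labels for distractors: "perception" or "cognition".
    level: Optional[str] = None
    children: List["InfoNode"] = field(default_factory=list)


def constrained_trajectory(root: InfoNode) -> List[Tuple[str, str]]:
    """Walk the tree along clean nodes, but record an explicit rejection
    step for every distractor child encountered along the way.

    This is a toy stand-in for a constrained navigation strategy: the
    walker is not allowed to skip distractors silently.
    """
    steps: List[Tuple[str, str]] = []
    node: Optional[InfoNode] = root
    while node is not None:
        steps.append(("read", node.text))
        # Forced confrontation: every distractor child is inspected and rejected.
        for child in node.children:
            if child.is_distractor:
                steps.append((f"reject[{child.level}]", child.text))
        # Continue along the first clean child, if any.
        clean = [c for c in node.children if not c.is_distractor]
        node = clean[0] if clean else None
    return steps


# Toy example with made-up facts.
root = InfoNode(
    "Who founded X?",
    children=[
        InfoNode(
            "X was founded by A.",
            children=[
                InfoNode(
                    "A founded X in 1990, so A was born in 1990.",
                    is_distractor=True,
                    level="cognition",
                ),
                InfoNode("A was born in 1950."),
            ],
        ),
        InfoNode("Y was founded by B.", is_distractor=True, level="perception"),
    ],
)

steps = constrained_trajectory(root)
for action, text in steps:
    print(f"{action}: {text}")
```

In this sketch the resulting trajectory interleaves ordinary reads with explicit rejection steps, which is the kind of noise-handling behavior the abstract says the synthesized trajectories should demonstrate. A real teacher agent would decide what to reject through reasoning rather than from a ground-truth flag.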