Large Language Models (LLMs) have become a popular interface for human-AI interaction, supporting information seeking and task assistance through natural, multi-turn dialogue. In multi-turn dialogues, context-dependent user intent evolves across interactions, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Existing studies typically follow static rewrite-retrieve-generate pipelines that optimize each stage separately and overlook the joint optimization of mixed-initiative actions. Although recent developments in deep search agents demonstrate the effectiveness of jointly optimizing retrieval and generation via reasoning, these approaches focus on single-turn scenarios and may lack the ability to handle multi-turn interactions. We introduce a conversational agent that interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through reinforcement learning (RL) with rewards tailored to evolving user goals. Experimental results on four widely used conversational benchmarks demonstrate the effectiveness of our method, which surpasses several strong existing baselines.