Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits, but existing methods face two critical limitations: over-reliance on pre-collected user data, and short-horizon biases in reinforcement learning (RL) that neglect the long-term value of a dialogue. To address these, we propose a novel long-horizon RL framework that integrates online personalization with Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO). In a two-agent game paradigm, a user agent constructs dynamic environments via style mimicry (learning user-specific conversational traits) and active termination (predicting turn-level termination probabilities that serve as immediate rewards), forming an iterative cycle that drives the dialogue agent to explore user interests more deeply. AT-GRPO reinterprets dialogue trajectories as trees and introduces adaptive observation ranges: rather than expanding the full tree, which incurs exponential overhead, each node aggregates rewards only within a stage-aware range, with larger ranges supporting early-stage topic exploration and smaller ranges facilitating late-stage dialogue maintenance. This design reduces the rollout budget from exponential to polynomial in the dialogue length while preserving long-term reward capture. Extensive experiments demonstrate our framework's superior performance, sample efficiency, and robustness.
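To make the stage-aware aggregation concrete, below is a minimal Python sketch under stated assumptions: a linear schedule for the observation range and sibling-averaged rewards at each depth. All names (`Node`, `observation_range`, `aggregate_return`, `group_relative_advantages`) and the specific schedule are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One dialogue turn in the trajectory tree (names are illustrative)."""
    reward: float                     # immediate reward, e.g. predicted termination probability
    depth: int                        # turn index from the root of the dialogue
    children: List["Node"] = field(default_factory=list)

def observation_range(depth: int, max_depth: int,
                      r_max: int = 4, r_min: int = 1) -> int:
    """Stage-aware range: wide early (topic exploration), narrow late
    (dialogue maintenance). The linear schedule is an assumption; the
    abstract only states that the range shrinks with dialogue stage."""
    frac = depth / max(max_depth, 1)
    return max(r_min, round(r_max - frac * (r_max - r_min)))

def aggregate_return(node: Node, max_depth: int, gamma: float = 1.0) -> float:
    """Sum (optionally discounted) rewards over descendants within the node's
    stage-aware range, instead of expanding the full subtree."""
    horizon = observation_range(node.depth, max_depth)
    total, frontier = node.reward, node.children
    for step in range(1, horizon + 1):
        if not frontier:
            break
        # Average over siblings so branching does not inflate the return.
        total += gamma ** step * sum(c.reward for c in frontier) / len(frontier)
        frontier = [g for c in frontier for g in c.children]
    return total

def group_relative_advantages(siblings: List[Node], max_depth: int) -> List[float]:
    """GRPO-style advantage: normalize each sibling's return against the group."""
    returns = [aggregate_return(n, max_depth) for n in siblings]
    mean = sum(returns) / len(returns)
    std = (sum((r - mean) ** 2 for r in returns) / len(returns)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in returns]
```

Because each node only inspects a bounded window of descendants rather than its full subtree, the total aggregation cost grows polynomially with the number of dialogue turns, which is the source of the budget reduction claimed above.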