Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent ``reality gap''. To bridge this gap, we introduce \texttt{Learn-to-Ask}, a general, simulator-free framework for learning and deploying proactive dialogue agents \textit{directly from offline expert data}, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the \textbf{observed future} of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks. The resulting policy is trained to output a structured \texttt{(action, state\_assessment)} tuple, governing both \textbf{what to ask} and, crucially, \textbf{when to stop}. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of \texttt{Learn-to-Ask} on a real-world medical dataset, using LLMs of varying sizes up to 32B parameters. Our approach culminates in the successful deployment of the resulting models in a live, large-scale online AI service. In rigorous in-house evaluations, the deployed model achieved performance superior even to that of human experts, demonstrating our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented applications.
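For concreteness, the following is a minimal, hypothetical sketch of the structured \texttt{(action, state\_assessment)} output described above; the names and stop rule are illustrative assumptions, not the paper's actual implementation.
\begin{verbatim}
from dataclasses import dataclass
from enum import Enum


class StateAssessment(Enum):
    """Policy's judgment of whether enough information has been gathered."""
    CONTINUE = "continue"      # more information is still needed
    SUFFICIENT = "sufficient"  # stop asking and conclude the dialogue


@dataclass
class PolicyOutput:
    """Structured (action, state_assessment) tuple emitted each turn."""
    action: str                         # the next question to ask the user
    state_assessment: StateAssessment   # the "when to stop" signal


def should_stop(output: PolicyOutput) -> bool:
    # The stopping decision is read directly from the assessment field.
    return output.state_assessment is StateAssessment.SUFFICIENT


# Illustrative single turn of a proactive medical-intake dialogue.
turn = PolicyOutput(
    action="How long have you had this headache, and does anything relieve it?",
    state_assessment=StateAssessment.CONTINUE,
)
assert not should_stop(turn)
\end{verbatim}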