We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The agent must follow a multi-stage Standard Operating Procedure (SOP) and strict guardrails (no over-promising and no hallucinations), while remaining human-like and effective over long, multi-turn dialogues. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training method that combines three heterogeneous rewards: a preference-trained reward model (RM); an LLM-as-a-judge (RJ) for nuanced behaviors (e.g., emotional value and SOP compliance); and rule-based reward functions (RF), mainly regex-based, for deterministic checks on numerics, formatting, and guardrails. In an expert consensus evaluation (three human experts; 30 online conversations and 45 curated bad cases), REPO improves the average dialogue rating to 4.63 (+0.33 over GRPO) and raises the share of conversations with at least one excellent response to 66.67% (+23.34 pp over GRPO), while achieving a 93.33% bad-case fix rate with 75.56% clean fixes. In a production A/B test on 9,653 real customer conversations (vs. an intent-driven dialogue system), REPO improves the response rate by +12.14 pp and the task success rate by +5.94 pp (p<0.001).
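To make the idea of heterogeneous rewards concrete, the following is a minimal sketch, not the REPO formulation itself. It assumes that the RM and RJ scores are already available as floats, that the three signals are combined by a weighted sum, and that the combined rewards are normalized within a group of sampled responses in the GRPO style. The regex patterns, the weights, and the function names `rule_reward`, `combine`, and `group_advantages` are all hypothetical and chosen only for illustration.

```python
import re
from statistics import mean, pstdev

# Hypothetical guardrail pattern: wording that over-promises to the customer.
OVER_PROMISE = re.compile(r"\b(guarantee[ds]?|promise[ds]?)\b", re.IGNORECASE)
# Hypothetical numeric pattern: dollar amounts quoted in a response.
PRICE = re.compile(r"\$(\d+(?:\.\d+)?)")


def rule_reward(response: str, allowed_prices: set[float]) -> float:
    """Deterministic RF-style check: -1.0 on any violation, otherwise 1.0."""
    if OVER_PROMISE.search(response):
        return -1.0
    quoted = {float(p) for p in PRICE.findall(response)}
    # Every quoted price must come from the allowed set (no invented numbers).
    if not quoted <= allowed_prices:
        return -1.0
    return 1.0


def combine(rm: float, rj: float, rf: float,
            w_rm: float = 0.4, w_rj: float = 0.4, w_rf: float = 0.2) -> float:
    """Weighted sum of the three signals; the weights are assumptions."""
    return w_rm * rm + w_rj * rj + w_rf * rf


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    allowed = {120.0}
    responses = [
        "We can offer $120 per night.",
        "I guarantee $120 and a full refund.",
    ]
    # Placeholder RM and RJ scores; in practice these come from models.
    rm_scores = [0.7, 0.8]
    rj_scores = [0.6, 0.9]
    rewards = [
        combine(rm, rj, rule_reward(resp, allowed))
        for resp, rm, rj in zip(responses, rm_scores, rj_scores)
    ]
    print(group_advantages(rewards))
```

In this sketch the second response scores higher on the placeholder RM and RJ signals but still receives the lower combined reward, because the rule check penalizes the over-promising phrase. A hard gate (zeroing the reward on any guardrail violation) would be an equally plausible design; the weighted sum is used here only because it is the simplest to read.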