Effective persuasive dialogue agents adapt their strategies to individual users, accounting for the evolution of their psychological states and intentions throughout conversations. We present a personality-aware reinforcement learning approach comprising three main modules: (1) a Strategy-Oriented Interaction Framework, which serves as an agenda-based strategy controller that selects strategy-level actions and generates responses via Maximal Marginal Relevance (MMR) retrieval to ensure contextual relevance, diversity, and scalable data generation; (2) Personality-Aware User Representation Learning, which predicts an 81-dimensional mixed-type embedding at each turn from recent exchanges and appends it to the reinforcement learning state; and (3) a Dueling Double DQN (D3QN) model with reward prediction, in which the policy is conditioned on dialogue history and turn-level personality estimates and trained using a composite reward incorporating agreement intent, donation amount, and change-of-mind penalties. We use an agenda-based LLM simulation pipeline to generate diverse interactions and infer turn-level personality estimates from the generated utterances. Experiments on the PersuasionForGood (P4G) dataset augmented with simulated dialogues reveal three main findings: (i) turn-level personality conditioning improves policy adaptability and cumulative persuasion rewards; (ii) LLM-driven simulation enhances generalization to unseen user behaviors; and (iii) incorporating a change-of-mind penalty reduces post-agreement retractions while slightly improving donation outcomes. These results demonstrate that structured interaction, dynamic personality estimation, and behaviorally informed rewards together yield more effective persuasive policies.
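As a rough illustration of the MMR retrieval step described in module (1), the sketch below scores each candidate response by relevance to the dialogue context minus redundancy with already-selected responses. It is a minimal sketch under stated assumptions: the helper name `mmr_select`, the trade-off weight `lam`, and the use of cosine similarity over precomputed embeddings are our illustrative choices, not details given in the abstract.

```python
import torch
import torch.nn.functional as F

def mmr_select(query_emb: torch.Tensor, cand_embs: torch.Tensor,
               k: int = 3, lam: float = 0.7) -> list[int]:
    """Maximal Marginal Relevance: iteratively pick candidates that are
    relevant to the query yet non-redundant with earlier picks.

    score(c) = lam * sim(query, c) - (1 - lam) * max_{s in selected} sim(c, s)
    """
    # relevance of every candidate to the dialogue-context embedding
    sim_q = F.cosine_similarity(cand_embs, query_emb.unsqueeze(0), dim=-1)  # (n,)
    selected: list[int] = []
    remaining = list(range(cand_embs.shape[0]))
    while remaining and len(selected) < k:
        if selected:
            # redundancy: max similarity of each remaining candidate to picks so far
            red = F.cosine_similarity(
                cand_embs[remaining].unsqueeze(1),   # (r, 1, d)
                cand_embs[selected].unsqueeze(0),    # (1, s, d)
                dim=-1,
            ).max(dim=1).values                      # (r,)
        else:
            red = torch.zeros(len(remaining))
        scores = lam * sim_q[remaining] - (1.0 - lam) * red
        best = remaining[int(scores.argmax())]
        selected.append(best)
        remaining.remove(best)
    return selected
```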
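Likewise, the sketch below shows one plausible reading of modules (2) and (3): the turn-level personality estimate is concatenated with the dialogue-history embedding to form the state, fed to a dueling Q-head over strategy actions, and trained against a composite reward of the described form. All layer sizes, reward weights, and function names here are illustrative placeholders, not values from the paper; only the 81-dimensional personality vector and the three reward components come from the abstract.

```python
import torch
import torch.nn as nn

PERSONALITY_DIM = 81  # mixed-type personality embedding, per the abstract

class DuelingQNet(nn.Module):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, history_dim: int, num_strategies: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(history_dim + PERSONALITY_DIM, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                    # V(s)
        self.advantage = nn.Linear(hidden, num_strategies)   # A(s, a)

    def forward(self, history_emb: torch.Tensor,
                personality: torch.Tensor) -> torch.Tensor:
        # condition the policy on dialogue history + turn-level personality
        state = torch.cat([history_emb, personality], dim=-1)
        h = self.trunk(state)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)          # (batch, num_strategies)

def composite_reward(agree_intent: float, donation: float, changed_mind: bool,
                     w_agree: float = 1.0, w_don: float = 1.0,
                     w_cm: float = 2.0) -> float:
    """Agreement intent + donation amount - change-of-mind penalty.
    The relative weights are placeholders; the abstract does not specify them."""
    return w_agree * agree_intent + w_don * donation - w_cm * float(changed_mind)
```

The dueling decomposition separates state value from per-action advantage, which tends to stabilize learning when many strategy actions have similar value; the "double" part of D3QN then has the online network select the argmax action while a target network evaluates it, reducing the overestimation bias of vanilla Q-learning.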