Customizing persuasive conversations to specific users, with respect to the outcome of interest, achieves better persuasion results. However, existing persuasive conversation systems rely on predefined persuasive strategies and struggle to dynamically adjust dialogues to the evolving states of individual users during interactions. This limitation restricts a system's ability to deliver flexible, dynamic conversations and leads to suboptimal persuasion outcomes. In this paper, we present a novel approach that tracks a user's latent personality dimensions (LPDs) during an ongoing persuasive conversation and generates tailored counterfactual utterances based on these LPDs to optimize the overall persuasion outcome. In particular, our method leverages a Bidirectional Conditional Generative Adversarial Network (BiCoGAN) in tandem with a Dialogue-based Personality Prediction Regression (DPPR) model to generate counterfactual data, enabling the system to formulate alternative persuasive utterances better suited to the user. We then use a Dueling Double Deep Q-Network (D3QN) to learn policies for the optimized selection of system utterances over the counterfactual data. Experimental results on the PersuasionForGood dataset demonstrate the superiority of our approach over the existing method, BiCoGAN. The cumulative rewards and Q-values produced by our method surpass ground-truth benchmarks, showcasing the efficacy of employing counterfactual reasoning and LPDs to optimize reinforcement-learning policies in online interactions.
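To make the policy-learning component concrete, the following is a minimal NumPy sketch of the two ideas behind a D3QN: a dueling head that decomposes Q(s, a) into a state value V(s) and a mean-centered advantage A(s, a), and a double-DQN target in which the online network selects the next action while the target network evaluates it. The dimensions, the random weights, and the idea of concatenating a dialogue embedding with LPD estimates into the state are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: the state could concatenate a dialogue
# embedding with the estimated LPDs; actions index candidate utterances.
STATE_DIM, N_ACTIONS, HIDDEN = 16, 5, 32

def init_params(seed):
    """Random weights for a tiny dueling Q-network (illustrative only)."""
    g = np.random.default_rng(seed)
    return {
        "W1": g.normal(0, 0.1, (STATE_DIM, HIDDEN)),
        "Wv": g.normal(0, 0.1, (HIDDEN, 1)),         # value stream V(s)
        "Wa": g.normal(0, 0.1, (HIDDEN, N_ACTIONS)), # advantage stream A(s, a)
    }

def q_values(params, s):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    h = np.tanh(s @ params["W1"])
    v = h @ params["Wv"]
    a = h @ params["Wa"]
    return v + a - a.mean(axis=1, keepdims=True)

def double_dqn_target(online, target, r, s_next, gamma=0.99):
    """Double-DQN target: the online net picks the greedy next action,
    the target net supplies its value, decoupling selection from evaluation."""
    a_star = np.argmax(q_values(online, s_next), axis=1)
    q_tgt = q_values(target, s_next)
    return r + gamma * q_tgt[np.arange(len(r)), a_star]

online, target = init_params(0), init_params(1)
s_next = rng.normal(size=(4, STATE_DIM))   # batch of 4 next states
r = np.ones(4)                             # batch of rewards
y = double_dqn_target(online, target, r, s_next)
print(y.shape)  # (4,)
```

The mean-centering of the advantage stream makes the V/A decomposition identifiable (the per-state mean of Q over actions equals V), which is the standard dueling-architecture trick.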