We describe an approach for aligning an LLM-based dialogue agent based on global (i.e., dialogue-level) rewards, while also taking into account naturally-occurring multimodal signals. At a high level, our approach (dubbed GELI) learns a local, turn-level reward model by decomposing the human-provided Global Explicit (GE) session-level reward, using Local Implicit (LI) multimodal reward signals to crossmodally shape the reward decomposition step. This decomposed reward model is then used as part of the standard RLHF pipeline to improve an LLM-based dialogue agent. We run quantitative and qualitative human studies to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.
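To make the two reward sources concrete, the sketch below shows one plausible way to combine them: a turn-level reward head trained so that its per-turn outputs sum to the session-level GE rating, with an auxiliary term that shapes each turn reward toward a per-turn LI multimodal signal. All names (`TurnRewardModel`, `geli_loss`, the weighting `alpha`) and the specific squared-error forms are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a GE + LI reward-decomposition objective.
# Assumes turn embeddings and per-turn LI signals (e.g., affect scores)
# are precomputed elsewhere; names are illustrative only.

class TurnRewardModel(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Maps a turn embedding to a scalar turn-level reward.
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, turn_embeddings: torch.Tensor) -> torch.Tensor:
        # turn_embeddings: (num_turns, hidden_dim) -> (num_turns,) rewards
        return self.head(turn_embeddings).squeeze(-1)


def geli_loss(model: TurnRewardModel,
              turn_embeddings: torch.Tensor,
              global_reward: torch.Tensor,
              li_signal: torch.Tensor,
              alpha: float = 1.0) -> torch.Tensor:
    """Decompose the Global Explicit reward with Local Implicit shaping.

    turn_embeddings: (num_turns, hidden_dim) encodings of each dialogue turn
    global_reward:   scalar session-level human rating (GE)
    li_signal:       (num_turns,) naturally-occurring multimodal signal (LI)
    """
    turn_rewards = model(turn_embeddings)
    # GE term: predicted turn rewards should sum to the session-level reward.
    ge_loss = (turn_rewards.sum() - global_reward) ** 2
    # LI term: crossmodal shaping, nudging each turn reward toward the
    # locally observed implicit signal.
    li_loss = ((turn_rewards - li_signal) ** 2).mean()
    return ge_loss + alpha * li_loss
```

Under this reading, the trained turn-level model supplies dense per-turn rewards that slot into a standard RLHF loop (e.g., PPO) in place of a single sparse session-level score.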