In this paper, we study multi-reward reinforcement learning for jointly optimizing multiple text qualities in natural language generation. We focus on the task of counselor reflection generation, where we optimize the generators to simultaneously improve the fluency, coherence, and reflection quality of generated counselor responses. We introduce two novel bandit methods, DynaOpt and C-DynaOpt, which follow the broad strategy of combining rewards into a single value and optimizing them simultaneously. Specifically, we employ non-contextual and contextual multi-armed bandits to dynamically adjust the weights of the individual rewards during training. Through automatic and human evaluations, we show that DynaOpt and C-DynaOpt outperform existing naive and bandit baselines, showcasing their potential for enhancing language models.
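To make the idea of bandit-driven reward weighting concrete, the following is a minimal sketch of a non-contextual variant. It rests on several assumptions not stated in the abstract: the bandit is Exp3, each arm corresponds to boosting the weight of one reward, and the bandit's feedback is a scalar in [0, 1] measuring how much the generator improved after training with the adjusted weights. The class name `Exp3WeightBandit`, its methods, and the parameters `gamma` and `step` are hypothetical and are not taken from DynaOpt or C-DynaOpt.

```python
import math
import random


class Exp3WeightBandit:
    """Illustrative Exp3 bandit: arm i means "increase the weight of reward i".

    This is an assumed scheme for demonstration, not the paper's algorithm.
    """

    def __init__(self, n_rewards, gamma=0.1, step=0.1, seed=0):
        self.n = n_rewards
        self.gamma = gamma                      # exploration rate (assumed value)
        self.step = step                        # weight increment per pull (assumed value)
        self.arm_weights = [1.0] * n_rewards    # Exp3 internal arm weights
        self.reward_weights = [1.0 / n_rewards] * n_rewards  # start uniform
        self.rng = random.Random(seed)

    def probs(self):
        """Exp3 sampling distribution: exploitation mixed with uniform exploration."""
        total = sum(self.arm_weights)
        return [
            (1 - self.gamma) * w / total + self.gamma / self.n
            for w in self.arm_weights
        ]

    def select_arm(self):
        """Sample which reward to emphasize next."""
        return self.rng.choices(range(self.n), weights=self.probs())[0]

    def boost(self, arm):
        """Raise one reward's weight, then renormalize so the weights sum to 1."""
        self.reward_weights[arm] += self.step
        total = sum(self.reward_weights)
        self.reward_weights = [w / total for w in self.reward_weights]

    def update(self, arm, gain):
        """Exp3 update with an importance-weighted estimate; gain must lie in [0, 1]."""
        estimate = gain / self.probs()[arm]
        self.arm_weights[arm] *= math.exp(self.gamma * estimate / self.n)

    def combine(self, rewards):
        """Collapse per-quality rewards into the single scalar used for RL."""
        return sum(w * r for w, r in zip(self.reward_weights, rewards))


# Usage: three rewards standing in for fluency, coherence, reflection quality.
bandit = Exp3WeightBandit(n_rewards=3)
for _ in range(5):
    arm = bandit.select_arm()
    bandit.boost(arm)
    scalar_reward = bandit.combine([0.9, 0.6, 0.3])  # placeholder reward scores
    # ... a policy-gradient step on scalar_reward would go here ...
    gain = 0.5  # placeholder for the measured improvement after training
    bandit.update(arm, gain)
```

A contextual variant would additionally condition the arm choice on features of the current training state (for example, recent average values of each reward); that extension is omitted here because the abstract does not specify the context representation.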