We study how to improve social conversational agents by learning from natural dialogue between users and a deployed model, without extra annotations. To implicitly measure the quality of a machine-generated utterance, we leverage signals from the future human utterances in the collected dialogue episodes, such as user response length, sentiment, and reaction. Our experiments use the publicly released deployment data from BlenderBot (Xu et al., 2023). Human evaluation indicates that our new models improve over baseline responses; however, we find that some proxy signals can also lead to more generations with undesirable properties. For example, optimizing for conversation length can lead to more controversial or unfriendly generations compared to the baseline, whereas optimizing for positive sentiment or reaction can decrease these behaviors.