A Transformer-based Response Evaluator for Open-Domain Spoken Conversation

Many open-domain dialogue systems rely on multiple response generators, any of which can contribute a response to the dialogue in a particular context. Thus the ability to compare potential responses and then select the best one plays an important role in ensuring a dialogue system is coherent and engaging. Dialogue coherence goes beyond simply remaining on topic: some trivia may be on topic and engaging when mentioned out of the blue, but may not be coherent and grounded in the context of the conversation. We carry out experiments on response selection in the Athena system, an Alexa Prize SocialBot that has dedicated content and multiple topic-specific response generators for a large number of topics. First, we collect a corpus of Athena conversations with live human traffic, where potential responses from all enabled response generators are logged and subsequently annotated for response quality. We compare several off-the-shelf response ranking methods for open-domain dialogue to Athena-Heuristic, a heuristic response ranker that was field-tested in Athena during the third Alexa Prize competition. We also compare these to Athena-RR, a transformer-based response ranker that we train on our Athena conversations. Athena-RR uses both the conversational context and the dialogue state to rank the potential responses. We find that Athena-RR, with a Recall@1 of 70.79%, outperforms Athena-Heuristic and all of the off-the-shelf rankers by a large margin. We then conduct a live A/B study comparing Athena-Heuristic to Athena-RR in 6,358 conversations with Alexa users. We show that Athena-RR leads to significantly longer conversations that receive significantly higher user ratings than the heuristic rule-based ranker.
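For readers unfamiliar with the Recall@1 metric reported above, the following is a minimal sketch of one common way to compute it for response ranking: the fraction of turns in which the ranker's top-scored candidate is one that annotators marked as acceptable. The function name `recall_at_1`, the input layout, and the binary labels are assumptions made for illustration; the abstract does not specify the exact definition or annotation scheme used for Athena-RR.

```python
from typing import Sequence


def recall_at_1(
    scores_per_turn: Sequence[Sequence[float]],
    labels_per_turn: Sequence[Sequence[int]],
) -> float:
    """Fraction of turns whose top-scored candidate is labeled acceptable (1).

    Hypothetical layout: scores_per_turn[i][j] is the ranker's score for
    candidate j at turn i, and labels_per_turn[i][j] is 1 if that candidate
    was annotated as a good response, else 0.
    """
    if len(scores_per_turn) != len(labels_per_turn):
        raise ValueError("scores and labels must cover the same turns")
    if not scores_per_turn:
        raise ValueError("at least one turn is required")

    hits = 0
    for scores, labels in zip(scores_per_turn, labels_per_turn):
        if len(scores) != len(labels) or not scores:
            raise ValueError("each turn needs matching, non-empty candidates")
        # Index of the highest-scoring candidate for this turn.
        top = max(range(len(scores)), key=scores.__getitem__)
        hits += 1 if labels[top] == 1 else 0
    return hits / len(scores_per_turn)


# Toy data (not from the paper): four turns, two of which are ranked correctly.
toy_scores = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.4], [0.1, 0.3, 0.2]]
toy_labels = [[1, 0], [1, 0], [0, 1], [0, 1, 0]]
toy_recall = recall_at_1(toy_scores, toy_labels)
```

Under this definition, a Recall@1 of 70.79% would mean that the top-ranked response was an acceptable one in roughly seven out of ten annotated turns.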
