Improving Open-Domain Dialogue Evaluation with a Causal Inference Model

Effective evaluation methods remain a significant challenge for research on open-domain conversational dialogue systems. Explicit satisfaction ratings can be elicited from users, but users often do not provide ratings when asked, and those they give can be highly subjective. Post-hoc ratings by experts are an alternative, but these can be both expensive and complex to collect. Here, we explore the creation of automated methods for predicting both expert and user ratings of open-domain dialogues. We compare four different approaches. First, we train a baseline model using an end-to-end transformer to predict ratings directly from the raw dialogue text. The other three methods are variants of a two-stage approach in which we first extract interpretable features at the turn level that capture, among other aspects, user dialogue behaviors indicating contradiction, repetition, disinterest, compliments, or criticism. We project these features to the dialogue level and train a dialogue-level MLP regression model, a dialogue-level LSTM, and a novel causal inference model called counterfactual-LSTM (CF-LSTM) to predict ratings. The proposed CF-LSTM is a sequential model over turn-level features which predicts ratings using multiple regressors, selected according to hypotheses derived from the turn-level features. As a causal inference model, CF-LSTM aims to learn the underlying causes of a specific event, such as a low rating. We also bin the user ratings and perform classification experiments with all four models. In evaluation experiments on conversational data from the Alexa Prize SocialBot, we show that the CF-LSTM achieves the best performance on both dialogue rating prediction and rating classification.
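As a rough illustration of the two-stage idea, the sketch below pools turn-level behavior features into a single dialogue-level vector and bins a continuous rating into a class label. It is a minimal sketch under stated assumptions, not the method used in the paper: the abstract does not say how features are projected to the dialogue level or where the bin boundaries lie, so the mean pooling, the feature names in FEATURES, and the bin edges (2.5, 3.5) are all hypothetical choices made for this example.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence

# Hypothetical feature names, borrowed from the behaviors listed in the abstract.
FEATURES = ["contradiction", "repetition", "disinterest", "compliment", "criticism"]


def project_to_dialogue_level(turns: Sequence[Dict[str, float]]) -> List[float]:
    """Average each turn-level feature over all turns of one dialogue.

    Mean pooling is an assumption; the abstract does not specify the projection.
    """
    if not turns:
        return [0.0] * len(FEATURES)
    return [sum(t.get(name, 0.0) for t in turns) / len(turns) for name in FEATURES]


def bin_rating(rating: float, edges: Sequence[float] = (2.5, 3.5)) -> int:
    """Map a continuous rating to a class index; the edges are illustrative only."""
    return bisect_right(edges, rating)


# Toy dialogue: each dict holds the features detected in one user turn.
dialogue = [
    {"repetition": 1.0},
    {"compliment": 1.0},
    {"repetition": 1.0, "criticism": 1.0},
    {},
]
vector = project_to_dialogue_level(dialogue)  # input for a dialogue-level regressor
label = bin_rating(2.0)                       # class target for the classification setup
```

A dialogue-level regressor such as the MLP would consume a vector like this one, whereas the sequential models (LSTM and CF-LSTM) operate on the per-turn feature sequence rather than on its pooled summary.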
