Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.