The Natural Conversation Benchmark (NC-Bench) introduces a new approach to evaluating the general conversational competence of large language models (LLMs). Unlike prior benchmarks that focus on the content of model behavior, NC-Bench focuses on the form and structure of natural conversation. Grounded in the IBM Natural Conversation Framework (NCF), NC-Bench comprises three distinct sets: (1) the basic set evaluates fundamental sequence management practices, such as answering inquiries, repairing responses, and closing conversational pairs; (2) the retrieval-augmented generation (RAG) set applies the same sequence management patterns as the first set but incorporates information-seeking via RAG; (3) the complex request set extends to requests involving more intricate sequence management patterns. Each set tests a model's ability to produce contextually appropriate conversational actions in response to characteristic interaction patterns. Initial evaluations across six open-source models and 14 interaction patterns show that models perform well on basic answering tasks, struggle more with repair tasks (especially repeat), have mixed performance on closing sequences, and find complex multi-turn requests most challenging. By operationalizing fundamental principles of human conversation, NC-Bench provides a lightweight, extensible, and theory-grounded framework for assessing and improving the conversational abilities of LLMs beyond topical or task-specific benchmarks.