The advent of Large Language Models (LLMs) has drastically enhanced dialogue systems. However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge. Previous benchmarks have primarily focused on single-turn dialogues or provided coarse-grained and incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues. To address this issue, we introduce MT-Bench-101, specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. By conducting a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives and observing differing trends in LLM performance across dialogue turns within various tasks. Further analysis indicates that neither utilizing common alignment techniques nor chat-specific designs has led to obvious enhancements in the multi-turn abilities of LLMs. Extensive case studies suggest that our designed tasks accurately assess the corresponding multi-turn abilities. The data and code are available at \url{https://github.com/mtbench101/mt-bench-101}.