The advent of Large Language Models (LLMs) has significantly advanced dialogue systems. However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge. Previous benchmarks have primarily focused on single-turn dialogues or provided coarse-grained and incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues. To address this issue, we introduce MT-Bench-101, specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. Based on a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives and observing differing trends in LLM performance across dialogue turns within various tasks. Further analysis indicates that neither common alignment techniques nor chat-specific designs have led to obvious improvements in the multi-turn abilities of LLMs. Extensive case studies suggest that our designed tasks accurately assess the corresponding multi-turn abilities.