Large language models (LLMs) have shown the potential to be integrated into people's daily lives. User preference is therefore the most critical criterion for assessing LLM performance in real-world scenarios. However, existing benchmarks mainly measure model accuracy with multiple-choice questions, which limits our understanding of model capabilities in real applications. We fill this gap by proposing SuperCLUE, a comprehensive Chinese benchmark named after CLUE, another popular Chinese LLM benchmark. SuperCLUE comprises three sub-tasks: actual user queries and ratings collected from an LLM battle platform (CArena), open-ended questions with single-turn and multi-turn dialogues (OPEN), and closed-ended questions that share their stems with the open-ended single-turn questions (CLOSE). Our study shows that accuracy on closed-ended questions is insufficient to reflect the human preferences observed on open-ended questions. At the same time, the two can complement each other in predicting actual user preferences. We also demonstrate that GPT-4 is a reliable judge for automatically evaluating human preferences on open-ended questions in a Chinese context. Our benchmark will be released at https://www.CLUEbenchmarks.com