New Natural Language Processing~(NLP) benchmarks are urgently needed to keep pace with the rapid development of large language models (LLMs). We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises 249,587 multiple-choice questions across 516 diverse disciplines spanning 13 subjects, and is accompanied by Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions. We evaluate 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed the average performance of humans in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management. We anticipate that Xiezhi will help analyze important strengths and shortcomings of LLMs. The benchmark is released at~\url{https://github.com/MikeGu721/XiezhiBenchmark}.