Large language models (LLMs) have shown impressive capabilities across a wide range of natural language tasks, yet evaluating their alignment with human preferences remains a challenge. To address this, we propose a comprehensive human evaluation framework that assesses how well LLMs follow instructions on diverse real-world tasks. We construct a hierarchical task tree of 7 major areas, over 200 categories, and over 800 tasks, covering capabilities such as question answering, reasoning, multi-turn dialogue, and text generation, so that LLMs can be evaluated both broadly and in depth. We also design detailed evaluation standards and processes to support consistent, unbiased judgments from human evaluators. We release a test set of over 3,000 instances spanning different difficulty levels and knowledge domains. Together, these provide a standardized methodology for evaluating human alignment in LLMs in both English and Chinese. We further analyze the feasibility of automating parts of the evaluation with a strong LLM (GPT-4). The framework supports thorough assessment of LLMs as they are integrated into real-world applications. The task tree, the TencentLLMEval dataset, and the evaluation methodology are publicly available and have proven effective in assessing the performance of Tencent Hunyuan LLMs. We hope this release facilitates benchmarking of progress toward safe and human-aligned LLMs.