Recent advancements in Korean large language models (LLMs) have spurred numerous benchmarks and evaluation methodologies, yet the lack of a standardized evaluation framework has led to inconsistent results and limited comparability. To address this, we introduce HRET (Haerae Evaluation Toolkit), an open-source, self-evolving evaluation framework tailored specifically for Korean LLMs. HRET unifies diverse evaluation methods, including logit-based scoring, exact-match, language-inconsistency penalization, and LLM-as-a-Judge assessments. Its modular, registry-based architecture integrates major benchmarks (HAE-RAE Bench, KMMLU, KUDGE, HRM8K) and multiple inference backends (vLLM, HuggingFace, OpenAI-compatible endpoints). With automated pipelines for continuous evolution, HRET provides a robust foundation for reproducible, fair, and transparent Korean NLP research.
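To illustrate the registry-based pattern the abstract refers to, the following is a minimal, hypothetical sketch: scoring methods register themselves under a name, and an evaluation routine looks them up at run time. The names (`EVALUATORS`, `register_evaluator`, `evaluate`) are illustrative assumptions, not HRET's actual API.

```python
# Hypothetical sketch of a registry-based evaluation pipeline in the spirit of
# a modular toolkit; names are illustrative assumptions, not HRET's real API.
from typing import Callable, Dict, List

EVALUATORS: Dict[str, Callable[[str, str], float]] = {}

def register_evaluator(name: str):
    """Decorator that adds a scoring function to the global registry."""
    def wrapper(fn: Callable[[str, str], float]) -> Callable[[str, str], float]:
        EVALUATORS[name] = fn
        return fn
    return wrapper

@register_evaluator("exact_match")
def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized prediction equals the reference, else 0.0.
    return float(prediction.strip() == reference.strip())

def evaluate(method: str, predictions: List[str], references: List[str]) -> float:
    # Look up the requested scorer by name and average its scores.
    scorer = EVALUATORS[method]
    scores = [scorer(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    preds = ["서울", "부산"]
    refs = ["서울", "대구"]
    print(evaluate("exact_match", preds, refs))  # 0.5
```

Under such a design, new scoring methods (e.g., a logit-based scorer or an LLM-as-a-Judge wrapper) and new benchmarks or backends can be added by registering them, without modifying the core evaluation loop.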