Recent advancements in Korean large language models (LLMs) have spurred numerous benchmarks and evaluation methodologies, yet the lack of a standardized evaluation framework has led to inconsistent results and limited comparability. To address this, we introduce HRET (Haerae Evaluation Toolkit), an open-source, self-evolving evaluation framework tailored specifically for Korean LLMs. HRET unifies diverse evaluation methods, including logit-based scoring, exact-match, language-inconsistency penalization, and LLM-as-a-Judge assessments. Its modular, registry-based architecture integrates major benchmarks (HAE-RAE Bench, KMMLU, KUDGE, HRM8K) and multiple inference backends (vLLM, HuggingFace, OpenAI-compatible endpoints). With automated pipelines for continuous evolution, HRET provides a robust foundation for reproducible, fair, and transparent Korean NLP research.
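To illustrate the registry-based pattern the abstract refers to, the following is a minimal, hypothetical sketch: scoring methods register themselves under a name, and an evaluation routine looks them up at run time. The names (`EVALUATORS`, `register_evaluator`, `evaluate`) are illustrative assumptions, not HRET's actual API.

```python
# Hypothetical sketch of a registry-based evaluation pipeline in the spirit of
# a modular toolkit; names are illustrative assumptions, not HRET's real API.
from typing import Callable, Dict, List

EVALUATORS: Dict[str, Callable[[str, str], float]] = {}

def register_evaluator(name: str):
    """Decorator that adds a scoring function to the global registry."""
    def wrapper(fn: Callable[[str, str], float]) -> Callable[[str, str], float]:
        EVALUATORS[name] = fn
        return fn
    return wrapper

@register_evaluator("exact_match")
def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized prediction equals the reference, else 0.0.
    return float(prediction.strip() == reference.strip())

def evaluate(method: str, predictions: List[str], references: List[str]) -> float:
    # Look up the requested scorer by name and average its scores.
    scorer = EVALUATORS[method]
    scores = [scorer(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    preds = ["서울", "부산"]
    refs = ["서울", "대구"]
    print(evaluate("exact_match", preds, refs))  # 0.5
```

Under such a design, new scoring methods (e.g., a logit-based scorer or an LLM-as-a-Judge wrapper) and new benchmarks or backends can be added by registering them, without modifying the core evaluation loop.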