Recent advances in Korean large language models (LLMs) have driven numerous benchmarks and evaluation methods, yet inconsistent protocols cause performance gaps of up to 10 percentage points across institutions. Overcoming these reproducibility gaps does not mean enforcing a one-size-fits-all evaluation; rather, effective benchmarking requires diverse experimental approaches and a framework robust enough to support them. To this end, we introduce HRET (Haerae Evaluation Toolkit), an open-source, registry-based framework that unifies Korean LLM assessment. HRET integrates major Korean benchmarks, multiple inference backends, and multi-method evaluation, with language-consistency enforcement to ensure genuinely Korean outputs. Its modular registry design also enables rapid incorporation of new datasets, methods, and backends, so the toolkit adapts to evolving research needs. Beyond standard accuracy metrics, HRET incorporates Korean-focused output analyses: morphology-aware Type-Token Ratio (TTR) for evaluating lexical diversity, and systematic keyword-omission detection for identifying missing concepts. These targeted analyses provide diagnostic insights into language-specific behaviors, helping researchers pinpoint morphological and semantic shortcomings in model outputs and guiding focused improvements in Korean LLM development.
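To make the morphology-aware TTR concrete, here is a minimal sketch, assuming KoNLPy's Okt analyzer as a stand-in for whatever morphological analyzer HRET actually uses. Because Korean is agglutinative, surface words bundle stems with particles and endings, so computing TTR over stemmed morphemes rather than whitespace tokens avoids counting inflected variants of the same lexeme as distinct types.

```python
# Minimal sketch of a morphology-aware Type-Token Ratio (TTR).
# Assumption: KoNLPy's Okt analyzer stands in for HRET's actual tokenizer.
from konlpy.tag import Okt

_analyzer = Okt()

def morphology_aware_ttr(text: str) -> float:
    """TTR computed over stemmed morphemes instead of surface words.

    Whitespace tokenization of Korean inflates the token count (and so
    deflates TTR); stemming maps inflected forms such as "갔다" and
    "갑니다" onto a single type ("가다").
    """
    # norm=True collapses spelling variants; stem=True reduces
    # inflected forms to their dictionary stem.
    morphemes = _analyzer.morphs(text, norm=True, stem=True)
    if not morphemes:
        return 0.0
    return len(set(morphemes)) / len(morphemes)
```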
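Keyword-omission detection can be sketched in the same spirit. The function name, inputs, and all-morphemes-present matching rule below are illustrative assumptions rather than HRET's documented API; the point is that matching on stemmed morphemes keeps inflectional variation from masking a keyword that is in fact present.

```python
# Minimal sketch of keyword-omission detection. Names and the matching
# rule are illustrative assumptions, not HRET's actual interface.
from konlpy.tag import Okt

_analyzer = Okt()

def missing_keywords(output: str, reference_keywords: list[str]) -> list[str]:
    """Return the reference keywords absent from a model output."""
    output_morphs = set(_analyzer.morphs(output, norm=True, stem=True))
    omitted = []
    for keyword in reference_keywords:
        keyword_morphs = _analyzer.morphs(keyword, norm=True, stem=True)
        # Count the keyword as present only if every one of its
        # stemmed morphemes occurs somewhere in the output.
        if not all(m in output_morphs for m in keyword_morphs):
            omitted.append(keyword)
    return omitted
```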