Despite the rapid development of large language models (LLMs) for the Korean language, there remains a clear lack of benchmark datasets that test the requisite Korean cultural and linguistic knowledge. Because many existing Korean benchmark datasets are derived from their English counterparts through translation, they often overlook differences in cultural context. The few benchmark datasets that are sourced from Korean data and capture cultural knowledge offer only narrow tasks such as bias and hate speech detection. To address this gap, we introduce a benchmark of Cultural and Linguistic Intelligence in Korean (CLIcK), a dataset comprising 1,995 QA pairs. CLIcK sources its data from official Korean exams and textbooks and partitions the questions into eleven categories under the two main categories of language and culture. For each instance in CLIcK, we provide fine-grained annotation of which cultural and linguistic knowledge is required to answer the question correctly. Using CLIcK, we test 13 language models to assess their performance. Our evaluation uncovers insights into their performance across the categories, as well as the diverse factors affecting their comprehension. CLIcK offers the first large-scale, comprehensive, Korean-centric analysis of LLMs' proficiency in Korean culture and language.