Despite the rapid development of large language models (LLMs) for the Korean language, there remains an obvious lack of benchmark datasets that test the requisite Korean cultural and linguistic knowledge. Because many existing Korean benchmark datasets are derived from English counterparts through translation, they often overlook differing cultural contexts. The few benchmark datasets sourced from Korean data that do capture cultural knowledge offer only narrow tasks, such as bias and hate speech detection. To address this gap, we introduce a benchmark of Cultural and Linguistic Intelligence in Korean (CLIcK), a dataset comprising 1,995 QA pairs. CLIcK sources its data from official Korean exams and textbooks, partitioning the questions into eleven categories under the two main categories of language and culture. For each instance in CLIcK, we provide a fine-grained annotation of which cultural and linguistic knowledge is required to answer the question correctly. Using CLIcK, we test 13 language models to assess their performance. Our evaluation uncovers insights into their performance across the categories, as well as the diverse factors affecting their comprehension. CLIcK offers the first large-scale, comprehensive, Korean-centric analysis of LLMs' proficiency in Korean culture and language.