Large language models have exhibited significant performance improvements across a wide range of tasks. However, evaluating them becomes increasingly difficult as they generate more fluent and coherent content. Existing multilingual benchmarks are often translated from English, and may therefore carry Western cultural biases that prevent them from accurately assessing other languages and cultures. To address this research gap, we introduce KULTURE Bench, an evaluation framework specifically designed for Korean culture, featuring datasets of cultural news, idioms, and poetry. It assesses language models' cultural comprehension and reasoning capabilities at the word, sentence, and paragraph levels. Using KULTURE Bench, we evaluated models trained on different language corpora and analyzed the results comprehensively. The results show that there is still substantial room for improvement in the models' understanding of texts related to the deeper aspects of Korean culture.