As language models are often deployed as chatbot assistants, the ability to converse in a user's first language has become an important quality. Although these models are trained on a wide range of languages, their proficiency in low-resource languages such as Korean has not been comprehensively evaluated. In this work, we introduce KoDialogBench, a benchmark designed to assess language models' conversational capabilities in Korean. To this end, we collect native Korean dialogues on daily topics from public sources, or translate dialogues from other languages. We then structure these conversations into diverse test datasets, spanning tasks from dialogue comprehension to response selection. Leveraging the proposed benchmark, we conduct extensive evaluations and analyses of various language models to measure their foundational understanding of Korean dialogues. Experimental results indicate that there is significant room for improvement in models' conversation skills. Furthermore, our in-depth comparisons across different language models highlight the effectiveness of recent training techniques in enhancing conversational proficiency. We anticipate that KoDialogBench will promote progress toward conversation-aware Korean language models.