Large Language Models (LLMs) pretrained on massive corpora exhibit remarkable capabilities across a wide range of tasks; however, non-English languages have received limited attention in this field of research. To address this gap and assess the proficiency of language models in the Korean language and culture, we present HAE-RAE Bench, covering 6 tasks including vocabulary, history, and general knowledge. Our evaluation of language models on this benchmark highlights the potential advantages of employing Large Language-Specific Models (LLSMs) over a comprehensive, universal model like GPT-3.5. Remarkably, our study reveals that models approximately 13 times smaller than GPT-3.5 can exhibit similar performance levels in terms of language-specific knowledge retrieval. This observation underscores the importance of homogeneous corpora for training professional-level language-specific models. At the same time, we observe a perplexing performance dip in these smaller LMs when they are tasked with generating structured answers.