Large Language Models (LLMs) pretrained on massive corpora exhibit remarkable capabilities across a wide range of tasks; however, non-English languages have received limited attention in this field of research. To address this gap and assess the proficiency of language models in the Korean language and culture, we present HAE-RAE Bench, covering 6 tasks including vocabulary, history, and general knowledge. Our evaluation of language models on this benchmark highlights the potential advantages of employing Large Language-Specific Models (LLSMs) over a comprehensive, universal model like GPT-3.5. Remarkably, our study reveals that models approximately 13 times smaller than GPT-3.5 can exhibit similar performance levels in terms of language-specific knowledge retrieval. This observation underscores the importance of homogeneous corpora for training professional-level language-specific models. At the same time, we observe a perplexing performance dip in these smaller LMs when they are tasked with generating structured answers.