Large Language Models (LLMs) pretrained on massive corpora exhibit remarkable capabilities across a wide range of tasks; however, non-English languages have received limited attention in this field of research. To address this gap and assess the proficiency of language models in the Korean language and culture, we present HAE-RAE Bench, covering 6 tasks including vocabulary, history, and general knowledge. Our evaluation of language models on this benchmark highlights the potential advantages of employing Large Language-Specific Models (LLSMs) over a comprehensive, universal model like GPT-3.5. Remarkably, our study reveals that models approximately 13 times smaller than GPT-3.5 can exhibit similar performance levels in terms of language-specific knowledge retrieval. This observation underscores the importance of homogeneous corpora for training professional-level language-specific models. In contrast, we also observe a perplexing performance dip in these smaller LMs when they are tasked with generating structured answers.