The history of the Korean language is characterized by a discrepancy between its spoken and written forms and by a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP because accessible historical corpora are lacking. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset that spans 1,300 years and six languages and covers under-represented writing systems such as Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. The corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation that began around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce out-of-vocabulary rates up to 51 times higher. By capturing the history of the Korean language, this work provides a foundational resource for quantitative diachronic analysis. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as of archaic writing systems.
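As a rough illustration of the two kinds of measurement mentioned above (script share over time and out-of-vocabulary rate), the following minimal sketch counts Hangul versus Hanja characters by Unicode block and computes the fraction of tokens missing from a vocabulary. The function names, the choice of Unicode ranges, and the set-based vocabulary are assumptions made for this example; they are not the procedure used to build or analyze the corpus.

```python
from collections import Counter

# Unicode blocks chosen for this sketch (an assumption, not the paper's definition).
HANGUL_RANGES = [(0xAC00, 0xD7A3), (0x1100, 0x11FF), (0x3130, 0x318F)]
HANJA_RANGES = [(0x4E00, 0x9FFF), (0x3400, 0x4DBF), (0xF900, 0xFAFF)]


def _in_ranges(ch, ranges):
    """Return True if the character's code point falls in any (lo, hi) range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def script_shares(text):
    """Share of Hangul vs. Hanja among the Hangul and Hanja characters in text."""
    counts = Counter()
    for ch in text:
        if _in_ranges(ch, HANGUL_RANGES):
            counts["hangul"] += 1
        elif _in_ranges(ch, HANJA_RANGES):
            counts["hanja"] += 1
    total = counts["hangul"] + counts["hanja"]
    if total == 0:
        return {"hangul": 0.0, "hanja": 0.0}
    return {name: counts[name] / total for name in ("hangul", "hanja")}


def oov_rate(tokens, vocab):
    """Fraction of tokens that do not appear in the vocabulary set."""
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok not in vocab) / len(tokens)
```

Applying `script_shares` to documents grouped by decade would give a curve of Hanja share over time, and dividing the `oov_rate` of one text sample by that of another would give a ratio comparable in spirit to the "times higher" figure, under whatever tokenization is chosen.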