We introduce a morpheme-aware subword tokenization method that uses sub-character decomposition to address the challenges of applying Byte Pair Encoding (BPE) to Korean, a language with rich morphology and a unique writing system. Our approach balances linguistic accuracy with computational efficiency in Pre-trained Language Models (PLMs). Our evaluations show that the method performs well overall and notably improves results on the syntactic task NIKL-CoLA. This suggests that integrating morpheme type information can enhance the syntactic and semantic capabilities of language models, and that incorporating linguistic insights beyond standard morphological analysis can improve performance further.
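To illustrate what sub-character decomposition means for Korean, the following is a minimal sketch that splits precomposed Hangul syllables into their constituent jamo using the standard Unicode arithmetic for the Hangul Syllables block. It is not the tokenizer described above: the function names `decompose_syllable` and `decompose` are hypothetical, and the sketch assumes plain Unicode conjoining jamo as the sub-character units, which may differ from the decomposition scheme actually used. Morpheme analysis and BPE training are outside its scope.

```python
# Constants from the Unicode Standard's Hangul syllable composition rules.
S_BASE = 0xAC00                      # first precomposed syllable, "가"
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11172 precomposed syllables


def decompose_syllable(ch: str) -> list[str]:
    """Split one precomposed Hangul syllable into conjoining jamo.

    Characters outside the Hangul Syllables block are returned unchanged.
    (Hypothetical helper for illustration only.)
    """
    index = ord(ch) - S_BASE
    if not 0 <= index < S_COUNT:
        return [ch]
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, tail = divmod(rest, T_COUNT)
    jamo = [chr(L_BASE + lead), chr(V_BASE + vowel)]
    if tail:  # tail index 0 means "no final consonant"
        jamo.append(chr(T_BASE + tail))
    return jamo


def decompose(text: str) -> list[str]:
    """Decompose every character of `text` into sub-character units."""
    return [unit for ch in text for unit in decompose_syllable(ch)]


if __name__ == "__main__":
    # Print the jamo sequence for a short Korean word.
    print(decompose("한국어"))
```

A BPE vocabulary learned over such jamo sequences can merge units below the syllable level, which is one way sub-character decomposition lets subword boundaries align more closely with morpheme boundaries.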