Syntactic elements such as word order and case markers are fundamental in natural language processing. Recent studies show that syntactic information improves language model performance and offers clues to how these models learn. Unlike fixed-word-order languages such as English, Korean has a canonical word order but permits many variations, because case markers indicate the function of each sentence component. This study examines whether Korean language models accurately capture this flexibility. We note that incomplete word orders and omitted case markers appear frequently in everyday Korean communication. To investigate this, we introduce the Syntactically Incomplete Korean (SIKO) dataset. Using SIKO, we assess how flexibly Korean language models handle incomplete syntax and confirm the dataset's value as training data. The results indicate that these models reflect Korean's inherent flexibility and handle incomplete inputs accurately. Moreover, fine-tuning on SIKO improves their ability to handle common incomplete Korean syntactic forms. Given its simple construction process and the significant performance gains it yields, SIKO is also an effective data augmentation technique.
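To make the two phenomena concrete, the following is a minimal Python sketch of how omitted case markers and altered word order could be simulated on whitespace-tokenized Korean text. It is an illustration under stated assumptions, not the SIKO construction procedure, which this abstract does not describe. The function names, the small marker list, and the rule of keeping the sentence-final predicate in place are all hypothetical choices. The suffix stripping is deliberately naive and would damage nouns that happen to end in one of these syllables (for example 아이); a morphological analyzer would be needed in practice.

```python
from itertools import permutations

# Assumed, deliberately small marker set (subject, object, topic); not from the paper.
CASE_MARKERS = ("이", "가", "을", "를", "은", "는")


def drop_case_markers(sentence: str) -> str:
    """Naively strip one trailing case marker from every token except the last."""
    tokens = sentence.split()
    if not tokens:
        return sentence
    stripped = []
    for tok in tokens[:-1]:
        # Only strip when something remains of the token afterwards.
        if len(tok) > 1 and tok.endswith(CASE_MARKERS):
            tok = tok[:-1]
        stripped.append(tok)
    # Assumption: the final token is the predicate and is left untouched.
    stripped.append(tokens[-1])
    return " ".join(stripped)


def reorder_arguments(sentence: str) -> list[str]:
    """Return every reordering of the non-final tokens, excluding the original order."""
    tokens = sentence.split()
    if len(tokens) < 3:
        return []  # nothing to reorder with fewer than two arguments
    *args, predicate = tokens
    variants = []
    for perm in permutations(args):
        if list(perm) != args:
            variants.append(" ".join(perm + (predicate,)))
    return variants
```

For the sentence 철수가 사과를 먹었다 ("Cheolsu ate an apple"), `drop_case_markers` yields the marker-less form 철수 사과 먹었다, and `reorder_arguments` yields the scrambled form 사과를 철수가 먹었다. Both remain interpretable to Korean speakers, which is the kind of flexibility the study probes in language models.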