CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models

While many languages possess processes for joining two or more words into compound words, previous studies have typically been limited to languages with excessively productive compound formation (e.g., German, Dutch), and there is no public dataset containing compound and non-compound words across a large number of languages. In this work, we systematically study decompounding, the task of splitting compound words into their constituents, at a wide scale. We first address the data gap by introducing a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary. We then use this dataset to evaluate an array of Large Language Models (LLMs) on the decompounding task. We find that LLMs perform poorly, especially on words that are tokenized unfavorably by subword tokenization. We thus introduce a novel methodology to train dedicated models for decompounding. The proposed two-stage procedure relies on a fully self-supervised objective in the first stage, while the second, supervised stage optionally fine-tunes the model on the annotated Wiktionary data. Our self-supervised models outperform the prior best unsupervised decompounding models by 13.9% accuracy on average. Our fine-tuned models outperform all prior (language-specific) decompounding tools. Furthermore, we use our models to leverage decompounding during the creation of a subword tokenizer, which we refer to as CompoundPiece. CompoundPiece tokenizes compound words more favorably on average, leading to improved decompounding performance over an otherwise equivalent model using SentencePiece tokenization.
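To make the notion of unfavorable tokenization concrete, the following is a minimal sketch and not the paper's implementation. It assumes that a compound is tokenized "favorably" when every boundary between its constituents coincides with a boundary between subword tokens. The function names and the example segmentations are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def boundaries(pieces):
    """Return the character offsets at which one piece ends and the next begins."""
    offsets = list(accumulate(len(p) for p in pieces))
    return set(offsets[:-1])  # drop the final offset (end of the word)


def is_favorably_tokenized(tokens, constituents):
    """Assumed criterion: every constituent boundary is also a token boundary."""
    if "".join(tokens) != "".join(constituents):
        raise ValueError("tokens and constituents must spell the same word")
    return boundaries(constituents) <= boundaries(tokens)


# Hypothetical segmentations of the compound "bookshelf" (book + shelf).
print(is_favorably_tokenized(["book", "shelf"], ["book", "shelf"]))
print(is_favorably_tokenized(["boo", "kshe", "lf"], ["book", "shelf"]))
```

Under this assumed criterion, the first segmentation keeps the split between "book" and "shelf" visible to the model, while the second hides it inside the token "kshe".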
