CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models

While many languages possess processes for joining two or more words into compound words, previous studies have typically been limited to languages with highly productive compound formation (e.g., German, Dutch), and there is no public dataset containing compound and non-compound words across a large number of languages. In this work, we systematically study decompounding, the task of splitting compound words into their constituents, on a wide scale. We first address the data gap by introducing a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary. We then use this dataset to evaluate a range of Large Language Models (LLMs) on the decompounding task. We find that LLMs perform poorly, especially on words that are tokenized unfavorably by subword tokenization. We therefore introduce a novel methodology for training dedicated decompounding models. The proposed two-stage procedure relies on a fully self-supervised objective in the first stage, while the second, supervised stage optionally fine-tunes the model on the annotated Wiktionary data. Our self-supervised models outperform the prior best unsupervised decompounding models by 13.9% accuracy on average. Our fine-tuned models outperform all prior (language-specific) decompounding tools. Furthermore, we use our models to incorporate decompounding into the creation of a subword tokenizer, which we refer to as CompoundPiece. CompoundPiece tokenizes compound words more favorably on average, leading to improved decompounding performance over an otherwise equivalent model using SentencePiece tokenization.
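To make the notion of favorable tokenization concrete, the following is a minimal sketch under one plausible reading: a compound is tokenized favorably when every boundary between its constituents coincides with a boundary between subword tokens. The function names `boundary_offsets` and `is_favorably_tokenized` are hypothetical and not taken from the paper, and the sketch assumes the constituents concatenate exactly to the surface form, which does not hold for compounds with linking elements.

```python
from typing import List, Set


def boundary_offsets(pieces: List[str]) -> Set[int]:
    """Return the character offsets at which one piece ends and the next begins."""
    offsets: Set[int] = set()
    pos = 0
    for piece in pieces[:-1]:  # the end of the last piece is not an internal boundary
        pos += len(piece)
        offsets.add(pos)
    return offsets


def is_favorably_tokenized(constituents: List[str], tokens: List[str]) -> bool:
    """True if every constituent boundary is also a token boundary.

    Assumption (not from the paper): constituents and tokens are plain strings
    that concatenate to the same surface word, with no special marker symbols.
    """
    if "".join(constituents) != "".join(tokens):
        raise ValueError("constituents and tokens must spell the same word")
    return boundary_offsets(constituents) <= boundary_offsets(tokens)


# "windsurfing" = "wind" + "surfing"
favorable = is_favorably_tokenized(["wind", "surfing"], ["wind", "surf", "ing"])
unfavorable = is_favorably_tokenized(["wind", "surfing"], ["winds", "urf", "ing"])
```

Under this reading, the first tokenization exposes the split point between "wind" and "surfing" to the model, whereas the second hides it inside the token "winds", which is the kind of case the abstract reports as difficult for LLMs.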
