This paper proposes a method for optimizing tokenization to improve the performance of already trained downstream models. The method generates tokenization results that attain lower loss values for a given downstream model on the training data, uses them to restrict the vocabulary, and then trains a tokenizer to reproduce those tokenization results. Because the tokenizer is trained separately from the downstream model, the method can be applied to a variety of tokenization methods, whereas existing work cannot, since it learns the tokenizer and the downstream model simultaneously. As an example, we present a BiLSTM-based tokenizer with vocabulary restriction, which can capture wider contextual information during tokenization than the non-neural tokenization methods used in existing work. Experimental results on Japanese, Chinese, and English text classification tasks show that the proposed method improves performance over existing methods for tokenization optimization.
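The first two steps described above (choosing lower-loss tokenizations, then restricting the vocabulary to the tokens they use) can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the exhaustive candidate enumeration, and the `toy_loss` function standing in for the loss of a trained downstream model are all assumptions made for illustration. The final step, training a BiLSTM-based tokenizer to reproduce the selected tokenizations, is not shown.

```python
from typing import Callable, Iterable, List, Set


def candidate_tokenizations(text: str, vocab: Set[str], max_len: int = 4) -> List[List[str]]:
    """Enumerate every segmentation of `text` whose tokens all belong to `vocab`.

    Hypothetical helper: exhaustive search is only feasible for short inputs.
    """
    if not text:
        return [[]]
    results = []
    for end in range(1, min(len(text), max_len) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in candidate_tokenizations(text[end:], vocab, max_len):
                results.append([piece] + rest)
    return results


def best_tokenization(text: str, vocab: Set[str],
                      loss_fn: Callable[[List[str]], float]) -> List[str]:
    """Return the candidate with the lowest loss under `loss_fn`.

    `loss_fn` is an assumed stand-in for the loss of an already trained
    downstream model. Raises ValueError if `text` cannot be segmented.
    """
    return min(candidate_tokenizations(text, vocab), key=loss_fn)


def restricted_vocab(tokenizations: Iterable[List[str]]) -> Set[str]:
    """Keep only the tokens that appear in the selected tokenizations."""
    return {token for tokens in tokenizations for token in tokens}


# Toy setup: made-up per-token costs replace a real model loss.
VOCAB = {"a", "b", "c", "ab", "bc", "abc"}
TOKEN_COST = {"a": 1.0, "b": 1.0, "c": 1.0, "ab": 0.5, "bc": 1.5, "abc": 2.5}


def toy_loss(tokens: List[str]) -> float:
    return sum(TOKEN_COST[token] for token in tokens)


selected = [best_tokenization("abc", VOCAB, toy_loss)]
print(selected)                    # [['ab', 'c']]
print(restricted_vocab(selected))  # the two tokens 'ab' and 'c'
```

In a realistic setting the candidates would more plausibly be sampled (for example, N-best outputs of an existing tokenizer) than enumerated exhaustively, and the loss would come from a forward pass of the downstream model on each candidate.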