In multilingual settings, non-Latin scripts and low-resource languages are typically disadvantaged in terms of language models' utility, efficiency, and cost. Specifically, prior studies have documented multiple modeling biases that current tokenization algorithms introduce for non-Latin script languages, chief among them over-segmentation. In this work, we propose MAGNET: multilingual adaptive gradient-based tokenization, which reduces over-segmentation through adaptive gradient-based subword tokenization. MAGNET learns to predict segment boundaries between byte tokens in a sequence via sub-modules within the model, which act as internal boundary predictors (tokenizers). Previous gradient-based tokenization methods aimed for uniform compression across sequences by integrating a single boundary predictor during training and optimizing it end-to-end through stochastic reparameterization alongside the next-token prediction objective. However, this approach still results in over-segmentation for non-Latin script languages in multilingual settings. In contrast, MAGNET offers a customizable architecture in which byte-level sequences are routed through language-script-specific predictors, each optimized for its respective script. Compared to previous methods, this modularity enforces equitable segmentation granularity across language scripts. Through extensive experiments, we demonstrate that in addition to reducing segmentation disparities, MAGNET enables faster language modeling and improves downstream utility.
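The routing idea described above can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: the script detector, the per-script target rates, and the fixed-stride "predictors" are stand-ins for the learned boundary-predictor sub-modules, included only to show how routing byte sequences to script-specific predictors can equalize segmentation granularity.

```python
# Hypothetical sketch of MAGNET-style routing: each byte sequence is sent to a
# script-specific boundary predictor, and each predictor targets a granularity
# appropriate to its script. All names and thresholds here are illustrative.
from typing import Callable, Dict, List


def detect_script(text: str) -> str:
    """Crude script tag used only for routing in this sketch."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "han"
    if any("\u0900" <= ch <= "\u097f" for ch in text):
        return "devanagari"
    return "latin"


def make_predictor(target_rate: float) -> Callable[[bytes], List[int]]:
    """Return a toy predictor placing a boundary roughly every 1/target_rate
    bytes; in MAGNET this role is played by a learned sub-module."""
    stride = max(1, round(1 / target_rate))

    def predict(seq: bytes) -> List[int]:
        # 1 marks a segment boundary after byte position i.
        return [1 if (i + 1) % stride == 0 else 0 for i in range(len(seq))]

    return predict


# One predictor per script: non-Latin scripts get a coarser byte rate so they
# are not over-segmented relative to Latin (rates chosen for illustration).
predictors: Dict[str, Callable[[bytes], List[int]]] = {
    "latin": make_predictor(target_rate=0.25),       # ~4 bytes per segment
    "han": make_predictor(target_rate=1 / 3),        # ~3 bytes, one UTF-8 CJK char
    "devanagari": make_predictor(target_rate=1 / 3),
}


def segment(text: str) -> List[bytes]:
    """Route the UTF-8 byte sequence to its script's predictor and split."""
    seq = text.encode("utf-8")
    boundaries = predictors[detect_script(text)](seq)
    segments, start = [], 0
    for i, flag in enumerate(boundaries):
        if flag:
            segments.append(seq[start : i + 1])
            start = i + 1
    if start < len(seq):
        segments.append(seq[start:])
    return segments
```

With a coarser rate for Han text, `segment("你好")` yields one segment per character rather than one per byte, mirroring the equitable-granularity goal; a single shared predictor tuned on Latin text would instead split such sequences far more finely.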