This paper provides a detailed discussion of the multilingual tokenizer used for GPT-SW3. The tokenizer was trained on the Nordic Pile using the SentencePiece library and the BPE algorithm. We outline the tokenizer's most important features and share details on its learned vocabulary. In addition, we systematically analyze the tokenizer's properties and evaluate its performance across the different languages present in the data.