While WangchanBERTa has become the de facto standard for transformer-based Thai language modeling, it still struggles to understand foreign words, most notably English words, which are often borrowed into Thai without orthographic assimilation. We identify the lack of foreign vocabulary in WangchanBERTa's tokenizer as the main source of these shortcomings. We then expand WangchanBERTa's vocabulary via vocabulary transfer from XLM-R's pretrained tokenizer and, starting from WangchanBERTa's checkpoint, pretrain a new model with the expanded tokenizer on a dataset larger than the one used to train WangchanBERTa. Our results show that the resulting model, PhayaThaiBERT, outperforms WangchanBERTa on many downstream tasks and datasets.
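To make the vocabulary-expansion step concrete, the following is a minimal sketch, not the authors' implementation. It assumes a vocabulary is a token-to-id mapping with contiguous ids and that each new token's embedding row is initialized to the mean of the existing rows; that initialization is a common heuristic chosen here for illustration, and the abstract does not state how the new embeddings were initialized. The function name `expand_vocab` and the toy tokens are hypothetical.

```python
from typing import Dict, List, Tuple


def expand_vocab(
    base_vocab: Dict[str, int],
    donor_tokens: List[str],
    base_embeddings: List[List[float]],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Append donor tokens missing from the base vocabulary.

    Assumption (not from the paper): ids are contiguous 0..n-1 and each new
    embedding row is initialized to the mean of the existing base rows.
    """
    dim = len(base_embeddings[0])
    n = len(base_embeddings)
    # Column-wise mean of the existing embedding matrix.
    mean_row = [sum(row[d] for row in base_embeddings) / n for d in range(dim)]

    new_vocab = dict(base_vocab)
    new_embeddings = [list(row) for row in base_embeddings]
    for token in donor_tokens:
        if token in new_vocab:
            continue  # keep the base model's existing id and embedding
        new_vocab[token] = len(new_embeddings)
        new_embeddings.append(list(mean_row))
    return new_vocab, new_embeddings


# Toy example: a tiny Thai vocabulary extended with two English pieces.
base_vocab = {"<s>": 0, "▁สวัสดี": 1, "▁ครับ": 2}
base_embeddings = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
donor_tokens = ["▁ครับ", "▁hello", "▁world"]

new_vocab, new_embeddings = expand_vocab(base_vocab, donor_tokens, base_embeddings)
```

In this sketch the original rows are left untouched, so a model initialized from the base checkpoint behaves identically on text that uses only the old tokens. Only the appended rows have to be learned during continued pretraining.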