Recent instruction-tuned large language models (LLMs) perform well on a wide range of tasks, but their performance often degrades on non-English input. There is evidence that the cause lies in inefficient tokenization, which stems from the low representation of these languages in the pre-training data; this hinders the comprehension of non-English instructions and limits the potential of instruction-tuning in the target language. In this work, we investigate whether vocabulary substitution can address the issue, in the context of adapting LLaMa to Russian. We explore three variants of vocabulary adaptation and evaluate them on Saiga instruction-tuning and on fine-tuning for the Russian SuperGLUE benchmark. Automatic evaluation shows that vocabulary substitution not only improves the model's quality in Russian but also accelerates fine-tuning (by 35%) and inference (by up to 60%) while reducing memory consumption. Additional human evaluation of the instruction-tuned models shows that users prefer the answers generated by models with a Russian-adapted vocabulary over those of the original Saiga-LLaMa model.
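To make the tokenization argument concrete, the following minimal sketch compares token counts for the same Russian phrase under an English-centric vocabulary and a Russian-adapted one. It is an illustration only: the greedy longest-match tokenizer with UTF-8 byte fallback is a simplified stand-in rather than the actual LLaMa tokenizer, and the function `tokenize` and both toy vocabularies are hypothetical names introduced here.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer with UTF-8 byte fallback.

    Assumption: a simplified stand-in for a subword tokenizer,
    not the tokenizer used in the paper.
    """
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate substring first.
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry matches: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical toy vocabularies (not taken from the paper).
base_vocab = {"hello", "world", " "}
adapted_vocab = base_vocab | {"привет", "мир"}

russian_text = "привет мир"
base_tokens = tokenize(russian_text, base_vocab)
adapted_tokens = tokenize(russian_text, adapted_vocab)

print(len(base_tokens), len(adapted_tokens))
```

In this toy setting the English-centric vocabulary splits every Cyrillic letter into two byte tokens, so the Russian phrase takes 19 tokens, while the adapted vocabulary needs 3. Shorter token sequences are one plausible source of the fine-tuning and inference speed-ups reported above, although the real gains depend on the actual tokenizer and model.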