Large language models (LLMs) are pretrained to predict the next word, but scaling them requires substantial computing resources. Many large technology companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demand, yet these models overlook less-resourced languages (LRLs). This study proposes three strategies for improving LRL performance on top of publicly available MLLMs. First, the LRL vocabulary of the MLLM is expanded to enhance expressiveness. Second, bilingual data are used for pretraining to align the high- and less-resourced languages. Third, a small, high-quality instruction dataset is constructed and used for instruction tuning to strengthen the model's LRL capability. The experiments use Llama2 as the base model and Korean as the LRL; the resulting model is evaluated quantitatively against other LLMs across eight tasks and qualitatively through human evaluation and GPT4. The results show that the proposed Bllossom model outperforms previously proposed Korean monolingual models in the qualitative analyses.