Recent advances in large language models (LLMs) have underscored their importance in the evolution of artificial intelligence. However, despite extensive pretraining on multilingual datasets, available open-source LLMs exhibit limited effectiveness in processing Vietnamese. The challenge is exacerbated by the absence of systematic benchmark datasets and metrics tailored for Vietnamese LLM evaluation. To mitigate these issues, we fine-tuned LLMs specifically for Vietnamese and developed a comprehensive evaluation framework encompassing 10 common tasks and 31 metrics. Our evaluation results reveal that the fine-tuned LLMs exhibit enhanced comprehension and generative capabilities in Vietnamese. Moreover, our analysis indicates that models with more parameters can introduce more biases and uncalibrated outputs, and that the key factor influencing LLM performance is the quality of the training or fine-tuning datasets. These insights underscore the significance of meticulous fine-tuning with high-quality datasets in enhancing LLM performance.