The rapid advancement of large language models (LLMs) calls for new benchmarks that accurately assess their capabilities. To address this need for Vietnamese, this work introduces ViLLM-Eval, a comprehensive evaluation suite designed to measure the advanced knowledge and reasoning abilities of foundation models in a Vietnamese context. ViLLM-Eval consists of multiple-choice questions and next-word prediction tasks that span several difficulty levels and diverse disciplines, ranging from the humanities to science and engineering. A thorough evaluation of the most advanced LLMs on ViLLM-Eval shows that even the best-performing models have significant room for improvement in understanding and responding to Vietnamese-language tasks. We believe ViLLM-Eval will be instrumental in identifying the key strengths and weaknesses of foundation models, ultimately promoting their development and improving their performance for Vietnamese users.