In this paper, we evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) on multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, which has fewer challenging MCQA datasets than English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has used the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT. However, these studies have mainly examined how ChatGPT solves the VNHSGE step by step. We aim to create a novel, high-quality dataset by providing structured guidelines for typing LaTeX formulas in mathematics, physics, chemistry, and biology. Because it is typed in a strict LaTeX style, this dataset can be used to evaluate the MCSB ability of both LLMs and smaller language models (LMs). We focus on predicting the character (A, B, C, or D) that is the most likely answer to a question, given the context of the question. We evaluate six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and on our proposed dataset, and the results are promising for the MCSB ability of LLMs in Vietnamese. The dataset is available for research purposes only.
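To make the MCSB setup concrete, the following is a minimal sketch of how a k-shot prompt might be assembled and how an answer symbol might be selected from per-symbol scores. The prompt template, the function names `build_prompt` and `pick_symbol`, and the score dictionary are all hypothetical illustrations; they are not taken from the paper, and the scores are assumed to be next-token log-probabilities obtained from whichever model is being evaluated.

```python
from typing import Dict, List, Optional, Tuple

# Answer symbols the model must bind to the listed options.
SYMBOLS = ("A", "B", "C", "D")

# A demonstration is (question, options, gold symbol). Hypothetical structure.
Shot = Tuple[str, List[str], str]


def build_prompt(question: str, options: List[str],
                 shots: Optional[List[Shot]] = None) -> str:
    """Format a k-shot MCQA prompt that ends in 'Answer:'.

    With shots=None this is the zero-shot case; one demonstration gives
    one-shot, several give few-shot. The template is an assumption.
    """
    def block(q: str, opts: List[str], answer: str = "") -> str:
        lines = [f"Question: {q}"]
        lines += [f"{s}. {o}" for s, o in zip(SYMBOLS, opts)]
        # rstrip removes the trailing space when no answer is supplied.
        lines.append(f"Answer: {answer}".rstrip())
        return "\n".join(lines)

    parts = [block(q, o, a) for q, o, a in (shots or [])]
    parts.append(block(question, options))
    return "\n\n".join(parts)


def pick_symbol(symbol_logprobs: Dict[str, float]) -> str:
    """Return the symbol with the highest score.

    Symbols missing from the dictionary are treated as -inf.
    """
    return max(SYMBOLS, key=lambda s: symbol_logprobs.get(s, float("-inf")))
```

In use, `build_prompt` would produce the text sent to a model, and `pick_symbol` would be applied to the scores that model assigns to the tokens A, B, C, and D at the answer position. How those scores are obtained depends on the model interface and is left out of the sketch.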