In this paper, we evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, which has fewer challenging MCQA datasets than English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has used the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT, but these studies have mainly examined how ChatGPT solves the VNHSGE step by step. We aim to create a novel, high-quality dataset by providing structured guidelines for typing LaTeX formulas for mathematics, physics, chemistry, and biology. Because it is typed in a strict LaTeX style, this dataset can be used to evaluate the MCSB ability of both LLMs and smaller language models (LMs). We focus on predicting the character (A, B, C, or D) that is the most likely answer to a question, given the context of the question. We evaluate six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and on our proposed dataset; the results are promising for the MCSB ability of LLMs in Vietnamese. The dataset is available for research purposes only.
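The prediction task described above, choosing the single character A, B, C, or D that a model considers most likely, can be illustrated with a minimal sketch. The prompt template, the function names `build_prompt`, `predict_letter`, and `accuracy`, and the `score_fn` callback are all hypothetical and are not taken from the paper; `score_fn` stands in for whatever model-specific call returns a score (for example, a log-probability) for a candidate letter given the prompt.

```python
from typing import Callable, Sequence

LETTERS = ("A", "B", "C", "D")


def build_prompt(context: str, question: str, options: Sequence[str],
                 examples: Sequence[str] = ()) -> str:
    """Format one MCQA item (hypothetical template, not the paper's).

    `examples` holds already-formatted solved items: empty for zero-shot,
    one item for one-shot, several for few-shot.
    """
    if len(options) != len(LETTERS):
        raise ValueError("expected exactly four options")
    lines = list(examples)
    if context:
        lines.append(f"Context: {context}")
    lines.append(f"Question: {question}")
    lines.extend(f"{letter}. {text}" for letter, text in zip(LETTERS, options))
    lines.append("Answer:")
    return "\n".join(lines)


def predict_letter(prompt: str, score_fn: Callable[[str, str], float]) -> str:
    """Return the letter with the highest score under `score_fn`.

    `score_fn(prompt, letter)` is an assumed interface to the model.
    """
    scores = {letter: score_fn(prompt, letter) for letter in LETTERS}
    return max(scores, key=scores.get)


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted letters that match the gold letters."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Under these assumptions, zero-shot, one-shot, and few-shot runs differ only in how many solved items are passed through `examples`; the letter-selection step is the same in every setting.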