A common trend in recent studies of language models (LMs) is the use of standardized tests for evaluation. However, although Portuguese is the fifth most spoken language worldwide, few such evaluations have been conducted in it, mainly because the community lacks high-quality datasets for this purpose. To address this gap, we introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP. The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects. Furthermore, BLUEX includes a collection of recently administered exams that are unlikely to be included in the training data of many popular LMs as of 2023. The dataset is also annotated to indicate the position of images in each question, providing a valuable resource for advancing the state of the art in multimodal language understanding and reasoning. We describe the creation and characteristics of BLUEX and establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing natural language understanding and reasoning in Portuguese. The data and relevant code can be found at https://github.com/Portuguese-Benchmark-Datasets/BLUEX