Recent language models have achieved human-comparable performance on academic entrance exams. However, existing studies often overlook questions that require visual comprehension, and so fail to capture the full scope and complexity of real-world scenarios. To address this gap, we present a comprehensive framework for evaluating language models on entrance exams that incorporates both textual and visual elements. We evaluate models on the two most recent editions of the Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities. Our study not only reaffirms GPT-4 as the state of the art for handling complex multidisciplinary questions, but also pioneers a realistic assessment of multimodal language models on Portuguese examinations. One highlight is that text captions transcribing the visual content outperform the direct use of images, suggesting that the vision model has room for improvement. Yet, despite the improvements afforded by images or captions, mathematical questions remain a challenge for these state-of-the-art models. The code and data used in the experiments are available at https://github.com/piresramon/gpt-4-enem.