Cognitive textual and visual reasoning tasks, such as puzzles, series, and analogies, demand the ability to quickly reason about, decipher, and evaluate patterns both textually and spatially. While LLMs and VLMs, through extensive training on large amounts of human-curated data, have attained a high level of pseudo-human intelligence in some commonsense reasoning tasks, they still struggle with more complex reasoning tasks that require cognitive understanding. In this work, we introduce a new dataset, NTSEBench, designed to evaluate the cognitive multi-modal reasoning and problem-solving skills of large models. The dataset contains 2,728 multiple-choice questions with a total of 4,642 images across 26 categories, sampled from the NTSE examination conducted nationwide in India, and features both visual and textual general-aptitude questions that do not rely on rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open-source and proprietary models, we propose four distinct modeling strategies to handle the different modalities (text and images) in the dataset instances.