Geometric shapes play important roles in both the physical world and human cognition. While multimodal large language models (MLLMs) have made significant advances in visual understanding, their ability to recognize geometric shapes and the spatial relationships among them, which we term \emph{geometric perception}, has not been explicitly and systematically explored. To address this gap, we introduce GePBench, a novel benchmark specifically designed to assess the geometric perception capabilities of MLLMs. Our extensive evaluations reveal that even current state-of-the-art MLLMs exhibit significant deficiencies in geometric perception tasks. Furthermore, we show that models trained with GePBench data demonstrate considerable improvements on a wide range of downstream tasks, highlighting the critical role of geometric perception in enabling advanced multimodal applications. Our code and datasets are available at \href{https://github.com/Changhao-Xiang/GePBench}{https://github.com/Changhao-Xiang/GePBench}.