Multimodal large language models (MLLMs) have achieved significant advancements in integrating visual and linguistic understanding. While existing benchmarks evaluate these models in context-rich, real-life scenarios, they often overlook fundamental perceptual skills essential for environments deviating from everyday realism. In particular, geometric perception, the ability to interpret spatial relationships and abstract visual patterns, remains underexplored. To address this limitation, we introduce GePBench, a novel benchmark designed to assess the geometric perception capabilities of MLLMs. Results from extensive evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies in such tasks. Additionally, we demonstrate that models trained with data sourced from GePBench show notable improvements on a wide range of downstream tasks, underscoring the importance of geometric perception as a foundation for advanced multimodal applications. Our code and datasets will be publicly available.