RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated by 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench, a manually filtered benchmark of 350 images comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information (including captions, depth maps, and more) or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models reliably identify right-side-up (0°) images, and certain models can also identify upside-down (180°) images, but none can reliably distinguish between images rotated by 90° and by 270°. Showing the image in several orientations simultaneously leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
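
The evaluation described above can be illustrated with a minimal sketch; it is not the RotBench implementation. It assumes images are toy 2D grids rather than real pixel data, that rotations are counterclockwise (the abstract does not state the direction), and that a model is any callable returning one of 0, 90, 180, or 270. The names `rotate_ccw`, `per_angle_accuracy`, `vote_predict`, and `corner_oracle` are hypothetical, and the voting scheme shown is only one plausible reading of the "modified setup using voting".

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence

Image = List[List[int]]  # toy stand-in for pixel data: a square 2D grid
ANGLES = (0, 90, 180, 270)


def rotate_ccw(image: Image, angle: int) -> Image:
    """Rotate a 2D grid counterclockwise by a multiple of 90 degrees."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90")
    out = [list(row) for row in image]
    for _ in range((angle // 90) % 4):
        # One counterclockwise quarter turn: transpose, then reverse row order.
        out = [list(row) for row in zip(*out)][::-1]
    return out


def per_angle_accuracy(
    images: Sequence[Image], predict: Callable[[Image], int]
) -> Dict[int, float]:
    """Show every upright image at each rotation and score each angle separately."""
    correct = defaultdict(int)
    for image in images:
        for angle in ANGLES:
            if predict(rotate_ccw(image, angle)) == angle:
                correct[angle] += 1
    return {angle: correct[angle] / len(images) for angle in ANGLES}


def vote_predict(image: Image, predict: Callable[[Image], int]) -> int:
    """Assumed voting scheme: query four extra rotations, undo each offset, take the majority."""
    votes = Counter()
    for offset in ANGLES:
        prediction = predict(rotate_ccw(image, offset))
        votes[(prediction - offset) % 360] += 1
    return votes.most_common(1)[0][0]


def corner_oracle(image: Image) -> int:
    """Toy predictor: upright images carry a single marker (1) in the top-left corner."""
    last = len(image) - 1
    corners = {(0, 0): 0, (last, 0): 90, (last, last): 180, (0, last): 270}
    for (row, col), angle in corners.items():
        if image[row][col] == 1:
            return angle
    raise ValueError("no corner marker found")
```

Reporting accuracy per angle rather than as a single aggregate is what exposes the asymmetry the abstract describes: a predictor that answers 90 whenever the true rotation is 270 still scores perfectly on 0°, 90°, and 180° in this toy setup.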
