Top-down images play an important role in safety-critical settings such as autonomous navigation and aerial surveillance, where they provide holistic spatial information that front-view images cannot capture. Despite this, Vision Language Models (VLMs) are mostly trained and evaluated on front-view benchmarks, leaving their performance in the top-down setting poorly understood. Existing evaluations also overlook a unique property of top-down images: their physical meaning is preserved under rotation. In addition, conventional accuracy metrics can be misleading, since they are often inflated by hallucinations or "lucky guesses," obscuring a model's true reliability and its grounding in visual evidence. To address these issues, we introduce TDBench, a benchmark for top-down image understanding that includes 2000 curated questions for each rotation. We further propose RotationalEval (RE), which measures whether models give consistent answers across four rotated views of the same scene, and we develop a reliability framework that separates genuine knowledge from chance. Finally, we conduct four case studies targeting underexplored real-world challenges. By combining rigorous evaluation with reliability metrics, TDBench not only benchmarks VLMs in top-down perception but also offers a new perspective on trustworthiness, guiding the development of more robust and grounded AI systems. Project homepage: https://github.com/Columbia-ICSL/TDBench
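To make the RotationalEval (RE) idea concrete, the following is a minimal sketch of how consistency across four rotated views could be scored. It assumes a hypothetical `ask_vlm` function standing in for whatever model API is under evaluation; the actual TDBench implementation may differ in details such as answer matching and rotation handling.

```python
# Minimal sketch of RotationalEval (RE): credit a question only if the model
# answers correctly under all four rotations of the same top-down image.
# `ask_vlm` is a hypothetical placeholder, not the TDBench API.
from PIL import Image

ROTATIONS = [0, 90, 180, 270]  # degrees; top-down semantics are rotation-invariant


def ask_vlm(image: Image.Image, question: str) -> str:
    """Placeholder for a call to the vision-language model being evaluated."""
    raise NotImplementedError


def rotational_eval(image: Image.Image, question: str, answer: str) -> bool:
    """True only if the model is correct on every rotated view of the scene."""
    preds = [ask_vlm(image.rotate(deg, expand=True), question) for deg in ROTATIONS]
    return all(pred == answer for pred in preds)


def re_accuracy(dataset) -> float:
    """Fraction of (image, question, answer) triples passing rotational_eval."""
    hits = sum(rotational_eval(img, q, a) for img, q, a in dataset)
    return hits / len(dataset)
```

Because a model must be right on all four views rather than any single one, RE is strictly harder to inflate by chance than per-image accuracy, which is the motivation for pairing it with the reliability framework described above.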