Recent advances in text-to-video generation have produced visually compelling results, yet it remains unclear whether these models encode geographically equitable visual knowledge. In this work, we investigate the geo-equity and geographically grounded visual knowledge of text-to-video models through an attraction-centric evaluation. We introduce Geo-Attraction Landmark Probing (GAP), a systematic framework for assessing how faithfully models synthesize tourist attractions from diverse regions, and construct GEOATTRACTION-500, a benchmark of 500 globally distributed attractions spanning varied regions and popularity levels. GAP integrates complementary metrics that disentangle overall video quality from attraction-specific knowledge, including global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments, all validated against human evaluation. Applying GAP to the state-of-the-art text-to-video model Sora 2, we find that, contrary to common assumptions of strong geographic bias, the model exhibits a relatively uniform level of geographically grounded visual knowledge across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity. These results suggest that current text-to-video models express global visual knowledge more evenly than expected, highlighting both their promise for globally deployed applications and the need for continued evaluation as such systems evolve.