Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

Generating rich 360-degree panoramic worlds directly from text is rapidly becoming practical, yet we still lack a reliable way to evaluate how well the generated panoramas align semantically with their prompts. Contrastive Language-Image Pre-training (CLIP) models are the standard evaluators for this kind of alignment, but they are trained predominantly on perspective image-text pairs, so it remains an open question whether they understand the characteristics specific to 360-degree panoramic image-text pairs. This paper addresses that question by first introducing two concepts: \emph{360-degree textual semantics}, the semantic information conveyed by explicit format identifiers in the text, and \emph{360-degree visual semantics}, the semantics that remain invariant under horizontal circular shifts of the panorama. To probe CLIP's comprehension of these semantics, we then propose evaluation methodologies based on keyword manipulation and on horizontal circular shifts of varying magnitudes. Statistical analyses across popular CLIP configurations reveal that: (1) CLIP models effectively leverage explicit textual identifiers, demonstrating an understanding of 360-degree textual semantics; and (2) CLIP models fail to robustly preserve semantic alignment under horizontal circular shifts, indicating limited comprehension of 360-degree visual semantics. To address this limitation, we propose a LoRA-based fine-tuning framework that explicitly instills invariance to circular shifts. Our fine-tuned models exhibit improved comprehension of 360-degree visual semantics, at the cost of a slight degradation in original semantic evaluation performance, which highlights a fundamental trade-off in adapting CLIP to 360-degree panoramic images. Code is available at https://github.com/littlewhitesea/360Semantics.
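As an illustration of the horizontal circular shift used in the visual probe, the following is a minimal sketch and not the released implementation. It assumes the panorama is an equirectangular image stored as a NumPy array of shape (height, width, channels). The names `circular_shift`, `score_spread`, and `score_fn` are hypothetical; `score_fn` stands in for any image-text scoring function, such as a CLIP similarity with a fixed prompt.

```python
import numpy as np


def circular_shift(pano: np.ndarray, fraction: float) -> np.ndarray:
    """Shift an equirectangular panorama horizontally with wrap-around.

    `fraction` is the shift as a fraction of the image width, so 0.5
    corresponds to a 180-degree rotation about the vertical axis.
    """
    width = pano.shape[1]
    offset = int(round(fraction * width)) % width
    # Columns pushed past the right edge re-enter on the left.
    return np.roll(pano, shift=offset, axis=1)


def score_spread(pano, score_fn, fractions=(0.0, 0.25, 0.5, 0.75)):
    """Return max minus min of `score_fn` over shifted copies of `pano`.

    A spread of 0 means the (hypothetical) scorer is invariant to the
    tested shifts; larger values mean the score depends on the shift.
    """
    scores = [score_fn(circular_shift(pano, f)) for f in fractions]
    return max(scores) - min(scores)
```

Because a circular shift only changes where the panorama is cut, every shifted copy depicts the same scene, so a scorer that captures 360-degree visual semantics would be expected to give a spread close to zero.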
