Vision-language models (VLMs) have demonstrated strong performance in image geolocation, a capability further sharpened by frontier multimodal large reasoning models (MLRMs). This poses a significant privacy risk: these widely accessible models can be exploited to infer sensitive locations from casually shared photos, often at street-level precision, potentially exceeding the level of detail the sharer consented to or intended to disclose. While recent work has proposed blanket restrictions on geolocation disclosure to combat this risk, such measures fail to distinguish legitimate uses of geolocation from malicious ones. Instead, VLMs should maintain contextual integrity, reasoning about the elements within an image to determine the appropriate level of information disclosure and thereby balance privacy and utility. To evaluate how well models respect contextual integrity, we introduce VLM-GEOPRIVACY, a benchmark that challenges VLMs to interpret latent social norms and contextual cues in real-world images and to determine the appropriate level of location disclosure. Our evaluation of 14 leading VLMs shows that, despite their ability to geolocate images precisely, the models are poorly aligned with human privacy expectations: they often over-disclose in sensitive contexts and are vulnerable to prompt-based attacks. Our results call for new design principles that incorporate context-conditioned privacy reasoning into multimodal systems.