Vision Language Models (VLMs) are rapidly advancing in their capability to answer information-seeking questions. As these models are widely deployed in consumer applications, they could lead to new privacy risks due to emergent abilities to identify people in photos, geolocate images, etc. As we demonstrate, somewhat surprisingly, current open-source and proprietary VLMs are very capable image geolocators, making widespread geolocation with VLMs an immediate privacy risk, rather than merely a theoretical future concern. As a first step to address this challenge, we develop a new benchmark, GPTGeoChat, to test the ability of VLMs to moderate geolocation dialogues with users. We collect a set of 1,000 image geolocation conversations between in-house annotators and GPT-4v, which are annotated with the granularity of location information revealed at each turn. Using this new dataset, we evaluate the ability of various VLMs to moderate GPT-4v geolocation conversations by determining when too much location information has been revealed. We find that custom fine-tuned models perform on par with prompted API-based models when identifying leaked location information at the country or city level; however, fine-tuning on supervised data appears to be needed to accurately moderate finer granularities, such as the name of a restaurant or building.