We learn a visual representation that captures information about the camera that recorded a given photo. To do this, we train a multimodal embedding between image patches and the EXIF metadata that cameras automatically insert into image files. Our model represents this metadata by simply converting it to text and then processing it with a transformer. The features that we learn significantly outperform other self-supervised and supervised features on downstream image forensics and calibration tasks. In particular, we successfully localize spliced image regions "zero shot" by clustering the visual embeddings for all of the patches within an image.
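The two mechanical steps named in the abstract, converting EXIF metadata to text and clustering per-patch embeddings within one image, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Name: value` text format, the helper names `exif_to_text` and `cluster_patches`, and the choice of a two-way spherical k-means as the clustering method. The patch embeddings are assumed to come from an already trained image encoder, which is not shown.

```python
import math
from typing import Dict, List, Sequence


def exif_to_text(tags: Dict[str, object]) -> str:
    """Serialize EXIF tags as 'Name: value' pairs in one string.

    Hypothetical format: the abstract only says the metadata is
    converted to text before being fed to a transformer.
    """
    return ", ".join(f"{name}: {value}" for name, value in tags.items())


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def cluster_patches(embeddings: Sequence[Sequence[float]],
                    iters: int = 10) -> List[int]:
    """Assign each patch embedding to one of two clusters.

    Assumed stand-in for the clustering step: spherical k-means
    with k=2, using cosine similarity between unit vectors.
    """
    unit = [_normalize(e) for e in embeddings]
    # Initialize with the first patch and the patch least similar to it.
    first = unit[0]
    farthest = min(unit, key=lambda u: _dot(u, first))
    centers = [first, farthest]
    labels = [0] * len(unit)
    for _ in range(iters):
        # Assignment step: pick the more similar center.
        labels = [
            0 if _dot(u, centers[0]) >= _dot(u, centers[1]) else 1
            for u in unit
        ]
        # Update step: re-estimate each center as the normalized mean.
        for k in (0, 1):
            members = [u for u, lab in zip(unit, labels) if lab == k]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[k] = _normalize(mean)
    return labels


if __name__ == "__main__":
    print(exif_to_text({"Make": "Canon", "ISO": 100}))
    # Toy 2-D "embeddings": three patches from one source, two from another.
    toy = [[1.0, 0.1], [0.9, 0.0], [1.0, 0.05], [0.0, 1.0], [0.1, 0.9]]
    print(cluster_patches(toy))
```

In this sketch, patches whose embeddings fall into different clusters are candidates for having come from different cameras, which is the intuition behind localizing a spliced region without any splice-specific training. Deciding which cluster corresponds to the spliced region is left open here.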