This paper explores the capacity of computer vision models to discern temporal information in visual content, focusing on historical photographs. We investigate image dating with OpenCLIP, an open-source implementation of CLIP, a multi-modal language and vision model. Our experiment consists of three steps: zero-shot classification, fine-tuning, and analysis of visual content. We use the \textit{De Boer Scene Detection} dataset, which contains 39,866 grayscale historical press photographs taken between 1950 and 1999. The results show that zero-shot classification is relatively ineffective for image dating and is biased towards predicting earlier dates. Fine-tuning OpenCLIP with a logistic classifier improves performance and eliminates this bias. Our analysis further shows that images featuring buses, cars, cats, dogs, and people are dated more accurately, which suggests that these subjects carry temporal markers. The study highlights the potential of machine learning models such as OpenCLIP for dating images and underlines the importance of fine-tuning for accurate temporal analysis. Future research should test whether these findings extend to color photographs and to more diverse datasets.
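The zero-shot step and the bias it reveals can be illustrated with a minimal sketch. It assumes that image and text-prompt embeddings have already been computed (for example with OpenCLIP) and are available as plain lists of floats. The function names, the two-dimensional toy vectors, and the use of mean signed error as a bias measure are illustrative assumptions, not the procedure reported in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_year(image_emb, prompt_embs):
    """Return the year whose text-prompt embedding is closest to the image.

    prompt_embs maps a candidate year to the embedding of a prompt for
    that year (the prompt wording itself is hypothetical here).
    """
    return max(prompt_embs, key=lambda year: cosine(image_emb, prompt_embs[year]))


def mean_signed_error(predicted, actual):
    """Average of (predicted - actual); negative values mean the
    predictions fall earlier than the true dates on average."""
    return sum(p - a for p, a in zip(predicted, actual)) / len(actual)


# Toy embeddings standing in for real model outputs (assumption).
prompts = {1950: [1.0, 0.0], 1975: [1.0, 1.0], 1999: [0.0, 1.0]}
images = [[0.9, 0.1], [0.2, 0.8]]
true_years = [1960, 1999]

predicted = [zero_shot_year(img, prompts) for img in images]
bias = mean_signed_error(predicted, true_years)
```

With real embeddings, a clearly negative `bias` over the whole dataset would correspond to the tendency towards earlier dates described in the abstract. The fine-tuning step would replace `zero_shot_year` with a classifier trained on the image embeddings.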