Social media images have proven to be a valuable source of information for understanding human interactions with important subjects such as cultural heritage, biodiversity, and nature. Grouping such images into semantically meaningful clusters without labels is challenging because of their large volume and the high diversity and complexity of their visual content. Recent advances in Large Visual Models (LVMs), Large Language Models (LLMs), and Large Visual Language Models (LVLMs) provide an important opportunity to explore new productive and scalable solutions. This work proposes, analyzes, and compares several approaches, each based on one or more state-of-the-art LVMs, LLMs, and LVLMs, for mapping social media images into a set of predefined classes. As a case study, we consider the problem of understanding the interactions between humans and nature, also known as Nature's Contribution to People or Cultural Ecosystem Services (CES). Our experiments show that the highest-performing approaches, with accuracy above 95%, still require the creation of a small labeled dataset. These include the fine-tuned LVM DINOv2 and the LVLM LLaVA-1.5 combined with a fine-tuned LLM. The best fully unsupervised approaches, with accuracy above 84%, are the LVLMs, specifically the proprietary GPT-4 model and the public LLaVA-1.5 model. In addition, the LVM DINOv2 applied in a 10-shot learning setup delivers competitive results, with an accuracy of 83.99%, close to the performance of the LVLM LLaVA-1.5.