On Large Uni- and Multi-modal Models for Unsupervised Classification of Social Media Images: Nature's Contribution to People as a case study

Social media images have proven to be a valuable source of information for understanding human interactions with subjects such as cultural heritage, biodiversity, and nature. Grouping such images into semantically meaningful clusters without labels is challenging because of their large volume and the high diversity and complexity of their visual content. At the same time, recent advances in Large Visual Models (LVMs), Large Language Models (LLMs), and Large Visual Language Models (LVLMs) offer an opportunity to explore new efficient and scalable solutions. This work proposes, analyzes, and compares several approaches, each based on one or more state-of-the-art LVMs, LLMs, and LVLMs, for mapping social media images into a number of predefined classes. As a case study, we consider the problem of understanding the interactions between humans and nature, also known as Nature's Contribution to People or Cultural Ecosystem Services (CES). Our experiments show that the highest-performing approaches, with accuracy above 95%, still require the creation of a small labeled dataset. These are the fine-tuned LVM DINOv2 and the LVLM LLaVA-1.5 combined with a fine-tuned LLM. The best fully unsupervised approaches, with accuracy above 84%, are the LVLMs, specifically the proprietary GPT-4 model and the public LLaVA-1.5 model. In addition, the LVM DINOv2 applied in a 10-shot learning setup delivered competitive results, with an accuracy of 83.99%, closely matching the performance of the LVLM LLaVA-1.5.
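To make the few-shot setting mentioned above concrete, the following is a minimal sketch of one common way to classify images from a handful of labeled examples per class: average the embeddings of the labeled examples into one prototype per class, then assign each new image to the prototype with the highest cosine similarity. This is an assumption for illustration only; the abstract does not state which few-shot classifier was used with DINOv2. The function names, the toy 2-dimensional vectors, and the class names `landscape` and `wildlife` are hypothetical. In practice the embeddings would come from a visual model such as DINOv2.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def class_prototypes(support):
    """Build one prototype per class from (embedding, label) pairs.

    `support` is the small labeled set, e.g. 10 examples per class
    in a 10-shot setup (hypothetical helper, not from the paper).
    """
    grouped = defaultdict(list)
    for emb, label in support:
        grouped[label].append(l2_normalize(emb))
    prototypes = {}
    for label, embs in grouped.items():
        dim = len(embs[0])
        mean = [sum(e[i] for e in embs) / len(embs) for i in range(dim)]
        prototypes[label] = l2_normalize(mean)
    return prototypes


def classify(embedding, prototypes):
    """Return the label whose prototype has the highest cosine similarity."""
    query = l2_normalize(embedding)
    return max(
        prototypes,
        key=lambda label: sum(q * p for q, p in zip(query, prototypes[label])),
    )


def accuracy(queries, prototypes):
    """Fraction of (embedding, label) pairs classified correctly."""
    correct = sum(classify(emb, prototypes) == label for emb, label in queries)
    return correct / len(queries)


# Toy stand-ins for image embeddings (real ones would be high-dimensional).
support = [
    ([1.0, 0.1], "landscape"),
    ([0.9, 0.2], "landscape"),
    ([0.1, 1.0], "wildlife"),
    ([0.2, 0.8], "wildlife"),
]
prototypes = class_prototypes(support)
queries = [([0.95, 0.05], "landscape"), ([0.0, 1.0], "wildlife")]
```

Calling `accuracy(queries, prototypes)` on the toy data scores both queries against the two prototypes. The same functions work unchanged with real embeddings, since they only assume equal-length numeric vectors.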

