The distribution of subpopulations is an important property hidden within a dataset. Uncovering and analyzing this distribution provides a comprehensive understanding of the dataset and serves as a powerful tool for various downstream tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery. Despite its importance, to our knowledge no prior work has systematically explored the subpopulation distribution of datasets. To address this limitation and solve all of these tasks in a unified way, we introduce the novel concept of subpopulation structures to represent, analyze, and utilize subpopulation distributions within datasets. To characterize the structures in an interpretable manner, we propose the Subpopulation Structure Discovery with Large Language Models (SSD-LLM) framework, which employs the world knowledge and instruction-following capabilities of Large Language Models (LLMs) to linguistically analyze informative image captions and summarize the structures. Furthermore, we propose complete workflows, named Task-specific Tuning, that apply the discovered structures to a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery.