Recent work in vision-and-language demonstrates that large-scale pretraining can learn generalizable models that transfer efficiently to downstream tasks. While this may improve dataset-scale aggregate metrics, analyzing performance on hand-crafted subgroups that target specific bias dimensions reveals systemic undesirable behaviors. However, such subgroup analysis is frequently stalled by annotation effort: collecting the necessary data requires extensive time and resources. Prior work attempts to circumvent these constraints by discovering subgroups automatically, but it typically relies on model behavior over existing task-specific annotations and degrades rapidly on inputs more complex than "tabular" data, and none of these methods study vision-and-language models. This paper presents VLSlice, an interactive system that enables user-guided discovery of coherent representation-level subgroups with consistent visiolinguistic behavior, denoted vision-and-language slices, from unlabeled image sets. In a user study (n=22), we show that VLSlice enables users to quickly generate diverse, high-coherency slices, and we release the tool publicly.