Visual In-Context Learning (VICL) aims to make progress towards adaptive vision models that can, based on a few examples, adapt to a new task at test time. Following the history of in-context learning in natural language processing research, where large, parameter-heavy models are in use, one pathway that current VICL methods take is to treat model and data scaling as key ingredients. Yet it is not clear whether these ingredients are the key for in-context learning to take shape in vision models. To stress-test such large models, we challenge them with an extreme counterexample: we train a tiny visual in-context model with merely $1$ million parameters on a modest $70,000$ images. We compare the results of this severely capacity-capped tiny model to $7,000\times$ larger VICL models in three adaptive settings: (1) on image data with small distribution shifts, (2) on unseen task encodings, and (3) on a completely new task, i.e., the setting VICL envisions. Given the chasm in training resources between the tiny and large models, our experiments expose shortcomings in how adaptive capabilities are measured, with respect to how tasks are encoded, which tasks were used in pre-training, and the choice of metrics. These gaps in current VICL benchmarking underscore a need for innovation in the evaluation of adaptive capabilities.