Recent advancements in text-only large language models (LLMs) have highlighted the benefit of in-context learning for adapting to new tasks with a few demonstrations. However, extending in-context learning to large vision-language models (VLMs) by training on huge amounts of naturalistic vision-language data has shown limited success, particularly for egocentric videos, due to high data collection costs. We propose a novel training method, $\mathbb{E}$fficient $\mathbb{I}$n-context $\mathbb{L}$earning on $\mathbb{E}$gocentric $\mathbb{V}$ideos ($\mathbb{EILEV}$), which elicits in-context learning in VLMs for egocentric videos without requiring massive, naturalistic egocentric video datasets. $\mathbb{EILEV}$ combines architectural and training data adaptations: it allows the model to process contexts interleaved with video clips and narrations, samples in-context examples with clusters of similar verbs and nouns, and uses data with skewed marginal distributions that have a long tail of infrequent verbs and nouns, as well as homonyms and synonyms. Our evaluations show that $\mathbb{EILEV}$-trained models outperform larger VLMs trained on a huge amount of naturalistic data in in-context learning. Furthermore, they can generalize via in-context learning not only to out-of-distribution but also to novel, rare egocentric videos and texts, demonstrating potential for applications requiring cost-effective training and rapid post-deployment adaptability. Our code and demo are available at \url{https://github.com/yukw777/EILEV}.