Large language models are well known to be effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstration examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We observe that many-shot ICL, with up to nearly 2,000 multimodal demonstration examples, leads to substantial improvements over few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro performance continues to improve approximately log-linearly up to the maximum number of examples tested on many datasets. Given the high inference costs associated with the long prompts required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can improve performance under both zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, we measure the models' ICL data efficiency, i.e., the rate at which they learn from additional demonstration examples. We find that while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at https://github.com/stanfordmlgroup/ManyICL .
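To make the setup concrete, the sketch below (not the authors' code; all names such as `Example` and `build_many_shot_prompt` are hypothetical) shows one plausible way to assemble a many-shot ICL prompt: interleave demonstration image/label pairs, then append a batch of unlabeled query images so several classifications are answered in a single API call.

```python
# Hypothetical sketch of many-shot ICL prompt construction with query batching.
# Image contents are stood in for by "<image:...>" placeholders; a real request
# would attach encoded images per the provider's multimodal API.
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    image_ref: str  # placeholder reference to an encoded image
    label: str      # ground-truth class label for the demonstration


def build_many_shot_prompt(demos: List[Example], queries: List[str]) -> List[dict]:
    """Interleave demonstration image/label turns, then append one user turn
    containing a batch of unlabeled query images to classify together."""
    messages = [{
        "role": "system",
        "content": "Classify each query image; answer one label per line.",
    }]
    for d in demos:
        messages.append({"role": "user", "content": f"<image:{d.image_ref}>"})
        messages.append({"role": "assistant", "content": d.label})
    # Batch all queries into a single final user turn (one call, many answers).
    batch = "\n".join(f"Query {i + 1}: <image:{q}>" for i, q in enumerate(queries))
    messages.append({"role": "user", "content": batch})
    return messages


demos = [Example(f"demo_{i}.png", "cat" if i % 2 else "dog") for i in range(4)]
prompt = build_many_shot_prompt(demos, ["q1.png", "q2.png"])
```

Under this framing, scaling to many-shot ICL means growing `demos` toward thousands of pairs, while batching amortizes that long shared prefix over up to 50 queries per call.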