This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method for combining different models to improve performance. In the recent work on Encyclopedic-VQA, the authors examine a wide variety of models for their task: vanilla LVLMs, models that include the caption as extra context, and models augmented with Lens-based retrieval of Wikipedia pages. Intuitively, these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (the best possible ensemble). So it should be a trivial exercise to create an ensemble with substantial real gains. Or is it?
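To make the oracle upper bound concrete, the following is a minimal sketch of how such a number could be computed. It assumes the oracle counts a question as solved whenever at least one model in the pool answers it correctly; the text does not spell out this definition. The model names and correctness flags are hypothetical toy data, not results from the paper.

```python
from typing import Dict, List


def single_model_accuracy(correct: List[bool]) -> float:
    """Fraction of questions one model answers correctly."""
    return sum(correct) / len(correct)


def oracle_ensemble_accuracy(per_model_correct: Dict[str, List[bool]]) -> float:
    """Upper bound for any ensemble that selects one model's answer per question.

    Assumption: a question counts as solved if at least one model is correct.
    """
    columns = list(per_model_correct.values())
    n_questions = len(columns[0])
    if any(len(col) != n_questions for col in columns):
        raise ValueError("all models must be scored on the same questions")
    # zip(*columns) yields, for each question, the correctness flags of all models.
    solved = sum(any(flags) for flags in zip(*columns))
    return solved / n_questions


# Hypothetical toy data: three models scored on five questions.
toy = {
    "vanilla_lvlm": [True, False, False, True, False],
    "caption_context": [False, True, False, True, False],
    "lens_retrieval": [True, False, True, False, False],
}

best_single = max(single_model_accuracy(flags) for flags in toy.values())
oracle = oracle_ensemble_accuracy(toy)
print(f"best single model: {best_single:.1%}, oracle ensemble: {oracle:.1%}")
```

The gap between the two numbers measures how complementary the models are. It says nothing about whether a practical ensemble can decide, without access to the ground truth, which model to trust on a given question.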