While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has led to increased interest in the community in developing either new benchmarks or new models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms the other baselines, and that modelling innovations can impact performance more than scaling Web data, which sometimes even degrades performance. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics and discover that, for some tasks, performance either peaks early in training or fluctuates significantly without ever converging.