Object-centric (OC) representations, which represent the state of a visual scene by modeling it as a composition of objects, have the potential to be used in various downstream tasks to achieve systematic compositional generalization and facilitate reasoning. However, these claims have not been thoroughly analyzed yet. Recently, foundation models have demonstrated unparalleled capabilities across diverse domains from language to computer vision, marking them as a potential cornerstone of future research for a multitude of computational tasks. In this paper, we conduct an extensive empirical study on representation learning for downstream Visual Question Answering (VQA), which requires an accurate compositional understanding of the scene. We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches including large pre-trained foundation models on both synthetic and real-world data, and demonstrate a viable way to achieve the best of both worlds. The extensiveness of our study, encompassing over 800 downstream VQA models and 15 different types of upstream representations, also provides several additional insights that we believe will be of interest to the community at large.