Systematic compositionality, the ability to adapt to novel situations by building a mental model of the world from reusable pieces of knowledge, remains a significant challenge in machine learning. While there has been considerable progress in the language domain, efforts towards systematic visual imagination, that is, envisioning the dynamical implications of a visual observation, are in their infancy. We introduce the Systematic Visual Imagination Benchmark (SVIB), the first benchmark designed to address this problem head-on. SVIB offers a novel framework for a minimal world modeling problem, in which models are evaluated on their ability to generate one-step image-to-image transformations under latent world dynamics. The framework makes it possible to jointly optimize for systematic perception and imagination, provides a range of difficulty levels, and allows control over the fraction of possible factor combinations used during training. We provide a comprehensive evaluation of various baseline models on SVIB, offering insight into the current state of the art in systematic visual imagination. We hope that this benchmark will help advance visual systematic compositionality.