The ability to compose learned concepts and apply them in novel settings is central to human intelligence, yet it remains a persistent limitation of state-of-the-art machine learning models. To address this gap, we introduce COGITAO, a modular and extensible data generation framework and benchmark designed to systematically study compositionality and generalization in visual domains. Drawing inspiration from the problem setting of ARC-AGI, COGITAO constructs rule-based tasks that apply a set of transformations to objects in grid-like environments. It supports composition, at adjustable depth, over a set of 28 interoperable transformations, along with extensive control over grid parametrization and object properties. This flexibility enables the creation of millions of unique task rules (several orders of magnitude more than existing datasets) across a wide range of difficulties, while allowing virtually unlimited sample generation per rule. We provide baseline experiments with state-of-the-art vision models, highlighting their consistent failure to generalize to novel combinations of familiar elements despite strong in-domain performance. COGITAO is fully open-sourced, including all code and datasets, to support continued research in this area.