Structural understanding of complex visual objects is an important unsolved component of artificial intelligence. To study this, we develop a new technique for the recently proposed Break-and-Make problem in LTRON, in which an agent must learn to build a previously unseen LEGO assembly using a single interactive session to gather information about its components and their structure. We attack this problem by building an agent, which we call \textbf{\ours}, that makes its own visual instruction book. By disassembling an unseen assembly and periodically saving images of it, the agent creates a set of instructions that contains the information necessary to rebuild it. These instructions form an explicit memory that allows the model to reason about the assembly process one step at a time, avoiding the need for long-term implicit memory. This in turn allows us to train on much larger LEGO assemblies than has been possible in the past. To demonstrate the power of this model, we release a new dataset of procedurally built LEGO vehicles that contain an average of 31 bricks each and require over one hundred steps to disassemble and reassemble. We train the model using online imitation learning, which allows it to learn from its own mistakes. Finally, we provide several small improvements to LTRON and the Break-and-Make problem that simplify the learning environment and improve usability.
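The instruction-book idea can be illustrated with a toy sketch. This is not the agent or the LTRON API: the names `break_phase` and `make_phase`, and the use of a tuple of brick names in place of a rendered image, are assumptions made purely for illustration. The point it shows is that each rebuilding step consults only one saved page, so no memory of earlier steps is required.

```python
from typing import List, Tuple

# Hypothetical stand-in: a tuple of brick names plays the role of a saved image.
Assembly = Tuple[str, ...]


def break_phase(target: Assembly) -> List[Assembly]:
    """Disassemble one brick at a time, saving a snapshot before each removal."""
    instructions: List[Assembly] = []
    current = list(target)
    while current:
        instructions.append(tuple(current))  # save one "page" of the book
        current.pop()                        # remove the most recently added brick
    return instructions


def make_phase(instructions: List[Assembly]) -> Assembly:
    """Rebuild by replaying the saved pages in reverse order."""
    current: List[str] = []
    for page in reversed(instructions):
        # Compare the current state with a single page and add what is missing.
        current.extend(page[len(current):])
    return tuple(current)


if __name__ == "__main__":
    vehicle = ("chassis", "wheel", "wheel", "cab")
    book = break_phase(vehicle)
    print(make_phase(book) == vehicle)
```

In this toy setting every page differs from the previous one by exactly one brick. In the setting described above, images are saved only periodically, and the agent must infer the intermediate actions from visual observations.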