Natural data is redundant, yet predominant architectures tile computation uniformly across their input and output space. We propose Recurrent Interface Networks (RINs), an attention-based architecture that decouples its core computation from the dimensionality of the data, enabling adaptive computation for more scalable generation of high-dimensional data. RINs focus the bulk of computation (i.e., global self-attention) on a set of latent tokens, using cross-attention to read and write (i.e., route) information between latent and data tokens. Stacking RIN blocks allows bottom-up (data to latent) and top-down (latent to data) feedback, leading to deeper and more expressive routing. While this routing introduces challenges, they are less problematic in recurrent computation settings where the task (and routing problem) changes gradually, such as iterative generation with diffusion models. We show how to leverage recurrence by conditioning the latent tokens at each forward pass of the reverse diffusion process on those from prior computation, i.e., latent self-conditioning. RINs yield state-of-the-art pixel diffusion models for image and video generation, scaling to 1024×1024 images without cascades or guidance, while being domain-agnostic and up to 10× more efficient than 2D and 3D U-Nets.
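The read, compute, write routing and the latent self-conditioning described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the attention here is single-head with no learned projections, normalization, or MLPs, and the function names (`attention`, `rin_block`, `rin_forward`), the token counts, and the plain additive warm start of the latents are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys_values):
    # Single-head scaled dot-product attention without learned
    # projections (a simplification assumed for this sketch).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


def rin_block(latents, data, num_latent_layers=2):
    # Read: latents gather information from data tokens (bottom-up).
    latents = latents + attention(latents, data)
    # Compute: global self-attention runs on the latents only, so its
    # cost depends on the number of latents, not on the data size.
    for _ in range(num_latent_layers):
        latents = latents + attention(latents, latents)
    # Write: data tokens pull information back from latents (top-down).
    data = data + attention(data, latents)
    return latents, data


def rin_forward(data, init_latents, prev_latents=None, num_blocks=2):
    # Latent self-conditioning (assumed form): warm-start the latents
    # with those produced by the previous forward pass.
    if prev_latents is None:
        latents = init_latents
    else:
        latents = init_latents + prev_latents
    for _ in range(num_blocks):
        latents, data = rin_block(latents, data)
    return data, latents


# Hypothetical sizes: 64 data tokens, 8 latent tokens, width 16.
rng = np.random.default_rng(0)
data = rng.normal(size=(64, 16))
init_latents = rng.normal(size=(8, 16))

prev = None
for step in range(3):
    # In a real sampler, `data` would be the progressively denoised
    # sample; a fixed array stands in for it here.
    out, prev = rin_forward(data, init_latents, prev)
```

In this sketch the latent self-attention operates on an 8 by 8 score matrix regardless of how many data tokens there are, while the read and write steps scale linearly with the data token count. That is the sense in which the core computation is decoupled from the data dimensionality.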