Diffusion models based on Multi-Head Attention (MHA) have become ubiquitous for generating high-quality images and videos. However, encoding an image or a video as a sequence of patches leads to costly attention patterns, as both memory and compute requirements grow quadratically with the sequence length. To alleviate this problem, we propose a drop-in replacement for MHA called the Polynomial Mixer (PoM), which has the benefit of encoding the entire sequence into an explicit state. PoM has linear complexity with respect to the number of tokens. This explicit state also allows us to generate frames sequentially, minimizing memory and compute requirements, while still enabling parallel training. We show that the Polynomial Mixer is a universal sequence-to-sequence approximator, just like regular MHA. We adapt several Diffusion Transformers (DiT) for image and video generation, replacing MHA with PoM, and obtain high-quality samples while using fewer computational resources. The code is available at https://github.com/davidpicard/HoMM.
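To illustrate the complexity argument, the sketch below shows a generic linear-complexity token mixer that accumulates an explicit fixed-size state from second-order (polynomial) features of the tokens in a single pass, then reads the state back per token. This is a minimal schematic under our own assumptions, not the paper's actual PoM formulation; the function name and the specific state update are hypothetical.

```python
import numpy as np

def toy_linear_state_mixer(x):
    """Schematic linear-complexity token mixer (NOT the exact PoM layer).

    Accumulates an explicit (d, d) state from second-order features of
    each token, then mixes every token with that state. Cost is
    O(N * d^2) in the number of tokens N, unlike the O(N^2 * d)
    pairwise scores of softmax attention.
    """
    n, d = x.shape
    state = np.zeros((d, d))           # explicit fixed-size state
    for t in range(n):                 # single pass: linear in N
        state += np.outer(x[t], x[t])  # second-order feature update
    state /= n
    return x @ state                   # read-out: mix each token with the state

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))  # 16 tokens of dimension 8
out = toy_linear_state_mixer(tokens)
print(out.shape)  # (16, 8)
```

Because the state has a fixed size independent of the sequence length, it can be updated incrementally as new tokens (e.g. frames) arrive, which is what enables sequential generation with bounded memory.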