We propose a diffusion model-based approach, FloAtControlNet, to generate cinemagraphs composed of animations of human clothing. We focus on garments such as dresses, skirts, and pants. The input to our model is a text prompt specifying the type of clothing and its texture (e.g., leopard, striped, or plain), together with a sequence of normal maps that captures the animation desired in the output. The backbone of our method is a normal-map-conditioned ControlNet operated in a training-free regime. The key observation is that the underlying animation is embedded in the optical flow of the normal maps. We use this flow to manipulate the self-attention maps of appropriate layers. Specifically, the self-attention map of a given layer and frame is recomputed as a linear combination of itself and the self-attention map of the same layer at the previous frame, warped by the flow computed between the normal maps of the two frames. We show that manipulating the self-attention maps in this way substantially improves the quality of the clothing animation, making it look more natural while suppressing background artifacts. Through extensive experiments, we show that the proposed method outperforms all baselines both in qualitative visual comparisons and in a user study. In particular, our method alleviates the background flickering present in the other diffusion model-based baselines we consider. In addition, our method beats all baselines in terms of RMSE and PSNR computed between the input normal map sequences and the normal map sequences extracted from the output RGB frames. Finally, we show that well-established metrics such as LPIPS, SSIM, and CLIP score, which target general visual quality, are not necessarily suitable for capturing the subtle motions in human clothing animations.
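The flow-guided attention manipulation described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual code: the function names (`warp`, `blend_attention`), the dense per-pixel flow representation, and the blending weight `lam` are assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F


def warp(attn_prev, flow):
    """Warp a (B, C, H, W) attention map with a dense (B, H, W, 2) pixel flow
    using bilinear sampling. The flow is assumed to be in pixel units."""
    B, C, H, W = attn_prev.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Identity sampling grid in (x, y) order, displaced by the flow.
    grid = torch.stack([xs, ys], dim=-1).float().unsqueeze(0) + flow
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(attn_prev, grid, align_corners=True)


def blend_attention(attn_cur, attn_prev, flow, lam=0.5):
    """Recompute the current frame's self-attention map as a linear
    combination of itself and the flow-warped previous frame's map."""
    return lam * attn_cur + (1.0 - lam) * warp(attn_prev, flow)
```

With zero flow the warp reduces to the identity, so `blend_attention` simply interpolates between the two frames' attention maps; a nonzero flow drags the previous frame's attention along the motion encoded in the normal maps.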