Unified, scalable Transformers have recently achieved remarkable success in modeling phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering, and motion in videos. In this work, we go a step further and investigate whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to a single scenario, are intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios (body-driven garments, robotic manipulation, and free-fall collisions) with a single model and achieves approximately $4$--$9{\times}$ lower error than prior state-of-the-art methods across all scenarios; (2) a scalable latent-space formulation that compresses meshes of arbitrary resolution into a fixed-size set of latent tokens, making the temporal dynamics computation independent of mesh resolution; and (3) a diverse, high-fidelity, penetration-free dataset of ${\sim}$493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts.
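To illustrate contribution (2), the following is a minimal sketch of one way a mesh of arbitrary resolution could be compressed into a fixed-size set of latent tokens. The abstract does not specify the mechanism, so the sketch assumes cross-attention pooling with learned queries; the function `compress_to_latents`, the parameters `latent_queries`, `w_k`, and `w_v`, and all sizes are hypothetical and are not taken from ClothTransformer.

```python
import numpy as np


def compress_to_latents(vertex_feats, latent_queries, w_k, w_v):
    """Pool N per-vertex features into K latent tokens via cross-attention.

    Assumed design (not confirmed by the source): K learned queries attend
    over all N vertices, so the output shape (K, d) does not depend on N.
    """
    keys = vertex_feats @ w_k                        # (N, d)
    values = vertex_feats @ w_v                      # (N, d)
    scores = latent_queries @ keys.T / np.sqrt(keys.shape[-1])  # (K, N)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over vertices
    return weights @ values                          # (K, d)


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
feat_dim, latent_dim, num_latents = 6, 16, 8
latent_queries = rng.normal(size=(num_latents, latent_dim))
w_k = rng.normal(size=(feat_dim, latent_dim))
w_v = rng.normal(size=(feat_dim, latent_dim))

# Two meshes with very different vertex counts map to the same token shape.
coarse_mesh = rng.normal(size=(500, feat_dim))
fine_mesh = rng.normal(size=(20000, feat_dim))
z_coarse = compress_to_latents(coarse_mesh, latent_queries, w_k, w_v)
z_fine = compress_to_latents(fine_mesh, latent_queries, w_k, w_v)
print(z_coarse.shape, z_fine.shape)
```

Under this assumption, any temporal model operating on the latent tokens sees a sequence of fixed length `num_latents` per frame, regardless of how finely the cloth is meshed; only the encoding step scales with the vertex count.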