Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further and investigate whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios -- body-driven garments, robotic manipulation, and free-fall collisions -- with a single model and achieves approximately $4$--$9{\times}$ lower error than prior state-of-the-art methods across all scenarios; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making the computation of temporal dynamics independent of mesh resolution; and (3) a diverse, high-fidelity, penetration-free dataset of ${\sim}$493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts. Project Page: https://yucrazing.github.io/clothtransformer/
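To illustrate contribution (2), the following is a minimal sketch of how a mesh with any number of vertices could be compressed into a fixed-size set of latent tokens. It assumes a Perceiver-style cross-attention in which a fixed number of query vectors attend over per-vertex features; the abstract does not specify the actual mechanism, so the function `compress_to_latents`, the helper names, and all sizes here are hypothetical, and random matrices stand in for learned weights.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix; a stand-in for learned parameters or mesh features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def compress_to_latents(vertex_feats, latent_queries, w_k, w_v):
    """Cross-attend K query vectors over N vertex features, returning K tokens.

    Assumed mechanism (not confirmed by the source): each latent query forms a
    softmax-weighted average of the projected vertex features.
    """
    keys = matmul(vertex_feats, w_k)                  # (N, d)
    values = matmul(vertex_feats, w_v)                # (N, d)
    keys_t = [list(col) for col in zip(*keys)]        # (d, N)
    scale = math.sqrt(len(latent_queries[0]))
    scores = matmul(latent_queries, keys_t)           # (K, N)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)                    # (K, d)

rng = random.Random(0)
feat_dim, d, num_latents = 6, 16, 8                   # hypothetical sizes
latent_queries = rand_matrix(num_latents, d, rng)
w_k = rand_matrix(feat_dim, d, rng)
w_v = rand_matrix(feat_dim, d, rng)

coarse_mesh = rand_matrix(200, feat_dim, rng)         # 200 vertices
fine_mesh = rand_matrix(2000, feat_dim, rng)          # 2000 vertices

z_coarse = compress_to_latents(coarse_mesh, latent_queries, w_k, w_v)
z_fine = compress_to_latents(fine_mesh, latent_queries, w_k, w_v)
print(len(z_coarse), len(z_coarse[0]), len(z_fine), len(z_fine[0]))
```

Both meshes map to the same number of tokens with the same width, so a temporal model operating on these tokens would see an input of constant size regardless of mesh resolution. Under this assumption, only the compression step itself scales with the vertex count.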