We introduce T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. Because language is inherently ambiguous, generating a precisely intended motion from text alone remains challenging. To address this, we adopt 3D trajectories as controllable spatial guidance that specifies the exact paths along which selected points should move. By combining the two conditions, T2Mo generates object motions that spatially adhere to the given trajectories while globally reflecting the text semantics. To robustly handle trajectory inputs of arbitrary configuration, from dense to sparse and unevenly distributed, we further propose a shape-grounded trajectory embedding that maps an input trajectory set to a shape-aware token set covering the entire object. We conduct extensive comparisons against text-based baselines and against cascaded video-based baselines that combine trajectory-guided video generation with video-to-dynamic-mesh generation. Quantitative and qualitative evaluations, along with user studies, show that our approach produces motions that follow the given prompts more faithfully and more expressively while preserving motion quality.