Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian-zeng.github.io/projects/gmt/
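As a rough illustration of what a continuous 6-DOF pose sequence with a target end pose can look like as data, the sketch below stores each pose as a 3D position plus a unit quaternion and measures how far the final pose lies from the goal. The names `Pose`, `yaw_pose`, `translation_error`, and `rotation_error_deg`, the quaternion parameterization, and the sample values are all assumptions made for illustration; the abstract does not specify GMT's rotation representation or its evaluation metrics.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Pose:
    """One 6-DOF pose: 3 translational + 3 rotational degrees of freedom."""
    position: Tuple[float, float, float]
    quaternion: Tuple[float, float, float, float]  # assumed (w, x, y, z), unit norm


def yaw_pose(x: float, y: float, z: float, yaw_deg: float) -> Pose:
    """Hypothetical helper: pose rotated about the z-axis by yaw_deg degrees."""
    half = math.radians(yaw_deg) / 2.0
    return Pose((x, y, z), (math.cos(half), 0.0, 0.0, math.sin(half)))


def translation_error(a: Pose, b: Pose) -> float:
    """Euclidean distance between the two positions."""
    return math.dist(a.position, b.position)


def rotation_error_deg(a: Pose, b: Pose) -> float:
    """Geodesic angle between two unit quaternions, in degrees."""
    dot = abs(sum(p * q for p, q in zip(a.quaternion, b.quaternion)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))


# A toy trajectory (illustrative values only) and a target end pose.
trajectory: List[Pose] = [
    yaw_pose(0.0, 0.0, 0.0, 0.0),
    yaw_pose(0.5, 0.0, 0.1, 45.0),
    yaw_pose(1.0, 0.0, 0.0, 80.0),
]
goal = yaw_pose(1.0, 0.0, 0.0, 90.0)

final = trajectory[-1]
print(translation_error(final, goal), rotation_error_deg(final, goal))
```

Separating the positional and rotational errors mirrors the abstract's distinction between spatial accuracy and orientation control, although the measures the authors actually report may differ.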