We study zero-shot 3D alignment of two given meshes, using a text prompt describing their spatial relation -- an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer, without training a new model. Our framework augments language supervision with geometry-aware objectives: a soft variant of the Iterative Closest Point (ICP) objective to encourage surface attachment, and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, and camera control concentrates the optimization on the interaction region. To enable evaluation, we curate a benchmark containing diverse categories and relations, and compare against baselines. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.
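To make the optimization structure concrete, below is a minimal sketch of the geometric side of such a test-time loop: a softmin (smooth Chamfer-style) attachment term standing in for the soft-ICP objective, a bounding-sphere interpenetration proxy standing in for the penetration loss, and a weight ramp standing in for the phased schedule. All function names, the sphere proxy, and the finite-difference translation-only optimizer are illustrative assumptions, not the paper's implementation; the CLIP loss, differentiable renderer, rotation, and scale updates are omitted.

```python
import numpy as np

def soft_icp_loss(src, tgt, tau=0.5):
    """Soft nearest-neighbour attachment: a smooth (softmin) stand-in for a
    one-sided Chamfer/soft-ICP distance from `src` points to `tgt` points."""
    d = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)  # (N, M)
    z = -d / tau
    m = z.max(axis=1, keepdims=True)  # max-shift for a stable log-sum-exp
    softmin = -tau * (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True)))
    return float(softmin.mean())

def penetration_loss(src, center, radius):
    """Penalise `src` points that fall inside a bounding sphere of the target
    (a crude interpenetration proxy; a real system would test against the mesh)."""
    d = np.linalg.norm(src - center, axis=-1)
    return float(np.clip(radius - d, 0.0, None).mean())

def align_translation(src, tgt, steps=300, lr=0.2, w_pen_max=1.0, eps=1e-3):
    """Test-time optimisation of a translation `t` with a phased schedule:
    the penetration weight ramps from 0 to `w_pen_max` over the run."""
    t = np.zeros(3)
    center = tgt.mean(axis=0)
    radius = np.linalg.norm(tgt - center, axis=-1).max()

    def total(tv, w_pen):
        moved = src + tv
        return soft_icp_loss(moved, tgt) + w_pen * penetration_loss(moved, center, radius)

    for step in range(steps):
        w_pen = w_pen_max * step / steps          # phased contact schedule
        grad = np.zeros(3)
        for i in range(3):                        # central finite differences
            e = np.zeros(3); e[i] = eps
            grad[i] = (total(t + e, w_pen) - total(t - e, w_pen)) / (2 * eps)
        t -= lr * grad
    return t
```

In the full method the gradients would instead come from backpropagating a CLIP similarity score through a differentiable renderer, jointly with these geometric terms, and the pose would include rotation and isotropic scale; this sketch only shows how attachment and penetration terms can trade off under a ramped weight.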