Text-conditioned 3D generation has progressed rapidly for images and isolated objects, but producing a hand-object mesh remains challenging: the output must preserve language semantics, cross-view consistency, object geometry, articulated hand shape, and physically plausible contact. We present TextHOI-3D, a staged framework that uses generated multi-view observations as an explicit interface between text-conditioned visual generation and geometry-aware hand-object recovery. TextHOI-3D learns a compact VQ token space for fixed-camera hand-object observations, predicts multi-view visual tokens from text with a CLIP-conditioned visual autoregressive model, and recovers a unified hand-object mesh through prior initialization, multi-view joint optimization, and anti-penetration refinement. The design separates semantic generation from geometric recovery while keeping the two stages connected through a discrete multi-view representation. On HO3D-derived evaluations, the multi-view setting reduces the object Chamfer distance (CD) from 17.26 mm to 4.92 mm and the penetration volume from 5.3721 cm³ to 0.2193 cm³ relative to a single-view counterpart, while also improving hand errors and surface F-scores. These results support multi-view visual tokens as an effective intermediate representation for text-driven 3D hand-object mesh creation.
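The object CD and surface F-score reported above are standard point-set metrics. The following is a minimal sketch of how such metrics are commonly computed on points sampled from predicted and ground-truth surfaces. It is illustrative only: the function names (`pairwise_distances`, `chamfer_distance`, `f_score`), the symmetric averaging of unsquared distances, and the threshold value are assumptions, since the exact definitions used by TextHOI-3D are not stated here.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between every point in a (N, 3) and b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance (assumed convention: mean of the two
    directed mean nearest-neighbour distances, unsquared)."""
    d = pairwise_distances(pred, gt)
    pred_to_gt = d.min(axis=1).mean()  # each predicted point to nearest GT point
    gt_to_pred = d.min(axis=0).mean()  # each GT point to nearest predicted point
    return 0.5 * (pred_to_gt + gt_to_pred)


def f_score(pred, gt, threshold):
    """F-score at a distance threshold (assumed value, same units as the points)."""
    d = pairwise_distances(pred, gt)
    precision = (d.min(axis=1) < threshold).mean()
    recall = (d.min(axis=0) < threshold).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy point sets in metres: the prediction is the ground truth shifted by 3 mm.
    gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    pred = gt + np.array([0.0, 0.0, 0.003])
    print("CD (mm):", 1000.0 * chamfer_distance(pred, gt))
    print("F-score @ 5 mm:", f_score(pred, gt, threshold=0.005))
```

Conventions differ across papers (squared versus unsquared distances, sum versus mean of the two directions), so absolute values are comparable only under the same definition. The brute-force distance matrix is adequate for a few thousand sampled points; larger sets would normally use a spatial index instead.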