Generating realistic human grasps is crucial yet challenging for object manipulation in computer graphics and robotics. Current methods often struggle to generate detailed, realistic grasps with full finger-object interaction, as they typically rely on encoding the entire hand and estimating both posture and position in a single step. Additionally, simulating object deformation during grasp generation remains difficult, as modeling such deformation requires capturing the comprehensive relationships among points on the object's surface. To address these limitations, we propose a novel improved Decomposed Vector-Quantized Variational Autoencoder (DVQ-VAE-2), which decomposes the hand into distinct parts and encodes them separately. This part-aware architecture allows for more precise management of hand-object interactions. Furthermore, we introduce a dual-stage decoding strategy that first predicts the grasp type under skeletal constraints and then identifies the optimal grasp position, enhancing both the realism and the adaptability of the model to unseen interactions. In addition, we introduce a new Mesh UFormer as the backbone network to extract hierarchical structural representations from the mesh, and we propose a new normal vector-guided position encoding to simulate hand-object deformation. In experiments, our model achieves a relative improvement of approximately 14.1% in grasp quality over state-of-the-art methods across four widely used benchmarks. Comparisons with other backbone networks show relative improvements of 2.23% in Hand-object Contact Distance and 5.86% in Quality Index on deformable- and rigid-object datasets, respectively. Our source code and model are available at https://github.com/florasion/D-VQVAE.
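The abstract does not specify the exact form of the normal vector-guided position encoding, so the following is only a minimal illustrative sketch of one plausible variant: standard sinusoidal position encoding applied to each vertex's offset projected onto its surface normal, so that the encoding varies with displacement along the deformation direction. All function and variable names here are hypothetical and not taken from the paper.

```python
import numpy as np

def normal_guided_pos_encoding(verts, normals, d_model=16):
    """Hypothetical sketch of a normal vector-guided position encoding.

    Each vertex's offset from the mesh centroid is projected onto its
    unit normal, and the resulting scalar coordinate is expanded with
    standard sinusoidal frequency bands. The actual DVQ-VAE-2
    formulation is not specified in the abstract.
    """
    # Signed offset of each vertex from the centroid, projected on its normal
    offsets = verts - verts.mean(axis=0)
    proj = np.einsum('ij,ij->i', offsets, normals)  # one scalar per vertex

    # Standard sinusoidal frequency bands over the projected coordinate
    bands = np.arange(d_model // 2)
    freqs = 1.0 / (10000 ** (2 * bands / d_model))
    angles = proj[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# Toy usage: 4 vertices of a tetrahedron with (arbitrary) unit normals
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
normals = np.ones((4, 3)) / np.sqrt(3.0)
enc = normal_guided_pos_encoding(verts, normals)
print(enc.shape)  # (4, 16)
```

Under this sketch, vertices that move along their normals (as in object deformation) receive measurably different encodings, which is the property the abstract's encoding is presumably designed to exploit.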