Despite recent progress on accurate 3D annotation of hand and object datasets, gaps remain in 3D hand and object reconstruction. Existing works leverage contact maps to refine inaccurate hand-object pose estimates and to generate grasps given object models. However, they require explicit 3D supervision, which is seldom available, and are therefore limited to constrained settings, e.g., where thermal cameras observe the residual heat left on manipulated objects. In this paper, we propose a novel semi-supervised framework that allows us to learn contact from monocular images. Specifically, we leverage visual and geometric consistency constraints in large-scale datasets to generate pseudo-labels for semi-supervised learning, and we propose an efficient graph-based network to infer contact. Our semi-supervised learning framework improves over existing supervised learning methods trained on data with "limited" annotations. Notably, our proposed model achieves superior results with less than half the network parameters and memory access cost of the commonly used PointNet-based approach. We show that using a contact map that governs hand-object interactions produces more accurate reconstructions. We further demonstrate that training with pseudo-labels can extend contact map estimation to out-of-domain objects and generalise better across multiple datasets.