Generalizing tool manipulation requires both semantic planning and precise physical control. Modern generalist robot policies, such as Vision-Language-Action (VLA) models, often lack the high-fidelity physical grounding required for contact-rich tool manipulation. Conversely, existing contact-aware policies that leverage tactile or haptic sensing are typically instance-specific and fail to generalize across diverse tool geometries. Bridging this gap requires learning unified contact representations from diverse data, yet a fundamental barrier remains: collecting diverse real-world tactile data is prohibitively expensive at scale, while direct zero-shot sim-to-real transfer is challenging due to the complex dynamics of nonlinear deformation in soft sensors. To address this, we propose Semantic-Contact Fields (SCFields), a unified 3D representation that fuses visual semantics with dense contact estimates. We enable this via a two-stage Sim-to-Real Contact Learning Pipeline: first, we pre-train on a large simulation dataset to learn general contact physics; second, we fine-tune on a small set of real data, pseudo-labeled via geometric heuristics and force optimization, to align with real sensor characteristics. This enables physical generalization to unseen tools. We use SCFields as the dense observation input to a diffusion policy, enabling robust execution of contact-rich tool manipulation tasks. Experiments on scraping, crayon drawing, and peeling demonstrate robust category-level generalization, significantly outperforming vision-only and raw-tactile baselines.
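To make the representation concrete, here is a minimal sketch of what a fused per-point semantic-contact observation could look like. All names and the feature layout (`build_scfield`, per-point semantic features concatenated with a contact probability) are hypothetical illustrations, not the paper's actual architecture; the real SCFields pipeline involves learned networks and a sim-to-real training procedure not shown here.

```python
import numpy as np

def build_scfield(points, semantic_feats, contact_logits):
    """Fuse per-point 3D positions, semantic features, and contact
    estimates into one dense per-point observation array.

    points:         (N, 3) point-cloud coordinates
    semantic_feats: (N, D) per-point visual-semantic features
    contact_logits: (N,)   raw per-point contact scores

    Returns an (N, 3 + D + 1) array; layout is a hypothetical choice.
    """
    # Squash raw contact scores into probabilities in [0, 1].
    contact_prob = 1.0 / (1.0 + np.exp(-contact_logits))
    # Concatenate geometry, semantics, and contact along the feature axis.
    return np.concatenate(
        [points, semantic_feats, contact_prob[:, None]], axis=1
    )

# Toy example with random stand-in data.
rng = np.random.default_rng(0)
N, D = 256, 8
points = rng.normal(size=(N, 3))
feats = rng.normal(size=(N, D))
logits = rng.normal(size=N)

obs = build_scfield(points, feats, logits)
print(obs.shape)  # one dense array a policy could consume per timestep
```

In this toy layout, a downstream policy (e.g. a diffusion policy, as in the abstract) would consume `obs` as its per-timestep observation instead of raw images or raw tactile readings.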