Autonomous manipulation of articulated objects remains a fundamental challenge for robots in human environments. Vision-based methods can infer hidden kinematics but can yield imprecise estimates on unfamiliar objects. Tactile approaches achieve robust control through contact feedback but require accurate initialization. This suggests a natural synergy: vision for global guidance, touch for local precision. Yet no framework systematically exploits this complementarity for generalized articulated manipulation. Here we present Vi-TacMan, which uses vision to propose grasps and coarse directions that seed a tactile controller for precise execution. By incorporating surface normals as geometric priors and modeling directions via von Mises-Fisher distributions, our approach achieves significant gains over baselines (all p<0.0001). Critically, manipulation succeeds without explicit kinematic models: the tactile controller refines coarse visual estimates through real-time contact regulation. Tests on more than 50,000 simulated objects and on diverse real-world objects confirm robust cross-category generalization. This work establishes that coarse visual cues suffice for reliable manipulation when coupled with tactile feedback, offering a scalable paradigm for autonomous systems in unstructured environments.
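To illustrate the directional modeling mentioned above, the sketch below samples manipulation directions from a von Mises-Fisher distribution on the unit sphere, with a surface normal serving as the mean direction. This is a minimal, generic illustration of the technique, not Vi-TacMan's implementation: the concentration parameter `kappa`, the choice of the normal as the mean, and the sampler itself are assumptions for demonstration.

```python
import numpy as np

def sample_vmf_3d(mu, kappa, rng):
    """Draw one unit vector from a von Mises-Fisher distribution on S^2.

    Uses the closed-form inverse CDF available in 3-D for the cosine
    w = mu . x, whose density is proportional to exp(kappa * w).
    """
    mu = np.asarray(mu, dtype=float)
    mu = mu / np.linalg.norm(mu)
    u = rng.uniform()
    w = 1.0 + np.log(u + (1.0 - u) * np.exp(-2.0 * kappa)) / kappa
    # Uniform angle in the tangent plane of mu.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    # Orthonormal basis {e1, e2} spanning the plane orthogonal to mu.
    helper = np.array([1.0, 0.0, 0.0]) if abs(mu[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(mu, helper)
    e1 = e1 / np.linalg.norm(e1)
    e2 = np.cross(mu, e1)
    tangent = np.cos(theta) * e1 + np.sin(theta) * e2
    return w * mu + np.sqrt(max(0.0, 1.0 - w * w)) * tangent

rng = np.random.default_rng(0)
normal = np.array([0.0, 0.0, 1.0])  # hypothetical surface normal used as geometric prior
samples = np.stack([sample_vmf_3d(normal, kappa=50.0, rng=rng) for _ in range(1000)])
print(samples.mean(axis=0))  # concentrates near the normal for large kappa
```

Larger `kappa` concentrates samples around the prior direction, while smaller values express greater uncertainty in the coarse visual estimate, which the tactile controller can then correct during execution.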