The deployment of humanoid robots for dexterous manipulation in unstructured environments remains challenging due to perceptual limitations that constrain the effective workspace. In scenarios where physical constraints prevent the robot from repositioning itself, maintaining omnidirectional awareness becomes far more critical than color or semantic information. While recent advances in visuomotor policy learning have improved manipulation capabilities, conventional RGB-D solutions suffer from narrow fields of view (FOV) and self-occlusion, requiring frequent base movements that introduce motion uncertainty and safety risks. Existing approaches to expanding perception, including active vision systems and third-view cameras, introduce mechanical complexity, calibration dependencies, and latency that hinder reliable real-time performance. In this work, we propose Omni-Manip, an end-to-end LiDAR-driven 3D visuomotor policy that enables robust manipulation across large workspaces. Our method processes panoramic point clouds through a Time-Aware Attention Pooling mechanism, efficiently encoding sparse 3D data while capturing temporal dependencies. This 360° perception allows the robot to interact with objects across wide areas without frequent repositioning. To support policy learning, we develop a whole-body teleoperation system for efficient collection of full-body coordination data. Extensive experiments in simulation and real-world environments show that Omni-Manip achieves robust performance in large-workspace and cluttered scenarios, outperforming baselines that rely on egocentric depth cameras.
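The abstract does not expose implementation details of the Time-Aware Attention Pooling mechanism, but the core idea it names, attention-weighted pooling over sparse point features with temporal information injected per frame, can be sketched in a few lines. The following is a minimal, hypothetical illustration, not the authors' implementation; the function name, the additive time embedding, and the single learned query vector `w_query` are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def time_aware_attention_pool(frames, w_query, time_embed):
    """Hypothetical sketch of time-aware attention pooling.

    frames:     (T, N, D) per-frame features of N sparse points over T timesteps
    w_query:    (D,) learned query vector scoring each point's relevance
    time_embed: (T, D) per-timestep embedding injecting temporal context

    Returns a single (D,) vector pooled over all points and timesteps,
    so later frames and earlier frames can be weighted differently.
    """
    T, N, D = frames.shape
    # add the timestep embedding to every point in that frame
    feats = frames + time_embed[:, None, :]          # (T, N, D)
    flat = feats.reshape(T * N, D)                   # flatten time and points
    scores = flat @ w_query                          # (T*N,) attention logits
    weights = softmax(scores)                        # normalized over all points
    return weights @ flat                            # (D,) weighted sum
```

In this reading, sparsity is handled naturally: only observed points contribute to the weighted sum, and the time embedding lets the pooling distinguish identical geometry seen at different timesteps.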