Robotic manipulation in unstructured environments demands reliable execution under diverse conditions, yet many state-of-the-art systems still struggle with high-dimensional action spaces, sparse rewards, and slow generalization beyond carefully curated training scenarios. We study these limitations through the example of grasping in space environments. We learn control policies directly in a learned latent manifold that fuses multiple modalities into a structured representation for policy decision-making. Building on GPU-accelerated physics simulation, we instantiate a set of single-shot manipulation tasks and achieve over 95% task success with Soft Actor-Critic (SAC)-based reinforcement learning in fewer than 1M environment steps, with grasping conditions varied continuously from the first training step. Under the same open-loop single-shot conditions, this converges faster than representative state-of-the-art visual baselines. Our analysis indicates that reasoning explicitly in latent space yields more sample-efficient learning and improved robustness to novel object and gripper geometries, environmental clutter, and sensor configurations compared to standard baselines. We identify remaining limitations and outline directions toward fully adaptive and generalizable grasping under the extreme conditions of space.
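The perception-to-action pipeline the abstract describes, fusing multiple sensing modalities into a latent state on which a SAC-style policy acts, can be sketched minimally as below. This is an illustrative skeleton under assumed details, not the paper's implementation: the modality dimensions, the linear encoder, and the `encode`/`act` functions are all hypothetical stand-ins (a real system would train the encoder and use SAC's full actor-critic updates), but the tanh-squashed Gaussian actor matches standard SAC.

```python
import numpy as np

# Hypothetical dimensions -- the paper does not specify these.
RGB_DIM, DEPTH_DIM, PROPRIO_DIM = 64, 64, 8
LATENT_DIM, ACTION_DIM = 16, 7

rng = np.random.default_rng(0)

# Stand-in for a learned multimodal encoder: a fixed random linear
# projection that fuses all modalities into one latent vector (a real
# encoder would be trained jointly with the policy).
W_enc = rng.standard_normal((LATENT_DIM, RGB_DIM + DEPTH_DIM + PROPRIO_DIM)) * 0.01

def encode(rgb, depth, proprio):
    """Fuse raw modality features into a single structured latent state."""
    x = np.concatenate([rgb, depth, proprio])
    return np.tanh(W_enc @ x)

# Stand-in for a SAC actor head operating purely on the latent state:
# a tanh-squashed Gaussian policy, as in standard SAC.
W_mu = rng.standard_normal((ACTION_DIM, LATENT_DIM)) * 0.01
log_std = np.full(ACTION_DIM, -1.0)  # state-independent log-std for brevity

def act(z, deterministic=False):
    """Sample a bounded continuous action from the latent state."""
    mu = W_mu @ z
    if deterministic:
        return np.tanh(mu)
    eps = rng.standard_normal(ACTION_DIM)
    return np.tanh(mu + np.exp(log_std) * eps)  # reparameterized, squashed

# One step of the pipeline: raw observations -> latent -> action.
rgb = rng.standard_normal(RGB_DIM)
depth = rng.standard_normal(DEPTH_DIM)
proprio = rng.standard_normal(PROPRIO_DIM)
z = encode(rgb, depth, proprio)
a = act(z)
```

Because the policy never sees raw pixels, only the fused latent `z`, the actor and critic networks stay small, which is one plausible mechanism behind the sample-efficiency gains the abstract reports.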