GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

Whole-body mobile manipulation requires coordinating a mobile base and a manipulator under shifting viewpoints, which poses challenges for both geometric perception and action generation. Current policies rely on either 2D features or sparse 3D representations that lack dense spatial structure, and they typically encode arm and base in a single action vector, ignoring their distinct control demands. Moreover, existing dense fusion strategies risk corrupting pretrained representations when depth is noisy, and they incur heavy computational overhead. We present GeoHAT, an end-to-end diffusion-based framework built on a simple principle: geometry should be injected only where it is reliable and attended to only where it is needed. GeoHAT employs a lightweight Fourier spatial encoder that maps dense per-pixel 3D coordinates into geometric tokens without an additional 3D vision backbone. These tokens are then selectively injected into vision foundation model features through per-token gated fusion modulated by depth validity, preserving the semantic prior while enriching spatial understanding. For action generation, a Hybrid Whole-Body Action Decoder decomposes arm and base actions into distinct subspaces and lets each action modality attend to its task-relevant visual context through sparse cross-attention, while causal temporal modeling captures intra-timestep coordination and inter-timestep dependencies. Experiments on the ManiSkill-HAB simulation benchmark demonstrate that GeoHAT achieves a 79.3% mean success rate, surpassing the strongest baseline by 23.7%. Real-world experiments on diverse tasks confirm consistent improvements over all baselines.
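To make the perception principle concrete, the following is a minimal NumPy sketch of Fourier-encoding per-pixel 3D coordinates and fusing them into visual tokens through a validity-modulated gate. It is an illustration under stated assumptions, not the GeoHAT implementation: the function names (`fourier_encode`, `gated_fusion`), the frequency schedule, the scalar sigmoid gate, the residual form of the fusion, and the random stand-in weights are all hypothetical, since the abstract does not specify them.

```python
import numpy as np

def fourier_encode(xyz, num_bands=4):
    """Map (N, 3) coordinates to (N, 3 * 2 * num_bands) sin/cos features.

    The octave-spaced frequencies are an assumed schedule.
    """
    freqs = (2.0 ** np.arange(num_bands)) * np.pi        # (B,)
    angles = xyz[:, :, None] * freqs                      # (N, 3, B)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(xyz.shape[0], -1)                # (N, 6B)

def gated_fusion(vis, geo, valid, w_gate, b_gate):
    """Add geometric tokens to visual tokens through a per-token gate.

    vis, geo: (N, D) tokens; valid: (N,) depth-validity mask in {0, 1}.
    The gate is a sigmoid of a linear map over [vis, geo], multiplied by
    the mask, so tokens with invalid depth keep their visual feature.
    """
    logits = np.concatenate([vis, geo], axis=-1) @ w_gate + b_gate  # (N,)
    gate = valid / (1.0 + np.exp(-logits))                          # (N,)
    return vis + gate[:, None] * geo

rng = np.random.default_rng(0)
num_tokens, dim, num_bands = 4, 8, 4

xyz = rng.uniform(-1.0, 1.0, size=(num_tokens, 3))   # per-token 3D coordinates
valid = np.array([1.0, 1.0, 0.0, 1.0])               # token 2 has invalid depth
xyz = np.where(valid[:, None] > 0, xyz, 0.0)         # zero out invalid coordinates

vis = rng.normal(size=(num_tokens, dim))             # stand-in for backbone features
w_proj = rng.normal(size=(3 * 2 * num_bands, dim))   # stand-in for learned projection
w_gate = rng.normal(size=(2 * dim,))                 # stand-in for learned gate weights
b_gate = 0.0

feats = fourier_encode(xyz, num_bands)               # (4, 24)
geo = feats @ w_proj                                 # (4, 8) geometric tokens
fused = gated_fusion(vis, geo, valid, w_gate, b_gate)
```

In this sketch the token with invalid depth passes through unchanged, which is one simple way to keep noisy geometry from perturbing pretrained features. A learned model would train `w_proj`, `w_gate`, and `b_gate` rather than sample them.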
