Realizing dexterous embodied manipulation requires the deep integration of heterogeneous multimodal sensory inputs, yet current vision-centric paradigms often overlook the force and geometric feedback essential for complex tasks. This paper presents DeMUSE, a Deep Multimodal Unified Sparse Experts framework that leverages a Diffusion Transformer to integrate RGB, depth, and 6-axis force signals into a unified serialized stream. Adaptive Modality-specific Normalization (AdaMN) recalibrates modality-aware features, mitigating representation imbalance and harmonizing the heterogeneous distributions of multi-sensory signals. For efficient scaling, the architecture employs a sparse Mixture-of-Experts (MoE) with shared experts, increasing model capacity for physical priors while maintaining the low inference latency required for real-time control. A joint denoising objective synchronously synthesizes environmental evolution and action sequences to ensure physical consistency. DeMUSE achieves success rates of 83.2% in simulation and 72.5% in real-world trials, demonstrating state-of-the-art performance and validating the necessity of deep multi-sensory integration for complex physical interactions.
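The abstract does not detail the routing scheme, but the "sparse MoE with shared experts" pattern it names typically combines a small number of top-k routed experts with an expert that processes every token. The following is a minimal NumPy sketch of one such forward pass under that assumption; all function and parameter names (`sparse_moe_shared`, `shared_W`, `expert_Ws`, `gate_W`) are hypothetical, not from the paper.

```python
import numpy as np

def sparse_moe_shared(x, shared_W, expert_Ws, gate_W, top_k=2):
    """Illustrative sparse-MoE forward pass with an always-active shared expert.

    x:         (d,) input token
    shared_W:  (d, d) shared expert, applied to every token
    expert_Ws: list of (d, d) routed experts, only top_k are evaluated
    gate_W:    (d, n_experts) router weights
    """
    logits = x @ gate_W
    top = np.argsort(logits)[-top_k:]            # indices of the top-k experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                          # softmax over selected experts only
    routed = sum(p * (x @ expert_Ws[i]) for p, i in zip(probs, top))
    return x @ shared_W + routed                  # shared expert always contributes
```

Because only `top_k` of the routed experts run per token, capacity grows with the expert count while per-token compute (and thus inference latency) stays roughly constant, which is the scaling property the abstract attributes to the design.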