Mixture-of-Experts (MoE) approaches have recently gained traction in robotics because they dynamically allocate computational resources and specialize sub-networks for distinct tasks or environmental contexts, enabling more efficient decision-making. Such systems typically combine sparsely activated experts within a single monolithic architecture governed by a learned internal routing mechanism; this design precludes selective customization of individual low-level experts or the router and demands additional training. We propose MoIRA, an architecture-agnostic, modular MoE framework that coordinates existing experts through an external text-based router. MoIRA offers two zero-shot routing options: embedding-based similarity matching and prompt-driven language-model inference. In our experiments, we use the large Vision-Language-Action models GR00T-N1 and $\pi_0$ as the underlying experts and train low-rank adapters for low-overhead inference. We evaluate MoIRA on a suite of GR1 Humanoid tasks and on the LIBERO Spatial and Goal benchmarks, where it consistently outperforms generalist models and is competitive with other MoE pipelines. We also analyse the robustness of the proposed approach to variations in task instructions. Relying solely on textual descriptions of tasks and experts, MoIRA demonstrates the practical viability of modular deployment with precise, low-effort routing and offers a scalable alternative foundation for future multi-expert robotic systems.
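The embedding-based routing option can be illustrated with a minimal sketch: embed the incoming task instruction and each expert's textual description, then dispatch to the expert with the highest cosine similarity, requiring no router training. The expert names, descriptions, and the toy bag-of-words embedding below are illustrative assumptions; MoIRA would use a pretrained text encoder instead.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a pretrained text encoder: lowercase
    # bag-of-words counts (assumption, chosen for self-containment).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(instruction: str, experts: dict[str, str]) -> str:
    # Zero-shot routing: pick the expert whose description is most
    # similar to the task instruction; no additional training needed.
    return max(
        experts,
        key=lambda name: cosine(embed(instruction), embed(experts[name])),
    )

# Hypothetical expert registry with textual descriptions.
experts = {
    "pick_place": "pick up objects and place them at target locations",
    "pouring": "pour liquid or granular material between containers",
}
print(route("pick the red cup and place it on the shelf", experts))
# → pick_place
```

Swapping the toy embedding for a sentence encoder leaves the routing logic unchanged, which is what makes the router external and architecture-agnostic.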