Interpreting dynamic, heterogeneous multimedia commands with real-time responsiveness is critical for Human-Robot Interaction. We present VA-FastNavi-MARL, a framework that aligns asynchronous audio-visual inputs into a unified latent representation. By treating diverse instructions as a distribution of navigable goals via Meta-Reinforcement Learning, our method enables rapid adaptation to unseen directives with negligible inference overhead. Unlike approaches bottlenecked by heavy sensory processing, our modality-agnostic stream ensures seamless, low-latency control. Validation on a multi-arm workspace confirms that VA-FastNavi-MARL significantly outperforms baselines in sample efficiency and maintains robust, real-time execution even under noisy multimedia streams.