Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception

In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to build a comprehensive understanding of the scene, enabling the robot to assist human agents appropriately. Running every perception module on every frame improves perception quality in offline settings, but in streaming perception it accumulates latency and substantially degrades system performance. Recent work on scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still suffer from information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity of HRC events, we propose a novel lightweight perception scheduling framework that uses the outputs of previous frames to estimate which perception modules are necessary and to schedule them in real time based on scene context. Experimental results show that the proposed framework reduces computational latency by up to 27.52% compared with conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. The framework also achieves keyframe accuracy of up to 98%. These results validate the framework's ability to improve real-time perception efficiency without significantly compromising accuracy, and indicate its potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.
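The scheduling idea summarized above, using the previous frame's outputs to decide which modules to run on the current frame, can be illustrated with a minimal sketch. This is not the paper's implementation: the `Module` class, the `is_relevant` predicates, the `always_on` flag, and the stub "detector" and "pose" modules are all hypothetical names introduced for illustration, and the relevance rule used here (run pose estimation only if a person was seen in the previous frame) is an assumed stand-in for the framework's actual relevance estimation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Module:
    """A perception module plus a relevance test (hypothetical structure)."""
    name: str
    run: Callable[[dict], dict]          # frame -> module output
    is_relevant: Callable[[dict], bool]  # previous-frame outputs -> run now?
    always_on: bool = False              # cheap modules that run every frame


def schedule(modules: List[Module], prev_outputs: Dict[str, dict]) -> List[str]:
    """Select the modules to run, based only on the previous frame's outputs."""
    return [m.name for m in modules
            if m.always_on or m.is_relevant(prev_outputs)]


def process_stream(frames: List[dict], modules: List[Module]) -> List[List[str]]:
    """Run the stream and return, per frame, the names of the activated modules."""
    by_name = {m.name: m for m in modules}
    prev_outputs: Dict[str, dict] = {}
    activations = []
    for frame in frames:
        active = schedule(modules, prev_outputs)
        # Only the scheduled modules are executed on this frame.
        prev_outputs = {name: by_name[name].run(frame) for name in active}
        activations.append(active)
    return activations


# Stub modules (assumptions): a lightweight detector that always runs, and a
# heavier pose estimator that runs only if a person was seen one frame earlier.
detector = Module(
    name="detector",
    run=lambda frame: {"person": frame["person"]},
    is_relevant=lambda prev: True,
    always_on=True,
)
pose = Module(
    name="pose",
    run=lambda frame: {"keypoints": []},
    is_relevant=lambda prev: prev.get("detector", {}).get("person", False),
)

# Toy frames: each frame only records whether a person is visible.
frames = [{"person": False}, {"person": True}, {"person": True}, {"person": False}]
activations = process_stream(frames, [detector, pose])
```

In this toy run the pose module is skipped on the first two frames and activated on the last two, which shows both the saving (fewer module executions) and the cost of relying on the previous frame (activation lags the scene by one frame).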
