Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.