Scene perception, understanding, and simulation are fundamental techniques for embodied-AI agents, yet existing solutions remain prone to segmentation deficiencies, interference from dynamic objects, sensor-data sparsity, and limited viewpoints. This paper proposes a novel framework, named SPORTS, for holistic scene understanding that tightly integrates Video Panoptic Segmentation (VPS), Visual Odometry (VO), and Scene Rendering (SR) into an iterative, unified pipeline. First, VPS employs an adaptive attention-based geometric fusion mechanism that aligns cross-frame features by incorporating pose, depth, and optical-flow modalities and automatically adjusts the feature maps at different decoding stages; a post-matching strategy is further integrated to improve identity tracking. In VO, the panoptic segmentation results from VPS are combined with the optical-flow map to refine the confidence estimation of dynamic objects, which improves the accuracy of camera pose estimation and the completeness of depth-map generation under a learning-based paradigm. Furthermore, the point-based rendering in SR benefits from VO, transforming sparse point clouds into neural fields to synthesize high-fidelity RGB views and their twin panoptic views. Extensive experiments on three public datasets demonstrate that our attention-based feature fusion outperforms most existing state-of-the-art methods on odometry, tracking, segmentation, and novel-view synthesis tasks.
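To make the attention-based geometric fusion concrete, the following is a minimal sketch (not the authors' released code) of how image features could attend to geometry features encoded from depth, optical flow, and pose at a single decoding stage. All module names, tensor shapes, and hyperparameters below are illustrative assumptions.

```python
# Hypothetical sketch of cross-modal geometric fusion: image tokens (queries)
# attend to geometry tokens (keys/values) derived from depth / flow / pose,
# so the decoder can re-weight appearance features with geometric cues.
import torch
import torch.nn as nn


class GeometricFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat: torch.Tensor, geo_feat: torch.Tensor) -> torch.Tensor:
        # img_feat, geo_feat: (B, C, H, W) feature maps at one decoder stage.
        b, c, h, w = img_feat.shape
        q = img_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) image tokens as queries
        kv = geo_feat.flatten(2).transpose(1, 2)  # (B, H*W, C) geometry tokens as keys/values
        fused, _ = self.attn(q, kv, kv)           # geometry-guided re-weighting of image tokens
        fused = self.norm(q + fused)              # residual keeps the original appearance cues
        return fused.transpose(1, 2).reshape(b, c, h, w)


# Usage: fuse geometry-conditioned features with RGB features at a coarse decoder stage.
fusion = GeometricFusion(dim=256)
img = torch.randn(2, 256, 32, 64)   # appearance features
geo = torch.randn(2, 256, 32, 64)   # features encoded from depth / flow / pose
out = fusion(img, geo)              # (2, 256, 32, 64)
```

In this sketch the fusion is applied independently per decoding stage, matching the abstract's description of stage-wise adjustment of feature maps; the exact token layout and residual design in SPORTS may differ.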