Vision-based occupancy networks provide an end-to-end solution for reconstructing the surrounding environment as semantically occupied voxels derived from multi-view images. This technique relies on effectively learning the correlation between pixel-level visual information and voxels. Despite recent advances, occupancy results still suffer from limited accuracy due to occlusions and sparse visual cues. To address this, we propose a Lightweight Spatio-Temporal Correlation (LEAP) method, which significantly enhances the performance of existing occupancy networks with minimal computational overhead. LEAP can be seamlessly integrated into various baseline networks, enabling plug-and-play application. LEAP operates in three stages: 1) it tokenizes recent baseline and motion features into a shared, compact latent space; 2) it establishes full correlation through a tri-stream fusion architecture; 3) it generates an occupancy result that strengthens the baseline's output. Extensive experiments demonstrate the efficiency and effectiveness of our method, which outperforms the latest baseline models. The source code and several demos are available in the supplementary material.
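The three-stage pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the softmax cross-token fusion standing in for the tri-stream architecture, and the residual decoding head are all assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): N tokens, input/latent feature dims, C classes.
N, D_IN, D_LAT, C = 16, 64, 32, 8

def tokenize(feats, proj):
    """Stage 1: project baseline/motion features into a shared, compact latent space."""
    return feats @ proj                                   # (N, D_IN) -> (N, D_LAT)

def tri_stream_fusion(cur, prev, motion):
    """Stage 2: toy stand-in for tri-stream fusion — the current stream attends
    over the concatenation of all three streams (scaled dot-product weights)."""
    streams = np.concatenate([cur, prev, motion], axis=0)  # (3N, D_LAT)
    scores = cur @ streams.T / np.sqrt(cur.shape[1])       # (N, 3N)
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    scores /= scores.sum(axis=1, keepdims=True)            # row-wise softmax
    return scores @ streams                                # (N, D_LAT)

def refine(baseline_logits, fused, head):
    """Stage 3: decode fused tokens and add them as a residual to the baseline output."""
    return baseline_logits + fused @ head                  # (N, C)

proj = rng.normal(size=(D_IN, D_LAT))
head = rng.normal(size=(D_LAT, C))
cur, prev, motion = (tokenize(rng.normal(size=(N, D_IN)), proj) for _ in range(3))
baseline_logits = rng.normal(size=(N, C))
out = refine(baseline_logits, tri_stream_fusion(cur, prev, motion), head)
print(out.shape)
```

The residual form of stage 3 reflects the plug-and-play claim: the baseline's occupancy output is preserved and only corrected by the fused spatio-temporal signal.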