In this paper, we explore dense voxel streaming for accurate and efficient 3D occupancy prediction. While dense voxel representations offer fine-grained spatial detail and the streaming paradigm enables efficient temporal processing, naively combining the two introduces two key challenges: (i) warping-induced distortions caused by the interpolation used for temporal alignment, and (ii) degraded dynamic-object representations due to motion misalignment and detail loss in image-to-voxel projection. To address these challenges, we propose StreamOcc, a novel framework that uses two aggregation strategies. Specifically, it first refines propagated voxel features to reduce warping artifacts before temporal accumulation, and then selectively injects instance-level query features encoding dynamic-object semantics into the corresponding occupied voxel regions, preserving temporally consistent modeling while strengthening dynamic-object representations. By enabling effective dense voxel streaming, StreamOcc achieves state-of-the-art performance on SurroundOcc-benchmark and Occ3D-nuScenes under real-time constraints, outperforming the prior best methods by +1.3/2.5 and +1.5/2.0 mIoU (overall/dynamic object), respectively, while running at 83.3 ms per frame with only 2.8 GB of memory. The project page is available at https://moonseokha.github.io/StreamOcc/.
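
The two aggregation steps described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`refine`, `stream_step`), the fixed blending weight `alpha`, and the per-voxel instance index map `inst_ids` are all hypothetical, the refinement step is an identity placeholder standing in for a learned module, and the previous-frame features are assumed to have already been warped into the current frame.

```python
import numpy as np

def refine(warped):
    # Placeholder (assumption): a learned module would correct warping
    # artifacts here; this sketch simply returns the input unchanged.
    return warped

def stream_step(warped_prev, cur_feat, occ_mask, inst_ids, query_feats, alpha=0.5):
    """One hypothetical streaming update on a dense voxel grid.

    warped_prev : (X, Y, Z, C) previous features, already aligned to this frame
    cur_feat    : (X, Y, Z, C) features lifted from the current images
    occ_mask    : (X, Y, Z) bool, voxels considered occupied
    inst_ids    : (X, Y, Z) int, row of query_feats per voxel, -1 if none
    query_feats : (N, C) instance-level query features
    alpha       : blending weight (an assumption, not from the paper)
    """
    # Step 1: refine propagated features, then accumulate over time.
    fused = alpha * refine(warped_prev) + (1.0 - alpha) * cur_feat
    # Step 2: inject query features only into occupied voxels that
    # are assigned to an instance.
    sel = occ_mask & (inst_ids >= 0)
    fused[sel] += query_feats[inst_ids[sel]]
    return fused

# Toy example: a 2x2x1 grid with 2 channels and a single instance query.
prev = np.ones((2, 2, 1, 2))
cur = 3.0 * np.ones((2, 2, 1, 2))
occ = np.zeros((2, 2, 1), dtype=bool)
occ[0, 0, 0] = True
ids = -np.ones((2, 2, 1), dtype=int)
ids[0, 0, 0] = 0   # occupied and assigned: receives the injection
ids[1, 1, 0] = 0   # assigned but not occupied: left untouched
queries = np.array([[10.0, 20.0]])
out = stream_step(prev, cur, occ, ids, queries)
```

In this toy setup, only the voxel that is both occupied and assigned to an instance receives the query feature; every other voxel keeps the plain temporal blend.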