Sparse algorithms offer great flexibility for multi-view temporal perception tasks. In this paper, we present an enhanced version of Sparse4D in which we improve the temporal fusion module by implementing a recursive form of multi-frame feature sampling. By decoupling image features from structured anchor features, Sparse4D transforms temporal features efficiently, so that temporal fusion requires only the frame-by-frame transmission of sparse features. This recurrent temporal fusion approach provides two main benefits. First, it reduces the computational complexity of temporal fusion from $O(T)$ to $O(1)$, which significantly improves inference speed and memory usage. Second, it enables the fusion of long-term information, so that temporal fusion yields more pronounced performance gains. Our proposed approach, Sparse4Dv2, further improves the performance of the sparse perception algorithm and achieves state-of-the-art results on the nuScenes 3D detection benchmark. Code will be available at \url{https://github.com/linxuewu/Sparse4D}.
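The complexity claim can be illustrated with a toy cost model. The sketch below is not the Sparse4D implementation: `sample_features`, `MultiFrameFusion`, `RecurrentFusion`, and `run_demo` are hypothetical names, frames and anchors are plain numbers, and an exponential moving average stands in for the learned fusion. It assumes $T$ denotes the number of fused frames and only counts how many feature-sampling operations each scheme performs per step.

```python
from collections import deque


def sample_features(anchor, frame):
    # Hypothetical stand-in for sampling image features at an anchor.
    return 0.1 * anchor + frame


class MultiFrameFusion:
    """Non-recurrent baseline: re-samples the last `window` frames every step."""

    def __init__(self, window):
        self.frames = deque(maxlen=window)  # keeps up to T raw frames in memory
        self.samples = 0                    # running count of sampling operations

    def step(self, frame, anchors):
        self.frames.append(frame)
        fused = []
        for anchor in anchors:
            feats = [sample_features(anchor, f) for f in self.frames]
            self.samples += len(feats)      # O(T) samples per anchor per step
            fused.append(sum(feats) / len(feats))
        return fused


class RecurrentFusion:
    """Recurrent variant: samples the current frame only and carries sparse state."""

    def __init__(self, momentum=0.5):
        self.momentum = momentum            # assumed blending weight, not from the paper
        self.state = None                   # sparse per-anchor features from the past
        self.samples = 0

    def step(self, frame, anchors):
        current = [sample_features(anchor, frame) for anchor in anchors]
        self.samples += len(current)        # O(1) samples per anchor per step
        if self.state is None:
            self.state = current
        else:
            m = self.momentum
            self.state = [m * s + (1 - m) * c
                          for s, c in zip(self.state, current)]
        return self.state


def run_demo(num_frames=10, window=8, anchors=(0, 1, 2, 3)):
    multi = MultiFrameFusion(window)
    recurrent = RecurrentFusion()
    for frame in range(num_frames):
        multi.step(float(frame), anchors)
        recurrent.step(float(frame), anchors)
    return multi.samples, recurrent.samples
```

With the defaults (10 frames, a window of 8, 4 anchors), `run_demo()` returns `(208, 40)`: once the window is full, the baseline performs 32 sampling operations per step against 4 for the recurrent variant, and the recurrent state still carries information from frames older than the window.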