Recent advances in Bird's Eye View (BEV) fusion for map construction have demonstrated remarkable mapping of urban environments. However, their deep and bulky architectures incur substantial backpropagation memory and computing latency. This poses an unavoidable bottleneck in constructing high-resolution (HR) BEV maps, as their large features significantly increase costs such as GPU memory consumption and computing latency; we call this the diverging training costs issue. As a result, most existing methods adopt low-resolution (LR) BEV and struggle to estimate the precise locations of urban scene components such as road lanes and sidewalks. Because this imprecision leads to risky self-driving, the diverging training costs issue has to be resolved. In this paper, we address the issue with our novel Trumpet Neural Network (TNN) mechanism. The framework operates in LR BEV space and outputs an up-sampled semantic BEV map, creating a memory-efficient pipeline. To this end, we introduce Local Restoration of the BEV representation. Specifically, the up-sampled BEV representation suffers from severely aliased, blocky signals and thick semantic labels. Our proposed Local Restoration restores the signals and thins (i.e., narrows) the labels. Our extensive experiments show that the TNN mechanism provides a plug-and-play, memory-efficient pipeline, thereby enabling effective estimation of real-sized (i.e., precise) semantic labels for BEV map construction.
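To make the trade-off concrete, the following toy sketch illustrates two points from the abstract: the number of BEV grid cells grows quadratically with resolution, and naively up-sampling an LR label map produces thick, blocky labels. It is not the TNN mechanism or the Local Restoration module. The 4x4 mask, the scale factor of 4, and the helper names `upsample_nearest` and `label_width` are all assumptions chosen for illustration, and nearest-neighbour replication stands in for whatever up-sampling the method actually uses.

```python
def upsample_nearest(grid, scale):
    """Nearest-neighbour up-sampling: repeat every cell `scale` times along both axes."""
    out = []
    for row in grid:
        wide_row = [cell for cell in row for _ in range(scale)]
        out.extend(list(wide_row) for _ in range(scale))
    return out


def label_width(grid, row_index=0):
    """Count labelled cells in one row, i.e. the lane's width measured in cells."""
    return sum(grid[row_index])


# Hypothetical 4x4 low-resolution BEV mask with a one-cell-wide vertical lane.
lr_mask = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]

scale = 4  # assumed up-sampling factor, for illustration only
hr_mask = upsample_nearest(lr_mask, scale)

lr_cells = len(lr_mask) * len(lr_mask[0])
hr_cells = len(hr_mask) * len(hr_mask[0])

print("LR grid cells:", lr_cells)
print("HR grid cells:", hr_cells)
print("lane width in LR cells:", label_width(lr_mask))
print("lane width in HR cells:", label_width(hr_mask))
```

In this toy setting the HR grid holds `scale ** 2` times as many cells as the LR grid, which is why computing features directly at HR is costly, while the up-sampled lane is `scale` cells wide even though a real lane marking may cover far fewer HR cells. Narrowing such labels back to their real width is the role the abstract assigns to Local Restoration.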