In autonomous driving, monocular 3D lane detection is an important input to various downstream planning and control tasks. Recent CNN- and Transformer-based approaches usually adopt a two-stage design: the first stage transforms image features from the front view into a bird's-eye-view (BEV) representation, and the second stage processes the BEV feature map with a sub-network to generate the 3D detection results. However, these approaches heavily rely on a challenging image feature transformation module from the perspective view to the BEV representation. In this work, we present CurveFormer++, a single-stage Transformer-based method that requires no view transformation module and directly infers 3D lane detection results from perspective image features. Specifically, our approach models 3D lane detection as a curve propagation problem, in which each lane is represented by a curve query with a dynamic and ordered anchor point set. A Transformer decoder iteratively refines the 3D lane detection results, and a curve cross-attention module in the decoder computes similarities between image features and the curve queries of lanes. To handle lanes of varying lengths, we employ context sampling and anchor point restriction techniques to compute more relevant image features for each curve query. Furthermore, we apply a temporal fusion module that incorporates selected informative sparse curve queries and their corresponding anchor point sets to leverage historical lane information. In the experiments, we evaluate our approach on two publicly available real-world 3D lane detection datasets. The results demonstrate that our method achieves strong performance compared with both CNN-based and Transformer-based methods, and ablation studies analyze the contribution of each component of our approach.
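To make the curve-query formulation concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: each lane is a content embedding paired with an ordered set of 3D anchor points, and a decoder layer attends to perspective-view features and predicts per-anchor offsets as one step of iterative refinement. All module and parameter names (e.g., CurveQueryDecoderLayer, num_anchor_points) are illustrative assumptions.

```python
# Illustrative sketch of a curve query and one refinement step; standard
# multi-head attention stands in for the paper's curve cross-attention.
import torch
import torch.nn as nn


class CurveQueryDecoderLayer(nn.Module):
    def __init__(self, embed_dim=256, num_anchor_points=10):
        super().__init__()
        self.num_anchor_points = num_anchor_points
        # Cross-attention between curve queries and flattened image features.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        # Predicts per-anchor (x, z) offsets to refine the 3D curve.
        self.refine = nn.Linear(embed_dim, num_anchor_points * 2)

    def forward(self, queries, anchor_points, image_feats):
        # queries:       (B, Nq, C)    curve query embeddings
        # anchor_points: (B, Nq, P, 3) ordered 3D points (x, y, z) per lane
        # image_feats:   (B, HW, C)    flattened perspective-view features
        attn_out, _ = self.cross_attn(queries, image_feats, image_feats)
        queries = self.norm(queries + attn_out)
        # Iterative refinement: update x (lateral) and z (height), keep the
        # ordered y positions fixed so the anchor set stays ordered.
        offsets = self.refine(queries).view(*anchor_points.shape[:3], 2)
        refined = anchor_points.clone()
        refined[..., 0] += offsets[..., 0]
        refined[..., 2] += offsets[..., 1]
        return queries, refined


# Toy usage: 4 curve queries with 10 anchor points each on dummy features.
B, Nq, P, C = 1, 4, 10, 256
layer = CurveQueryDecoderLayer(embed_dim=C, num_anchor_points=P)
queries = torch.randn(B, Nq, C)
anchors = torch.randn(B, Nq, P, 3)
feats = torch.randn(B, 20 * 50, C)
queries, anchors = layer(queries, anchors, feats)
print(queries.shape, anchors.shape)  # (1, 4, 256) and (1, 4, 10, 3)
```

Stacking several such layers and re-feeding the refined anchor points corresponds to the iterative refinement described above; the context sampling and temporal fusion modules would further condition where image features are gathered.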