Autonomous driving without high-definition (HD) maps demands a higher level of active scene understanding. In this competition, the organizers provided multi-view camera images and standard-definition (SD) maps to explore the limits of scene reasoning. Most existing algorithms construct bird's-eye-view (BEV) features from the multi-view images and use multi-task heads to predict road centerlines, boundary lines, pedestrian crossings, and other areas. However, these algorithms perform poorly at the far end of roads and struggle when the primary subject in the image is occluded. To address this, we use SD maps as an additional input alongside the multi-view images. We pre-train the map encoder to strengthen the network's geometric encoding and use YOLOX to improve traffic element detection precision. For area detection, we introduce LDTR and auxiliary tasks to achieve higher precision. Our final OLUS score is 0.58.