End-to-end autonomous parking has emerged as a critical task in autonomous driving. However, existing methods behave as black boxes: they lack high-level semantic understanding and interpretability, which impedes seamless long-distance autonomous parking from the road to the target spot. To address these limitations, we propose ParkingTransformer, a novel framework that leverages multi-view perception and the scene understanding capability of Large Language Models (LLMs). By combining trajectory queries with the LLM's implicit state features, our method interacts directly with historical information and raw sensor data to output planning trajectories, eliminating the need for dense Bird's-Eye-View (BEV) representations. To compensate for the limited spatial reasoning ability of LLMs, we introduce 3D positional encoding to explicitly inject spatial geometric awareness. Furthermore, a fixed-window streaming mechanism is designed for processing historical information, significantly improving long-term temporal processing efficiency and inference speed. Additionally, a coarse-to-fine decoding strategy progressively refines trajectory precision. Extensive closed-loop experiments are conducted in the CARLA simulator and on real-world vehicle platforms. The results show that our method achieves a driving score of 61.32 in the CARLA simulator and an average success rate of 88.70% in real-world experiments, validating the feasibility and effectiveness of the proposed method.
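As a rough illustration of the fixed-window streaming idea mentioned in the abstract, the sketch below keeps only the most recent frames of history and discards older ones. The abstract does not specify the window size, the feature shape, or the eviction policy, so the class name `FixedWindowMemory`, the window of 3, the scalar per-frame features, and the first-in-first-out eviction are all assumptions made for illustration rather than the authors' design.

```python
from collections import deque
from typing import Deque, List, Sequence


class FixedWindowMemory:
    """Hypothetical helper: keeps only the most recent `window` frame features."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # deque(maxlen=...) drops the oldest entry automatically once full
        self._frames: Deque[List[float]] = deque(maxlen=window)

    def push(self, frame_feature: Sequence[float]) -> None:
        # Store a copy so later changes by the caller do not alter history
        self._frames.append(list(frame_feature))

    def context(self) -> List[List[float]]:
        # History ordered oldest to newest, bounded by the window size
        return list(self._frames)


# Assumed toy input: one scalar feature per time step
memory = FixedWindowMemory(window=3)
for t in range(5):
    memory.push([float(t)])

history = memory.context()
print(history)  # [[2.0], [3.0], [4.0]]
```

Because the history is bounded, the cost of attending to it stays constant as the episode grows, which is one plausible reason a fixed window would help inference speed on long parking sequences.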