Physics-Informed Reinforcement Learning of Spatial Density Velocity Potentials for Map-Free Racing

Autonomous racing without prebuilt maps is a grand challenge for embedded robotics: it requires kinodynamic planning from instantaneous sensor data at the limits of acceleration and tire friction. To generalize Out-Of-Distribution (OOD) across racetrack configurations, end-to-end approaches use Machine Learning (ML) to encode the mapping from sensor data to vehicle actuation, with localization handled implicitly. These approaches comprise Behavioral Cloning (BC), which is capped by human reaction times, and Deep Reinforcement Learning (DRL), which needs large numbers of collisions for comprehensive training. Such training can be infeasible without simulation, yet policies trained in simulation are arduous to transfer to reality; as a result, DRL outperforms BC in simulation but exhibits actuation instability on hardware. This paper presents a DRL method that parameterizes nonlinear vehicle dynamics from the spectral distribution of depth measurements using a non-geometric, physics-informed reward, and infers time-optimal and overtaking racing controls with an Artificial Neural Network (ANN) that uses less than 1% of the computation of BC and model-based DRL. Slaloming that arises from simulation-to-reality transfer and variance-induced conservatism are both eliminated by combining a reward that is aware of physics engine exploits with the replacement of an explicit collision penalty by an implicit truncation of the value horizon. On proportionally scaled hardware, the policy outperforms human demonstrations by 12% on OOD tracks by maximizing use of the friction circle, with tire dynamics that resemble an empirical Pacejka tire model. System identification reveals a functional bifurcation: the first layer compresses spatial observations to extract digitized track features with higher resolution at corner apexes, and the second layer encodes nonlinear dynamics.
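For readers unfamiliar with the two tire concepts named in the abstract, the following is a minimal sketch of the simplified Pacejka "Magic Formula" and of friction circle utilization. It is not the paper's implementation: the function names, the coefficients B, C, D, E, and the values of mu and g are illustrative assumptions that do not come from the paper.

```python
import math


def pacejka_lateral_force(slip_angle_rad, B=10.0, C=1.9, D=1.0, E=0.97):
    """Normalized lateral force from the simplified Magic Formula.

    F = D * sin(C * atan(B*a - E * (B*a - atan(B*a))))

    B (stiffness), C (shape), D (peak) and E (curvature) are illustrative
    placeholder values, not coefficients identified in the paper.
    """
    x = B * slip_angle_rad
    return D * math.sin(C * math.atan(x - E * (x - math.atan(x))))


def friction_circle_utilization(a_long, a_lat, mu=1.0, g=9.81):
    """Fraction of available grip in use: sqrt(a_long^2 + a_lat^2) / (mu * g).

    A value near 1.0 means the combined acceleration sits on the friction
    circle; mu and g are assumed values for illustration only.
    """
    return math.hypot(a_long, a_lat) / (mu * g)


if __name__ == "__main__":
    # Sweep the slip angle to see the force rise, peak, and then fall off.
    for deg in (0, 2, 5, 10, 15, 20):
        force = pacejka_lateral_force(math.radians(deg))
        print(f"slip {deg:2d} deg -> normalized lateral force {force:.3f}")
    # Pure cornering at 9.81 m/s^2 with mu = 1 uses the full circle.
    print(friction_circle_utilization(0.0, 9.81))
```

Under these assumptions, a policy that "maximizes the friction circle" would keep the utilization close to 1.0 through braking, cornering, and acceleration, while the slip angle stays near the peak of the force curve.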
