Stranger Danger! Identifying and Avoiding Unpredictable Pedestrians in RL-based Social Robot Navigation

Reinforcement learning (RL) methods for social robot navigation show great success navigating robots through large crowds of people, but the performance of these learning-based methods tends to degrade in particularly challenging or unfamiliar situations due to the models' dependency on representative training data. To ensure human safety and comfort, it is critical that these algorithms handle uncommon cases appropriately, but the low frequency and wide diversity of such situations present a significant challenge for these data-driven methods. To overcome this challenge, we propose modifications to the learning process that encourage these RL policies to maintain additional caution in unfamiliar situations. Specifically, we improve the Socially Attentive Reinforcement Learning (SARL) policy by (1) modifying the training process to systematically introduce deviations into a pedestrian model, (2) updating the value network to estimate and utilize pedestrian-unpredictability features, and (3) implementing a reward function to learn an effective response to pedestrian unpredictability. Compared to the original SARL policy, our modified policy maintains similar navigation times and path lengths, while reducing the number of collisions by 82% and reducing the proportion of time spent in the pedestrians' personal space by up to 19 percentage points for the most difficult cases. We also describe how to apply these modifications to other RL policies and demonstrate that some key high-level behaviors of our approach transfer to a physical robot.
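
As a rough illustration of modifications (1) and (3), the sketch below perturbs a pedestrian model's velocity by an amount tied to an unpredictability level, and widens the distance penalized by the reward as that level grows. It is a minimal sketch under stated assumptions, not the implementation evaluated here: the function names, the rotation-based deviation, the [0, 1] unpredictability scale, and all numeric constants are hypothetical. The value-network change in (2) is not shown.

```python
import math
import random


def perturb_velocity(model_velocity, unpredictability, rng, max_angle=math.pi / 2):
    """Rotate the velocity given by a pedestrian model by a random angle.

    Hypothetical deviation scheme: `unpredictability` in [0, 1] scales the
    largest possible heading deviation (`max_angle`, an assumed constant).
    Speed is preserved; only the heading changes.
    """
    vx, vy = model_velocity
    angle = rng.uniform(-1.0, 1.0) * unpredictability * max_angle
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (vx * cos_a - vy * sin_a, vx * sin_a + vy * cos_a)


def caution_penalty(min_distance, unpredictability,
                    base_margin=0.2, extra_margin=0.3, penalty_scale=0.5):
    """Return a non-positive reward term for proximity to a pedestrian.

    Hypothetical shaping: the penalized margin grows linearly with the
    pedestrian's unpredictability, so the robot is pushed to keep more
    distance from pedestrians that deviate more from the model.
    All constants are illustrative assumptions.
    """
    margin = base_margin + extra_margin * unpredictability
    if min_distance < margin:
        return -penalty_scale * (margin - min_distance)
    return 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # A predictable pedestrian follows the model; an unpredictable one deviates.
    print(perturb_velocity((1.0, 0.0), 0.0, rng))
    print(perturb_velocity((1.0, 0.0), 1.0, rng))
    # The same 0.3 m gap is acceptable near a predictable pedestrian only.
    print(caution_penalty(0.3, 0.0), caution_penalty(0.3, 1.0))
```

With these assumed constants, a 0.3 m gap incurs no penalty next to a fully predictable pedestrian but is penalized next to a fully unpredictable one, which is the kind of additional caution the modified reward is meant to encourage.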
