To ensure safe autonomous driving in urban environments with complex vehicle-pedestrian interactions, Autonomous Vehicles (AVs) must be able to predict pedestrians' short-term and immediate actions in real time. In recent years, various methods have been developed to estimate pedestrian behaviors in autonomous driving scenarios, but clear definitions of pedestrian behaviors are still lacking. In this work, the gaps in the literature are investigated and a taxonomy is presented for pedestrian behavior characterization. Further, a novel multi-task sequence-to-sequence Transformer encoder-decoder (TF-ed) architecture is proposed for pedestrian action and trajectory prediction using only ego-vehicle camera observations as inputs. The proposed approach is compared against an existing LSTM encoder-decoder (LSTM-ed) architecture for action and trajectory prediction. The performance of both models is evaluated on the publicly available Joint Attention in Autonomous Driving (JAAD) dataset, CARLA simulation data, and real-time self-driving shuttle data collected on a university campus. Evaluation results show that the proposed method reaches an accuracy of 81% on the action prediction task on JAAD testing data and outperforms the LSTM-ed by 7.4%, while the LSTM-ed counterpart performs much better on the trajectory prediction task for a prediction sequence length of 25 frames.