Understanding and predicting pedestrian crossing intention is crucial for the driving safety of autonomous vehicles. However, extracting the many relevant factors from raw images or environmental context masks for time-series network modeling is challenging, and often introduces pre-processing errors or efficiency losses. In particular, pedestrian positions captured by onboard cameras are distorted by perspective and do not accurately reflect actual movement. To address these issues, GTransPDM, a Graph-embedded Transformer with a Position Decoupling Module, was developed for pedestrian crossing intention prediction by leveraging multi-modal features. First, a position decoupling module was proposed to decompose pedestrian lateral movement and simulate depth variation in the image view. Then, a graph-embedded Transformer was designed to capture the spatio-temporal dynamics of human pose skeletons, integrating essential factors such as position, skeleton, and ego-vehicle motion. Experimental results show that the proposed method achieves 92% accuracy on the PIE dataset and 87% accuracy on the JAAD dataset, with a processing speed of 0.05 ms, outperforming the state of the art.
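The idea behind position decoupling can be illustrated with a minimal sketch. This is not the paper's actual module, only an assumption of one plausible reading: from a sequence of bounding boxes tracked by the onboard camera, the horizontal center displacement (normalized by apparent box height to compensate for distance) serves as a lateral-movement cue, while the relative change in apparent height serves as a depth-variation proxy. The function name and box format are hypothetical.

```python
import numpy as np

def decouple_position(boxes):
    """Decompose per-frame bounding-box motion into a lateral component
    and a depth-change proxy (hypothetical sketch, not the paper's module).

    boxes: (T, 4) array of (cx, cy, w, h) in pixels, one row per frame.
    Returns (lateral, depth), each of length T - 1.
    """
    boxes = np.asarray(boxes, dtype=float)
    cx, h = boxes[:, 0], boxes[:, 3]
    # Lateral cue: horizontal center displacement, normalized by box height
    # so a similar physical step yields a similar value near and far.
    lateral = np.diff(cx) / h[:-1]
    # Depth proxy: relative growth of apparent height, which increases as
    # the pedestrian approaches the camera.
    depth = np.diff(h) / h[:-1]
    return lateral, depth

# Toy track: pedestrian drifting right at constant apparent size.
track = [(100.0, 200.0, 40.0, 100.0),
         (110.0, 200.0, 40.0, 100.0),
         (120.0, 200.0, 40.0, 100.0)]
lateral, depth = decouple_position(track)
```

In this toy track the lateral cue is a constant 0.1 per frame and the depth proxy is zero, i.e. pure lateral motion with no approach toward the camera; these two signals could then be fed, alongside pose and ego-motion features, into a temporal model.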