PreGSU: A Generalized Traffic Scene Understanding Model for Autonomous Driving Based on Pre-trained Graph Attention Network

Scene understanding, defined as the learning, extraction, and representation of interactions among traffic elements, is one of the critical challenges on the path to high-level autonomous driving (AD). Current scene understanding methods mainly focus on a single concrete task, such as trajectory prediction or risk level evaluation. Although they perform well on task-specific metrics, their generalization ability is insufficient to cope with the complexity of real traffic and the diversity of downstream demands. In this study, we propose PreGSU, a generalized pre-trained scene understanding model based on a graph attention network, which learns the universal interaction and reasoning of traffic scenes to support various downstream tasks. After the feature engineering and sub-graph module, all elements are embedded as nodes to form a dynamic weighted graph. Then, four graph attention layers are applied to learn the relationships among agents and lanes. In the pre-training phase, the understanding model is trained on two self-supervised tasks: Virtual Interaction Force (VIF) modeling and Masked Road Modeling (MRM). Based on artificial potential field theory, VIF modeling enables PreGSU to capture agent-to-agent interactions, while MRM extracts agent-to-road connections. In the fine-tuning phase, the pre-trained parameters are loaded to derive detailed understanding outputs. We conduct validation experiments on two downstream tasks, trajectory prediction in an urban scenario and intention recognition in a highway scenario, to verify the model's generalization and understanding abilities. Results show that, compared with the baselines, PreGSU achieves better accuracy on both tasks, indicating its potential to generalize to various scenes and targets. An ablation study shows the effectiveness of the pre-training task design.
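To make the two self-supervised tasks more concrete, the following is a minimal sketch of how pre-training targets of this kind could be constructed. It is not the PreGSU implementation: the abstract does not give the VIF formula, so the sketch substitutes the classic repulsive term from artificial potential field theory, and the function names (`repulsive_force`, `vif_targets`, `mask_lane_nodes`), the parameters `eta` and `d0`, and the 15% default mask ratio are all illustrative assumptions.

```python
import math
import random


def repulsive_force(ego, other, eta=1.0, d0=10.0):
    """Classic potential-field repulsion exerted on `ego` by `other`.

    Assumed form (not taken from the paper): the negative gradient of
    U = 0.5 * eta * (1/d - 1/d0)**2 for d < d0, and zero for d >= d0.
    The returned 2D vector points from `other` toward `ego`.
    """
    dx, dy = ego[0] - other[0], ego[1] - other[1]
    d = math.hypot(dx, dy)
    # Direction is undefined at d == 0; this sketch simply returns zero there.
    if d == 0.0 or d >= d0:
        return (0.0, 0.0)
    magnitude = eta * (1.0 / d - 1.0 / d0) / (d * d)
    return (magnitude * dx / d, magnitude * dy / d)


def vif_targets(positions, eta=1.0, d0=10.0):
    """Sum pairwise repulsion per agent (hypothetical VIF-style regression target)."""
    targets = []
    for i, p in enumerate(positions):
        fx = fy = 0.0
        for j, q in enumerate(positions):
            if i != j:
                f = repulsive_force(p, q, eta, d0)
                fx += f[0]
                fy += f[1]
        targets.append((fx, fy))
    return targets


def mask_lane_nodes(lane_features, mask_ratio=0.15, seed=0):
    """Zero out a random subset of lane-node features (hypothetical MRM-style input).

    Returns a masked copy of the features and the sorted masked indices;
    the original features at those indices would serve as reconstruction targets.
    """
    rng = random.Random(seed)
    n = len(lane_features)
    k = max(1, round(n * mask_ratio)) if n else 0
    masked_idx = sorted(rng.sample(range(n), k))
    masked = [list(f) for f in lane_features]
    for i in masked_idx:
        masked[i] = [0.0] * len(masked[i])
    return masked, masked_idx
```

In a setup like this, a model would be trained to regress the per-agent force vectors from the graph embedding and to reconstruct the masked lane features, so neither task needs manual labels.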
