Encoding a driving scene into vector representations is an essential task for autonomous driving that can benefit downstream tasks such as trajectory prediction. A driving scene typically involves heterogeneous elements, such as different types of objects (agents, lanes, traffic signs), and the semantic relations between these objects are rich and diverse. There also exists relativity across elements: spatial relations are relative and need to be encoded in an ego-centric manner rather than in a global coordinate system. Based on these observations, we propose the Heterogeneous Driving Graph Transformer (HDGT), a backbone that models the driving scene as a heterogeneous graph with different types of nodes and edges. For heterogeneous graph construction, we connect different types of nodes according to diverse semantic relations. For spatial relation encoding, the coordinates of a node and of its in-edges are expressed in the local node-centric coordinate system. For the aggregation module in the graph neural network (GNN), we adopt the transformer structure in a hierarchical way to fit the heterogeneous nature of the inputs. Experimental results show that HDGT achieves state-of-the-art performance for the task of trajectory prediction on the INTERACTION Prediction Challenge and the Waymo Open Motion Challenge.
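The node-centric spatial encoding described in the abstract can be illustrated with a minimal sketch. It assumes each node carries a 2D position and a heading angle in radians, which the abstract does not specify; the function names `to_local_frame` and `relative_pose` are hypothetical and this is not the authors' implementation.

```python
import math


def to_local_frame(point, origin, heading):
    """Express a global 2D point in the frame centred at `origin`
    whose x-axis points along `heading` (radians). Assumed convention."""
    dx = point[0] - origin[0]
    dy = point[1] - origin[1]
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    # Translate to the origin, then rotate by -heading.
    return (cos_h * dx + sin_h * dy, -sin_h * dx + cos_h * dy)


def relative_pose(src_pos, src_heading, dst_pos, dst_heading):
    """Pose of a source node as seen from the destination node.
    This is the kind of feature an in-edge (src -> dst) could carry."""
    x, y = to_local_frame(src_pos, dst_pos, dst_heading)
    # Wrap the heading difference into [-pi, pi).
    dtheta = (src_heading - dst_heading + math.pi) % (2 * math.pi) - math.pi
    return (x, y, dtheta)


# A neighbour one metre north of a node that is itself facing north
# lies straight ahead of that node in its local frame.
edge_feature = relative_pose((1.0, 2.0), math.pi, (1.0, 1.0), math.pi / 2)
```

Because every in-edge is expressed relative to its destination node, the resulting features are unchanged when the whole scene is translated or rotated in the global frame.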