This work introduces a novel and adaptable architecture for real-time occupancy forecasting that outperforms existing state-of-the-art models on the Waymo Open Motion Dataset in Soft IoU. The proposed model uses recursive latent state estimation, with learned transformer-based functions that update and evolve the state. This enables highly efficient real-time inference on embedded systems, as profiled on an Nvidia Xavier AGX. Our model, MotionPerceiver, encodes a scene into a latent state that evolves in time through self-attention, and incorporates relevant scene observations, such as traffic signals, road topology and agent detections, through cross-attention. This forms an efficient data-streaming architecture, in contrast to the expensive fixed-sequence input common in existing models. The architecture also offers the distinct advantage of generating occupancy predictions through localized querying at points of interest, as opposed to generating fixed-size occupancy images that render potentially irrelevant regions.
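The recursive update described above can be illustrated with a minimal sketch. This is not the MotionPerceiver implementation: it assumes single-head dot-product attention with no learned projections, normalization or feed-forward layers, and every name in it (`attend`, `evolve`, `observe`, `query_occupancy`, `random_tokens`) is hypothetical. It only shows the control flow: a fixed-size latent state advances by self-attention, absorbs observations by cross-attention when they arrive, and is decoded only at the queried points.

```python
import math
import random


def attend(queries, keys_values):
    """Single-head scaled dot-product attention (assumption: no learned projections)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * kv[i] for w, kv in zip(weights, keys_values))
            for i in range(dim)
        ])
    return out


def residual(state, update):
    """Element-wise residual connection: state + update."""
    return [[s + u for s, u in zip(s_row, u_row)] for s_row, u_row in zip(state, update)]


def evolve(latent):
    """Advance the latent state one time step via self-attention."""
    return residual(latent, attend(latent, latent))


def observe(latent, tokens):
    """Fold observation tokens into the latent state via cross-attention."""
    return residual(latent, attend(latent, tokens))


def query_occupancy(latent, point_embeddings):
    """Decode an occupancy probability only at the requested points."""
    decoded = attend(point_embeddings, latent)
    # Assumption: a sigmoid over the mean feature stands in for a learned output head.
    return [1.0 / (1.0 + math.exp(-sum(row) / len(row))) for row in decoded]


def random_tokens(rng, count, dim):
    """Placeholder for encoded agents, signals, road points or query points."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8
latent = random_tokens(rng, 4, DIM)

# Observations arrive as a stream; None marks a step with no new measurement.
stream = [random_tokens(rng, 3, DIM), None, random_tokens(rng, 5, DIM)]
for tokens in stream:
    latent = evolve(latent)
    if tokens is not None:
        latent = observe(latent, tokens)

points = random_tokens(rng, 2, DIM)  # stand-in embeddings for two points of interest
occupancy = query_occupancy(latent, points)
```

In this sketch the latent state keeps the same shape however many observation tokens arrive at each step, and the decoding cost scales with the number of queried points rather than with a fixed image size.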