This work introduces a flexible architecture for real-time occupancy forecasting. In contrast to existing, more computationally expensive architectures, the proposed model exploits recursive latent state estimation, using learned transformer-based prediction and update modules. This allows for highly efficient real-time inference on an embedded system (profiled on an Nvidia Xavier AGX) and for the inclusion of information from a diverse set of sensors. The architecture can process sparse and occluded observations of agent positions and scene context as they become available, and does not require motion tracklet inputs. \networkName{} accomplishes this by encoding the scene into a latent state that evolves in time through self-attention and is updated through cross-attention with contextual information such as traffic signals, road topology or agent detections. Occupancy predictions are made by sparsely querying positions of interest rather than by generating a fixed-size raster image. This allows for variable-resolution occupancy prediction or local querying by downstream trajectory optimisation algorithms, saving computational effort.
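The recursive predict/update/query loop described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head with no learned projections, normalisation or feed-forward layers, and every name and size in it (`D`, `W_pos`, `w_out`, `predict`, `update`, `query_occupancy`) is hypothetical, with random untrained weights standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # latent feature width (hypothetical)

def attention(queries, keys, values):
    """Single-head scaled dot-product attention."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values

def predict(latent):
    """Prediction step: the latent state evolves in time via self-attention."""
    return latent + attention(latent, latent, latent)

def update(latent, observations):
    """Update step: latent tokens cross-attend to whatever was observed."""
    if len(observations) == 0:  # nothing observed at this time step
        return latent
    return latent + attention(latent, observations, observations)

# Hypothetical stand-ins for learned parameters (random, untrained).
W_pos = rng.normal(size=(2, D)) * 0.1  # embeds (x, y) query positions
w_out = rng.normal(size=D) * 0.1       # linear occupancy readout

def query_occupancy(latent, positions):
    """Decode occupancy probabilities only at the requested (x, y) positions."""
    queries = positions @ W_pos
    features = attention(queries, latent, latent)
    logits = features @ w_out
    return 1.0 / (1.0 + np.exp(-logits))

latent = rng.normal(size=(8, D))  # 8 latent tokens (hypothetical)
for t in range(3):
    latent = predict(latent)
    n_obs = int(rng.integers(0, 4))    # sparse input: possibly no observations
    obs = rng.normal(size=(n_obs, D))  # already-embedded observation tokens
    latent = update(latent, obs)

# Query three positions of interest instead of rendering a full raster.
positions = np.array([[0.0, 0.0], [5.0, 1.5], [10.0, -2.0]])
probs = query_occupancy(latent, positions)
```

In this sketch the decoding cost scales with the number of queried positions rather than with a fixed grid size, which is the property the sparse-query formulation relies on.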