Streaming perception is a critical task in autonomous driving that requires balancing the latency and accuracy of the autopilot system. However, current streaming perception methods are limited because they rely only on the current and adjacent two frames to learn movement patterns. This restricts their ability to model complex scenes and often leads to poor detection results. To address this limitation, we propose LongShortNet, a novel dual-path network that captures long-term temporal motion and integrates it with short-term spatial semantics for real-time perception. LongShortNet is the first work to extend long-term temporal modeling to streaming perception, enabling spatiotemporal feature fusion. We evaluate LongShortNet on the challenging Argoverse-HD dataset and demonstrate that it outperforms existing state-of-the-art methods with almost no additional computational cost.