Sensor fusion approaches for intelligent self-driving agents remain key to driving scene understanding, which relies on the global visual context acquired from input sensors. Specifically, for the local waypoint prediction task, single-modality networks are still limited by a strong dependency on the sensitivity of the input sensor, and thus recent works promote fusing multiple sensors at the feature level. While it is well known that multiple data modalities promote mutual contextual exchange, deployment to practical driving scenarios requires global 3D scene understanding in real time with minimal computation, which places greater significance on training strategies given a limited number of practically usable sensors. In this light, we exploit carefully selected auxiliary tasks that are highly correlated with the target task of interest (e.g., traffic light recognition and semantic segmentation) by fusing auxiliary task features and by using auxiliary heads for waypoint prediction based on imitation learning. Our multi-task feature fusion augments and improves the base network, TransFuser, by significant margins for safer and more complete road navigation in the CARLA simulator, as validated on the Town05 Benchmark through extensive experiments.