Multi-task networks can potentially improve performance and computational efficiency compared to single-task networks, facilitating online deployment. However, current multi-task architectures in point cloud perception combine multiple task-specific point cloud representations, each requiring a separate feature encoder, which makes the network structures bulky and slow. We propose PAttFormer, an efficient multi-task architecture for joint semantic segmentation and object detection in point clouds that relies solely on a point-based representation. The network builds on transformer-based feature encoders with neighborhood attention and grid pooling, and on a query-based detection decoder with a novel 3D deformable-attention detection head. Unlike other LiDAR-based multi-task architectures, our proposed PAttFormer does not require separate feature encoders for multiple task-specific point cloud representations, resulting in a network that is 3x smaller and 1.4x faster while achieving competitive performance on the nuScenes and KITTI benchmarks for autonomous driving perception. Our extensive evaluations show substantial gains from multi-task learning, improving LiDAR semantic segmentation by +1.7% in mIoU and 3D object detection by +1.7% in mAP on the nuScenes benchmark compared to the single-task models.
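The neighborhood attention mentioned in the abstract restricts each point's attention to its k nearest neighbors instead of all points. A minimal sketch of this idea in plain NumPy follows; it uses a single head, no learned query/key/value projections, and a brute-force k-NN search, all of which are simplifications not taken from the paper:

```python
import numpy as np

def knn(points, k):
    # Brute-force pairwise distances; return indices of the k nearest
    # neighbors of each point (the point itself is included).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.argsort(d, axis=-1)[:, :k]

def neighborhood_attention(points, feats, k=4):
    # Each point attends only over its k nearest neighbors (no learned
    # projections here; features serve directly as queries/keys/values).
    n, c = feats.shape
    idx = knn(points, k)                      # (n, k) neighbor indices
    q = feats                                 # one query per point
    kv = feats[idx]                           # (n, k, c) neighbor keys/values
    attn = np.einsum('nc,nkc->nk', q, kv) / np.sqrt(c)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)  # softmax over each neighborhood
    return np.einsum('nk,nkc->nc', attn, kv)  # weighted sum of neighbor values

# Toy example: 6 points in 3D with 8-dimensional features.
rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 3))
feats = rng.normal(size=(6, 8))
out = neighborhood_attention(pts, feats, k=3)
print(out.shape)  # (6, 8)
```

In the full architecture this operation would be stacked with learned projections and interleaved with grid pooling to coarsen the point set between stages; the sketch only illustrates the locality constraint that keeps attention cost linear in the number of points.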