PT-WNO: Point Transformer with Wavelet Neural Operator for 3D Point Cloud Semantic Segmentation

Point cloud semantic segmentation requires architectures that capture both fine-grained local geometry and broad global scene structure. Transformer-based networks have demonstrated strong performance by focusing on detailed local feature aggregation; however, global context is conveyed primarily through skip connections across encoder-decoder stages, which we argue is insufficient for full scene understanding. We hypothesize that augmenting skip connections with a learnable global feature extraction module allows the network to acquire scene-level knowledge before descending into local detail, leading to richer and more contextually grounded representations. To this end, we propose Point Transformer with Wavelet Neural Operator (PT-WNO), which integrates a shared Wavelet Neural Operator (WNO) branch alongside the skip connections of a point cloud transformer backbone. At each encoder-decoder transition, point features are projected onto a dense 3D volumetric grid where the WNO captures multi-scale global spectral context through learnable wavelet decomposition and reconstruction. These global features are fused back into the network via lightweight adapters, complementing rather than replacing the existing skip connections. Experiments on four large-scale 3D point cloud benchmarks demonstrate the effectiveness of PT-WNO. On S3DIS (Area 5), PT-WNO achieves 71.59% mIoU, outperforming the Point Transformer v3 (PTv3) baseline by +1.03 points. On DALES it achieves 81.05% mIoU (+1.47 over the baseline). On ScanNet v2, PT-WNO obtains 76.19% mIoU, remaining competitive with the baseline (76.36%).
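The data flow described in the abstract (project point features onto a dense grid, apply a learnable wavelet decomposition and reconstruction, then fuse the result back alongside the skip connection) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes mean pooling for the point-to-grid projection, a single-level Haar wavelet, one channel-mixing matrix per subband as the "learnable" part, and a single linear map as the adapter. All function and parameter names (`voxelize_mean`, `band_weights`, `adapter`, `grid`) are hypothetical.

```python
import numpy as np

def voxelize_mean(xyz, feats, grid=8):
    """Mean-pool per-point features into a dense (grid, grid, grid, C) volume."""
    lo, hi = xyz.min(axis=0), xyz.max(axis=0)
    # Map each coordinate to an integer cell index in [0, grid - 1].
    idx = np.floor((xyz - lo) / (hi - lo + 1e-9) * grid).astype(np.int64)
    idx = np.clip(idx, 0, grid - 1)
    flat = np.ravel_multi_index(tuple(idx.T), (grid, grid, grid))
    vol = np.zeros((grid ** 3, feats.shape[1]))
    cnt = np.zeros(grid ** 3)
    np.add.at(vol, flat, feats)   # unbuffered scatter-add of features
    np.add.at(cnt, flat, 1.0)     # number of points per cell
    vol /= np.maximum(cnt, 1.0)[:, None]
    return vol.reshape(grid, grid, grid, -1), flat

def haar_split(v, axis):
    """One Haar analysis step along `axis` (length must be even)."""
    a = np.take(v, np.arange(0, v.shape[axis], 2), axis=axis)
    b = np.take(v, np.arange(1, v.shape[axis], 2), axis=axis)
    return (a + b) / np.sqrt(2), (a - b) / np.sqrt(2)

def haar_merge(low, high, axis):
    """Inverse of haar_split: interleave the reconstructed even/odd samples."""
    a = (low + high) / np.sqrt(2)
    b = (low - high) / np.sqrt(2)
    shape = list(low.shape)
    shape[axis] *= 2
    return np.stack([a, b], axis=axis + 1).reshape(shape)

def dwt3(v):
    """Single-level 3D Haar transform: returns 8 subbands."""
    bands = [v]
    for axis in range(3):
        bands = [half for band in bands for half in haar_split(band, axis)]
    return bands

def idwt3(bands):
    """Inverse of dwt3."""
    for axis in reversed(range(3)):
        bands = [haar_merge(bands[i], bands[i + 1], axis)
                 for i in range(0, len(bands), 2)]
    return bands[0]

def wno_skip_fusion(xyz, skip_feats, band_weights, adapter, grid=8):
    """Add grid-level wavelet context to skip features (assumed fusion rule)."""
    vol, flat = voxelize_mean(xyz, skip_feats, grid)
    # Assumed learnable step: mix channels separately in each subband.
    mixed = [b @ w for b, w in zip(dwt3(vol), band_weights)]
    global_vol = idwt3(mixed)
    # Gather each point's cell feature and add it to the untouched skip path.
    global_feats = global_vol.reshape(grid ** 3, -1)[flat]
    return skip_feats + global_feats @ adapter

# Usage with random data: 100 points, 4 feature channels.
rng = np.random.default_rng(0)
xyz = rng.random((100, 3))
feats = rng.standard_normal((100, 4))
weights = [rng.standard_normal((4, 4)) * 0.1 for _ in range(8)]
fused = wno_skip_fusion(xyz, feats, weights, np.eye(4) * 0.5, grid=8)
```

Because the adapter output is added to `skip_feats` rather than substituted for it, the sketch mirrors the abstract's statement that the global branch complements the skip connection. A zero adapter leaves the skip features unchanged. The abstract also states that the WNO branch is shared across encoder-decoder transitions. In this sketch that would correspond to reusing the same `band_weights` at every stage, which is again an assumption.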
