This paper presents MinkUNeXt, an effective and efficient architecture for place recognition from point clouds, built entirely on the new 3D MinkNeXt Block: a residual block composed of 3D sparse convolutions that follows the design philosophy established by recent Transformers while using only simple 3D convolutions. Features are extracted at multiple scales by a U-Net encoder-decoder network and aggregated into a single descriptor by Generalized Mean Pooling (GeM). The proposed architecture demonstrates that the current state of the art can be surpassed by relying solely on conventional 3D sparse convolutions, without resorting to more complex and sophisticated mechanisms such as Transformers, attention layers, or deformable convolutions. A thorough assessment of the proposal has been carried out using the Oxford RobotCar and In-house datasets. As a result, MinkUNeXt proves to outperform other state-of-the-art methods.
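The GeM aggregation step mentioned above computes, per descriptor dimension, the p-th power mean over all point features; p = 1 recovers average pooling and large p approaches max pooling. The following is a minimal NumPy sketch of this operation, not the paper's actual implementation (which operates on sparse voxel features with a learned p):

```python
import numpy as np

def gem_pool(features, p=3.0, eps=1e-6):
    """Generalized Mean (GeM) pooling over a set of point features.

    features: (N, D) array of N per-point feature vectors.
    p = 1 reduces to average pooling; p -> infinity approaches
    max pooling. In learned pipelines, p is usually a trainable scalar.
    """
    clipped = np.clip(features, eps, None)  # keep bases positive for the power
    return np.power(np.mean(np.power(clipped, p), axis=0), 1.0 / p)

# Toy example: 4 points, each with a 2-dimensional feature
x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [2.0, 2.0],
              [4.0, 8.0]])
descriptor = gem_pool(x, p=3.0)  # single global descriptor of shape (2,)
```

The single output vector plays the role of the global place descriptor that is then compared (e.g., by nearest-neighbor search) against descriptors of previously visited places.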