WildScenes: A Benchmark for 2D and 3D Semantic Segmentation in Large-scale Natural Environments

Kavisha Vidanapathirana,Joshua Knights,Stephen Hausler,Mark Cox,Milad Ramezani,Jason Jooste,Ethan Griffiths,Shaheer Mohamed,Sridha Sridharan,Clinton Fookes,Peyman Moghadam

from arxiv, Under review. The first 3 authors contributed equally

Recent progress in semantic scene understanding has primarily been enabled by the availability of semantically annotated bi-modal (camera and lidar) datasets in urban environments. However, such annotated datasets are also needed for natural, unstructured environments to enable semantic perception for applications, including conservation, search and rescue, environment monitoring, and agricultural automation. Therefore, we introduce WildScenes, a bi-modal benchmark dataset consisting of multiple large-scale traversals in natural environments, including semantic annotations in high-resolution 2D images and dense 3D lidar point clouds, and accurate 6-DoF pose information. The data is (1) trajectory-centric with accurate localization and globally aligned point clouds, (2) calibrated and synchronized to support bi-modal inference, and (3) containing different natural environments over 6 months to support research on domain adaptation. Our 3D semantic labels are obtained via an efficient automated process that transfers the human-annotated 2D labels from multiple views into 3D point clouds, thus circumventing the need for expensive and time-consuming human annotation in 3D. We introduce benchmarks on 2D and 3D semantic segmentation and evaluate a variety of recent deep-learning techniques to demonstrate the challenges in semantic segmentation in natural environments. We propose train-val-test splits for standard benchmarks as well as domain adaptation benchmarks and utilize an automated split generation technique to ensure the balance of class label distributions. The data, evaluation scripts and pretrained models will be released upon acceptance at https://csiro-robotics.github.io/WildScenes.

翻译：近年来的语义场景理解研究主要得益于城市环境中语义标注的双模态（相机与激光雷达）数据集的可用性。然而，自然非结构化环境同样需要此类标注数据集，以支持保护、搜救、环境监测及农业自动化等领域的语义感知应用。为此，我们提出WildScenes——一种双模态基准数据集，包含多组在自然环境中大规模遍历采集的数据，涵盖高分辨率二维图像的语义标注、稠密三维激光雷达点云以及精确的六自由度位姿信息。该数据具有以下特性：（1）以轨迹为中心，提供准确定位与全局对齐的点云；（2）经标定与同步处理以支持双模态推理；（3）跨越6个月收集不同自然环境，助力领域适应研究。我们的三维语义标签通过高效自动化流程生成：将人工标注的二维标签从多视角映射至三维点云，从而规避了昂贵且耗时的三维人工标注需求。我们建立了二维与三维语义分割的基准测试，并评估了多种近期深度学习技术，以揭示自然环境中语义分割的挑战性。针对标准基准测试与领域适应基准测试，我们提出了训练-验证-测试划分方案，并采用自动化划分生成技术确保类别标签分布的均衡性。数据、评估脚本及预训练模型将在论文接收后发布于 https://csiro-robotics.github.io/WildScenes。