GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification

Global tree species mapping using remote sensing data is vital for biodiversity monitoring, forest management, and ecological research. However, progress in this field has been constrained by the scarcity of large-scale, labeled datasets. To address this, we introduce GlobalGeoTree, a comprehensive global dataset for tree species classification. GlobalGeoTree comprises 6.3 million geolocated tree occurrences, spanning 275 families, 2,734 genera, and 21,001 species across the hierarchical taxonomic levels. Each sample is paired with Sentinel-2 image time series and 27 auxiliary environmental variables, encompassing bioclimatic, geographic, and soil data. The dataset is partitioned into GlobalGeoTree-6M for model pretraining and curated evaluation subsets, primarily GlobalGeoTree-10kEval for zero-shot and few-shot benchmarking. To demonstrate the utility of the dataset, we introduce a baseline model, GeoTreeCLIP, which leverages paired remote sensing data and taxonomic text labels within a vision-language framework pretrained on GlobalGeoTree-6M. Experimental results show that GeoTreeCLIP achieves substantial improvements in zero- and few-shot classification on GlobalGeoTree-10kEval over existing advanced models. By making the dataset, models, and code publicly available, we aim to establish a benchmark to advance tree species classification and foster innovation in biodiversity research and ecological applications.
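The zero-shot setting mentioned above can be illustrated with a minimal sketch of how a CLIP-style model assigns a label: the embedding of a remote sensing sample is compared with the embeddings of candidate taxonomic text labels, and the most similar label is chosen. This is not GeoTreeCLIP's actual code. The function names, the "family genus species" label format, and the three-dimensional toy embeddings are all assumptions made for illustration; in a real model, learned image and text encoders would produce the vectors.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(image_emb, text_embs):
    """Return the cosine similarity between one image embedding and each label embedding."""
    img = normalize(image_emb)
    return {
        label: sum(a * b for a, b in zip(img, normalize(emb)))
        for label, emb in text_embs.items()
    }


def zero_shot_predict(image_emb, text_embs):
    """Pick the label whose text embedding is most similar to the image embedding."""
    scores = cosine_scores(image_emb, text_embs)
    return max(scores, key=scores.get)


# Hypothetical text embeddings for three candidate labels (toy values,
# standing in for the output of a text encoder).
text_embs = {
    "Pinaceae Pinus Pinus sylvestris": [0.9, 0.1, 0.0],
    "Fagaceae Quercus Quercus robur": [0.1, 0.9, 0.1],
    "Fagaceae Fagus Fagus sylvatica": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of one sample (toy values, standing in for the
# output of an image encoder applied to a Sentinel-2 time series).
image_emb = [0.2, 0.8, 0.1]

scores = cosine_scores(image_emb, text_embs)
prediction = zero_shot_predict(image_emb, text_embs)
print(prediction)
```

Because the labels are only text, new species can be added to the candidate set without retraining, which is what makes zero-shot evaluation on a held-out subset possible in this kind of framework.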

