Global tree species mapping using remote sensing data is vital for biodiversity monitoring, forest management, and ecological research. However, progress in this field has been constrained by the scarcity of large-scale labeled datasets. To address this, we introduce GlobalGeoTree, a comprehensive global dataset for tree species classification. GlobalGeoTree comprises 6.3 million geolocated tree occurrences organized in a taxonomic hierarchy of 275 families, 2,734 genera, and 21,001 species. Each sample is paired with a Sentinel-2 image time series and 27 auxiliary environmental variables covering bioclimatic, geographic, and soil data. The dataset is partitioned into GlobalGeoTree-6M for model pretraining and curated evaluation subsets, primarily GlobalGeoTree-10kEval for zero-shot and few-shot benchmarking. To demonstrate the utility of the dataset, we introduce a baseline model, GeoTreeCLIP, a vision-language model pretrained on GlobalGeoTree-6M that aligns the remote sensing data with paired taxonomic text labels. Experimental results show that GeoTreeCLIP substantially outperforms existing advanced models in zero-shot and few-shot classification on GlobalGeoTree-10kEval. By making the dataset, models, and code publicly available, we aim to establish a benchmark that advances tree species classification and fosters innovation in biodiversity research and ecological applications.
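
To make the zero-shot setting concrete, the sketch below shows how a CLIP-style model could assign a taxonomic label to a sample by comparing embeddings. It is an illustration under stated assumptions, not the GeoTreeCLIP implementation: the random vectors stand in for encoder outputs, and the "family genus species" label format, the function names, and the temperature value are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def zero_shot_predict(sample_emb, label_emb, temperature=0.07):
    """Return the best-matching label index and label probabilities per sample.

    sample_emb: (n_samples, dim) embeddings of remote sensing samples.
    label_emb:  (n_labels, dim) embeddings of taxonomic text labels.
    temperature: assumed softmax temperature (hypothetical value).
    """
    logits = l2_normalize(sample_emb) @ l2_normalize(label_emb).T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# Hypothetical text labels following an assumed "family genus species" pattern.
labels = [
    "Fagaceae Quercus Quercus robur",
    "Pinaceae Pinus Pinus sylvestris",
    "Betulaceae Betula Betula pendula",
]

# Stand-ins for encoder outputs: random label embeddings, and sample
# embeddings built as slightly noisy copies of the labels at true_idx.
rng = np.random.default_rng(0)
label_emb = rng.normal(size=(len(labels), 16))
true_idx = np.array([1, 0, 2])
sample_emb = label_emb[true_idx] + 0.01 * rng.normal(size=(3, 16))

pred_idx, probs = zero_shot_predict(sample_emb, label_emb)
for i, p in zip(pred_idx, probs.max(axis=1)):
    print(f"{labels[i]} (p={p:.3f})")
```

Because classification reduces to a nearest-label lookup in the shared embedding space, new species can be added at inference time by embedding their text labels, with no retraining of a classification head.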