The difficulty of monitoring biodiversity at fine scales and over large areas limits ecological knowledge and conservation efforts. To fill this gap, Species Distribution Models (SDMs) predict where species occur across space from spatially explicit features. Yet they face the challenge of integrating the rich but heterogeneous data made available over the past decade, notably millions of opportunistic species observations and standardized surveys, as well as multimodal remote sensing data. To address this, we designed and developed a new European-scale dataset for SDMs at high spatial resolution (10–50 m), covering more than 10k species (i.e., most of the European flora). The dataset comprises 5M heterogeneous Presence-Only records and 90k exhaustive Presence-Absence survey records, all accompanied by diverse environmental rasters (e.g., elevation, human footprint, and soil) traditionally used in SDMs. It also provides Sentinel-2 RGB and NIR satellite images at 10 m resolution, a 20-year time series of climatic variables, and satellite time series from the Landsat program. Alongside the data, we provide an openly accessible SDM benchmark (hosted on Kaggle), which has already attracted an active community, and a set of strong baselines for single-predictor/single-modality and multimodal approaches. All resources, including the dataset, pre-trained models, and baseline methods (in the form of notebooks), are available on Kaggle, so one can start working with the dataset in two mouse clicks.