GeoPlant: Spatial Plant Species Prediction Dataset

The difficulty of monitoring biodiversity at fine scales and over large areas limits ecological knowledge and conservation efforts. To fill this gap, Species Distribution Models (SDMs) predict species across space from spatially explicit features. Yet SDMs struggle to integrate the rich but heterogeneous data made available over the past decade, notably millions of opportunistic species observations and standardized surveys, as well as multimodal remote sensing data. To address this, we have designed and developed a new European-scale dataset for SDMs at high spatial resolution (10 to 50 m), covering more than 10k species (i.e., most of the European flora). The dataset comprises 5M heterogeneous Presence-Only records and 90k exhaustive Presence-Absence survey records, all accompanied by diverse environmental rasters (e.g., elevation, human footprint, and soil) traditionally used in SDMs. It also provides Sentinel-2 RGB and NIR satellite images at 10 m resolution, a 20-year time series of climatic variables, and satellite time series from the Landsat program. Alongside the data, we provide an openly accessible SDM benchmark hosted on Kaggle, which has already attracted an active community, together with a set of strong baselines for single-predictor/single-modality and multimodal approaches. All resources, including the dataset, pre-trained models, and baseline methods (in the form of notebooks), are available on Kaggle, so that one can start working with the dataset in just two mouse clicks.
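To make the Presence-Absence survey format more concrete, the following is a minimal sketch of how such records could be turned into a multi-label target matrix for training an SDM. It is an illustration only: the `(survey_id, species_id)` record layout and the function name `build_presence_absence` are assumptions made here, not the actual file schema or tooling of the dataset.

```python
from collections import defaultdict


def build_presence_absence(records):
    """Build a binary survey-by-species matrix.

    `records` is an iterable of (survey_id, species_id) pairs, one pair per
    species observed in a survey. This layout is hypothetical and may differ
    from the dataset's real files.
    """
    records = list(records)  # allow generators; we iterate twice

    # Fixed, sorted column order so the matrix is reproducible.
    species = sorted({sp for _, sp in records})
    column = {sp: j for j, sp in enumerate(species)}

    # Group the observed species by survey.
    observed = defaultdict(set)
    for survey_id, sp in records:
        observed[survey_id].add(sp)

    survey_ids = sorted(observed)
    matrix = [[0] * len(species) for _ in survey_ids]
    for i, survey_id in enumerate(survey_ids):
        for sp in observed[survey_id]:
            matrix[i][column[sp]] = 1  # 1 = present; 0 = absent

    return survey_ids, species, matrix


if __name__ == "__main__":
    toy = [(1, "sp_a"), (1, "sp_b"), (2, "sp_b")]
    ids, cols, y = build_presence_absence(toy)
    print(ids, cols, y)
```

Treating a zero as a true absence is only reasonable for exhaustive surveys; for Presence-Only records a missing species says nothing about absence, which is why the two record types are usually handled differently.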
