Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical for understanding ocean health and climate feedbacks. Existing classification models perform well on individual collections but fail to generalize across instruments and environments, because their training datasets are isolated and their labels inconsistent. To address this, we introduce Planktonzilla-17M, a unified dataset that consolidates publicly available plankton image collections from thirteen imaging systems. It comprises 17.4 million images with standardized taxonomy and geo-environmental metadata, including 3.74 million plankton images covering over 602 taxonomic classes, of which 201 are identified at the species level, making it the largest and most comprehensive plankton image dataset to date. Using this large-scale dataset, we perform a controlled comparison between supervised training and CLIP-style image-text training on a shared ViT backbone. We find that a supervised classifier matches or exceeds a CLIP-style model trained with taxonomic lineage as its text input. We further observe that BioCLIP and BioCLIP2 perform poorly on plankton in zero-shot and few-shot settings. Training on Planktonzilla-17M improves plankton classification performance, which highlights the limitations of current biological foundation models in marine imaging domains.