Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application to domain-specific agricultural tasks such as plant pathology remains limited by the lack of large-scale, comprehensive multimodal image--text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark designed to systematically evaluate the ability of VLMs to understand plant diseases. The dataset comprises 186,000 digital leaf images spanning 97 disease classes, each paired with metadata, from which we generate 13,950 question-answer pairs covering six critical agricultural tasks. The questions assess multiple aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on LeafBench reveals substantial disparities in their disease understanding capabilities. Performance varies markedly across tasks: binary healthy--diseased classification exceeds 90\% accuracy, while fine-grained pathogen and species identification remains below 65\%. A direct comparison between vision-only models and VLMs demonstrates the advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology and position LeafBench as a rigorous framework for methodological advancement and for measuring progress toward reliable AI-assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.