Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application to domain-specific agricultural tasks such as plant pathology remains limited by the lack of large-scale, comprehensive multimodal image--text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the ability of VLMs to understand plant diseases. The dataset comprises 186,000 digital leaf images spanning 97 disease classes, each paired with metadata, from which we generate 13,950 question-answer pairs covering six critical agricultural tasks. The questions assess multiple facets of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on LeafBench reveals substantial disparities in their disease understanding capabilities. Performance varies markedly across tasks: binary healthy--diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. A direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology and underscore the need for LeafBench as a rigorous framework for methodological advancement and for measuring progress toward reliable AI-assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.
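The per-task accuracies reported above can be computed with a simple exact-match scoring loop over question-answer pairs grouped by task. The sketch below is illustrative only: the field names (`task`, `answer`, `prediction`) and the example task labels are assumptions for demonstration, not the actual LeafBench schema or evaluation protocol.

```python
# Hypothetical sketch of per-task accuracy scoring for a VQA benchmark.
# Record keys ("task", "answer", "prediction") are illustrative assumptions,
# not the actual LeafBench data format.
from collections import defaultdict

def per_task_accuracy(records):
    """Compute case-insensitive exact-match accuracy per task.

    records: iterable of dicts with 'task', 'answer', 'prediction' keys.
    Returns a dict mapping task name -> accuracy in [0, 1].
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["task"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Toy example with made-up task names and model outputs.
records = [
    {"task": "healthy_vs_diseased", "answer": "diseased", "prediction": "Diseased"},
    {"task": "healthy_vs_diseased", "answer": "healthy", "prediction": "healthy"},
    {"task": "pathogen_id", "answer": "fungal", "prediction": "bacterial"},
]
print(per_task_accuracy(records))
# → {'healthy_vs_diseased': 1.0, 'pathogen_id': 0.0}
```

Exact-match scoring is a common baseline for closed-form VQA answers; free-text diagnostic answers would typically need a more tolerant matcher.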