Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently outperforms every foundation model by a wide margin. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing the top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but they can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.