Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
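The gap between raw and judged open-ended accuracy is easier to see in a small worked example. The sketch below is not the authors' evaluation code: the `toy_judge` function and its `SYNONYMS` table are hypothetical stand-ins for an LLM-based semantic judge, and the prediction/label pairs are invented for illustration only, not taken from the AgML datasets or the reported results.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(cleaned.split())


def exact_match(prediction: str, label: str) -> bool:
    """Raw open-ended scoring: correct only if the normalized strings are identical."""
    return normalize(prediction) == normalize(label)


# Hypothetical synonym table standing in for an LLM judge (assumption, not from the paper).
SYNONYMS = {
    "tomato late blight": {"late blight of tomato"},
}


def toy_judge(prediction: str, label: str) -> bool:
    """Judged scoring: also accept answers the (toy) judge treats as equivalent."""
    if exact_match(prediction, label):
        return True
    accepted = {normalize(s) for s in SYNONYMS.get(normalize(label), set())}
    return normalize(prediction) in accepted


def accuracy(pairs: Iterable[Tuple[str, str]],
             is_correct: Callable[[str, str], bool]) -> float:
    """Fraction of (prediction, label) pairs marked correct by the given scorer."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("accuracy is undefined for an empty evaluation set")
    return sum(is_correct(p, l) for p, l in pairs) / len(pairs)


# Invented (prediction, ground-truth label) pairs, for illustration only.
EXAMPLE_PAIRS = [
    ("Tomato late blight", "tomato late blight"),     # correct under both scorers
    ("Late blight of tomato", "tomato late blight"),  # correct only under the judge
    ("healthy leaf", "tomato late blight"),           # wrong under both
    ("Aphids", "aphid"),                              # wrong under both (no table entry)
]

raw = accuracy(EXAMPLE_PAIRS, exact_match)
judged = accuracy(EXAMPLE_PAIRS, toy_judge)
print(f"raw accuracy:    {raw:.2f}")
print(f"judged accuracy: {judged:.2f}")
```

On these invented pairs the same model outputs score 0.25 under exact matching and 0.50 under the toy judge, which illustrates why the choice of scorer can shift both absolute accuracy and model rankings.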
