Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
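To illustrate why the scoring method can change both open-ended accuracy and model rankings, the following is a minimal sketch, not the evaluation code used in this work. It compares strict exact-match scoring with a lenient "semantic" scorer. The synonym table, the `toy_semantic_judge` function, the model names, and all predictions are hypothetical placeholders; in particular, the lookup table merely stands in for an LLM-based judge.

```python
from typing import Callable, Dict, List, Set, Tuple

# A judge decides whether a free-text prediction counts as the gold label.
Judge = Callable[[str, str], bool]


def normalize(text: str) -> str:
    """Lowercase and unify separators so 'Fall_Armyworm' == 'fall armyworm'."""
    return " ".join(text.lower().replace("_", " ").replace("-", " ").split())


def exact_match(prediction: str, label: str) -> bool:
    """Raw scoring: correct only if the normalized strings are identical."""
    return normalize(prediction) == normalize(label)


# Hypothetical synonym table standing in for an LLM judge (assumption).
TOY_SYNONYMS: Dict[str, Set[str]] = {
    "tomato late blight": {
        "late blight of tomato",
        "phytophthora infestans on tomato",
    },
    "fall armyworm": {"spodoptera frugiperda"},
}


def toy_semantic_judge(prediction: str, label: str) -> bool:
    """Lenient scoring: accept exact matches and listed paraphrases."""
    if exact_match(prediction, label):
        return True
    return normalize(prediction) in TOY_SYNONYMS.get(normalize(label), set())


def accuracy(pairs: List[Tuple[str, str]], judge: Judge) -> float:
    """Fraction of (prediction, label) pairs the judge accepts."""
    if not pairs:
        raise ValueError("need at least one (prediction, label) pair")
    return sum(judge(pred, gold) for pred, gold in pairs) / len(pairs)


def rank_models(results: Dict[str, List[Tuple[str, str]]], judge: Judge) -> List[str]:
    """Model names sorted from highest to lowest accuracy under the judge."""
    scores = {name: accuracy(pairs, judge) for name, pairs in results.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Made-up predictions for two hypothetical models (illustration only).
TOY_RESULTS: Dict[str, List[Tuple[str, str]]] = {
    "model_a": [
        ("Tomato late blight", "tomato_late_blight"),
        ("Fall armyworm", "fall_armyworm"),
        ("aphid", "fall_armyworm"),
        ("healthy leaf", "tomato_late_blight"),
    ],
    "model_b": [
        ("Late blight of tomato", "tomato_late_blight"),
        ("Spodoptera frugiperda", "fall_armyworm"),
        ("Phytophthora infestans on tomato", "tomato_late_blight"),
        ("aphid", "fall_armyworm"),
    ],
}

print("exact-match ranking:", rank_models(TOY_RESULTS, exact_match))
print("semantic ranking:   ", rank_models(TOY_RESULTS, toy_semantic_judge))
```

In this toy setup, the model that answers with paraphrases and scientific names scores zero under exact match but overtakes the other model once paraphrases are accepted, which is the kind of ranking shift that semantic judging can produce.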