Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark containing 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardized training and evaluation. All models are trained and evaluated under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability. The results reveal distinct performance profiles. CNNs achieve the highest accuracy on lab imagery but degrade under domain shift. Contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance. Generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate accuracy alone.
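The two evaluation metrics above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and the parsing rule for PSR (exact case-insensitive match against the label set) is an assumed simplification of how free-text generations would be mapped to class labels.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (treats all classes equally)."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def parse_success_rate(raw_outputs, label_set):
    """Fraction of free-text generations that parse to a valid class label.

    Assumed parsing rule: strip whitespace and lowercase, then require an
    exact match against the known label set.
    """
    parsed = sum(1 for out in raw_outputs if out.strip().lower() in label_set)
    return parsed / len(raw_outputs)
```

Macro-F1 averages per-class F1 without frequency weighting, so rare diseases count as much as common ones; PSR captures the generative-VLM failure mode the abstract mentions, where an otherwise plausible free-text answer cannot be mapped to any valid label.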