In complex orchard environments, the phenotypic heterogeneity of apple leaf diseases, characterized by significant variation among lesions, challenges traditional multi-scale feature fusion methods. These methods merely integrate the multi-layer features extracted by convolutional neural networks (CNNs) and fail to adequately model the relationships between local and global features. This study therefore proposes a multi-branch recognition framework named CNN-Transformer-CLIP (CT-CLIP). The framework employs a CNN to extract local lesion detail features and a Vision Transformer to capture global structural relationships. An Adaptive Feature Fusion Module (AFFM) then dynamically fuses these features, achieving an effective coupling of local and global information and addressing the diversity in lesion morphology and distribution. In addition, to mitigate interference from complex backgrounds and improve recognition accuracy under few-shot conditions, this study proposes a multimodal image-text learning approach that leverages pre-trained CLIP weights to deeply align visual features with semantic descriptions of diseases. Experimental results show that CT-CLIP achieves accuracies of 97.38% and 96.12% on a publicly available apple disease dataset and a self-built dataset, respectively, outperforming several baseline methods. CT-CLIP thus demonstrates strong capability in agricultural disease recognition, significantly improves identification accuracy under complex environmental conditions, and offers an innovative, practical solution for automated disease recognition in agriculture.
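To make the described architecture concrete, the following is a minimal sketch of the dual-branch design: a small CNN branch for local lesion details, a ViT-style branch for global structure, a learned gate standing in for the paper's AFFM, and cosine-similarity alignment against CLIP-style text embeddings. All module names, dimensions, and the gating formulation are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a CT-CLIP-style dual-branch classifier (assumed design,
# not the authors' code). Requires PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalBranch(nn.Module):
    """CNN branch: extracts local lesion detail features."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):              # x: (B, 3, H, W)
        return self.net(x).flatten(1)  # (B, dim)

class GlobalBranch(nn.Module):
    """ViT-style branch: captures global structural relationships."""
    def __init__(self, dim=256, patch=16, img=224):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)  # patch embedding
        n_tokens = (img // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, x):
        t = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, dim)
        return self.encoder(t).mean(dim=1)                       # (B, dim)

class AFFM(nn.Module):
    """Adaptive Feature Fusion Module: a per-sample sigmoid gate weighting
    local vs. global features (one plausible reading of 'dynamic fusion')."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f_local, f_global):
        g = self.gate(torch.cat([f_local, f_global], dim=-1))  # (B, dim) in [0, 1]
        return g * f_local + (1 - g) * f_global

class CTCLIP(nn.Module):
    """Fuses both branches and aligns the visual feature with CLIP-style
    text embeddings of disease descriptions via cosine similarity."""
    def __init__(self, dim=256, embed_dim=512, num_classes=5):
        super().__init__()
        self.local = LocalBranch(dim)
        self.global_ = GlobalBranch(dim)
        self.affm = AFFM(dim)
        self.proj = nn.Linear(dim, embed_dim)  # project into text-embedding space
        # Placeholder for frozen CLIP text embeddings of class descriptions,
        # e.g. "a photo of an apple leaf with rust lesions" (hypothetical prompt).
        self.text_emb = nn.Parameter(torch.randn(num_classes, embed_dim),
                                     requires_grad=False)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ln(1/0.07), as in CLIP

    def forward(self, x):
        v = self.proj(self.affm(self.local(x), self.global_(x)))
        v = F.normalize(v, dim=-1)
        t = F.normalize(self.text_emb, dim=-1)
        return self.logit_scale.exp() * v @ t.t()  # (B, num_classes) logits

logits = CTCLIP()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 5])
```

In this sketch the gate lets the network lean on local CNN features for fine-grained lesions and on global transformer features for distributed symptoms, while the frozen text embeddings supply the semantic alignment that helps under few-shot conditions.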