A Two-Stage Multitask Vision-Language Framework for Explainable Crop Disease Visual Question Answering

Visual question answering (VQA) for crop disease analysis requires accurate visual understanding and reliable language generation. In this work, we present a lightweight and explainable vision-language framework for crop and disease identification from leaf images. The proposed approach integrates a Swin Transformer vision encoder with sequence-to-sequence language decoders. The vision encoder is first trained in a multitask setup for both plant and disease classification, and then frozen while the text decoders are trained, forming a two-stage training strategy that enhances visual representation learning and cross-modal alignment. We evaluate the model on the large-scale Crop Disease Domain Multimodal (CDDM) dataset using both classification and natural language generation metrics. Experimental results demonstrate near-perfect recognition performance, achieving 99.94% plant classification accuracy and 99.06% disease classification accuracy, along with strong BLEU, ROUGE and BERTScore results. Without fine-tuning, the model further generalizes well to the external PlantVillageVQA benchmark, achieving 83.18% micro accuracy in the VQA task. Our lightweight design outperforms larger vision-language baselines while using significantly fewer parameters. Explainability is assessed through Grad-CAM and token-level attribution, providing interpretable visual and textual evidence for predictions. Qualitative results demonstrate robust performance under diverse user-driven queries, highlighting the effectiveness of task-specific visual pretraining and the two-stage training methodology for crop disease visual question answering. An interactive demo of the proposed Swin-T5 model is publicly available as a Gradio-based application at https://huggingface.co/spaces/Zahid16/PlantDiseaseVQAwithSwinT5 for community use.
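The abstract reports micro accuracy on PlantVillageVQA but does not define it. As a minimal sketch, assuming that "micro accuracy" means correct answers pooled over all question-answer pairs regardless of question category, the snippet below contrasts it with a per-category (macro) average. The function name, the category labels, and the toy records are hypothetical and are not taken from the paper or the benchmark.

```python
from collections import defaultdict


def micro_macro_accuracy(records):
    """Return (micro, macro) accuracy for (category, is_correct) pairs.

    Assumption: micro accuracy pools every answer into one ratio, while
    macro accuracy averages the per-category accuracies with equal weight.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        hits[category] += int(is_correct)
    if not totals:
        raise ValueError("no records to score")
    micro = sum(hits.values()) / sum(totals.values())
    macro = sum(hits[c] / totals[c] for c in totals) / len(totals)
    return micro, macro


# Hypothetical toy data: 10 disease questions (8 correct), 2 crop questions (1 correct).
records = (
    [("disease", True)] * 8
    + [("disease", False)] * 2
    + [("crop", True)]
    + [("crop", False)]
)
micro, macro = micro_macro_accuracy(records)
print(f"micro={micro:.2f} macro={macro:.2f}")
```

Under this reading, the pooled figure is dominated by the most frequent question types, so a micro score can differ noticeably from a per-category average when categories are imbalanced.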
