Medical image classifiers detect gastrointestinal diseases well, but they do not explain their decisions. Large language models (LLMs) can generate clinical text, yet they struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the kind of reasoning a clinician expects. We introduce a framework that links image classification with structured clinical reasoning. A new hybrid model, MobileCoAtNet, is designed for endoscopic images and achieves high accuracy across eight stomach-related classes. Its outputs are then used to drive reasoning by several LLMs. To judge this reasoning, we build two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care. Thirty-two LLMs are evaluated against these gold standards. Strong classification improves the quality of their explanations, but none of the models reach human-level stability; even the best LLMs change their reasoning when prompts vary. Our study shows that combining deep learning with LLMs can produce useful clinical narratives, but current LLMs remain unreliable for high-stakes medical decisions. The framework provides a clearer view of their limits and a path toward safer reasoning systems. The complete source code and datasets used in this study are available at https://github.com/souravbasakshuvo/DL3M.