The integration of AI tools into medical applications aims to improve the efficiency of diagnosis. The emergence of large language models (LLMs), such as ChatGPT and Claude, has expanded this integration further, despite concerns about their environmental impact. Because LLMs are versatile and easy to access through APIs, these larger models are often used even where smaller, custom models would suffice. In this paper, LLMs and small discriminative models are integrated into a Mendix application to detect Covid-19 in chest X-rays. The discriminative models are also used to provide knowledge bases for the LLMs to improve their accuracy. The result is a benchmark study of 14 different model configurations, compared on diagnostic accuracy and environmental impact. The findings indicate that while the smaller models reduced the carbon footprint of the application, their output was biased towards a positive diagnosis and their output probabilities lacked confidence. Meanwhile, restricting LLMs to give only probabilistic output led to poor performance in both accuracy and carbon footprint, demonstrating the risk of using LLMs as a universal AI solution. Although using the smaller LLM GPT-4.1-Nano reduced the carbon footprint by 94.2% compared to the larger models, its footprint was still disproportionately large relative to the discriminative models; the most efficient solution was the Covid-Net model. Although it had a larger carbon footprint than the other small models, its carbon footprint was 99.9% lower than that of GPT-4.5-Preview, whilst it achieved an accuracy of 95.5%, the highest of all models examined. This paper contributes to knowledge by comparing generative and discriminative models for Covid-19 detection and by highlighting the environmental risk of using generative tools for classification tasks.