LLM is Not All You Need: A Systematic Evaluation of ML vs. Foundation Models for Text- and Image-Based Medical Classification

The combination of multimodal Vision-Language Models (VLMs) and Large Language Models (LLMs) opens up new possibilities for medical classification. This work presents a rigorous, unified benchmark that contrasts traditional Machine Learning (ML) with contemporary transformer-based techniques on four publicly available datasets spanning text and image modalities and both binary and multiclass tasks. For each task, we evaluated three model classes: classical ML (logistic regression (LR), LightGBM, ResNet-50), prompt-based LLMs/VLMs (Gemini 2.5, used zero-shot), and models fine-tuned with Parameter-Efficient Fine-Tuning (PEFT), namely LoRA-adapted Gemma3 variants. All experiments used consistent data splits and aligned metrics. Classical ML models set a high standard, achieving the best overall performance on most of the medical classification tasks, and they performed especially well on the structured text-based datasets. In stark contrast, the LoRA-tuned Gemma variants performed worst in every text and image experiment, failing to generalize from the minimal fine-tuning provided. The zero-shot LLM/VLM pipelines (Gemini 2.5) gave mixed results: they performed poorly on the text-based tasks but were competitive on the multiclass image task, matching the classical ResNet-50 baseline. These results show that in many medical classification scenarios, established ML models remain the most reliable option. They also suggest that foundation models are not universally superior and that the effectiveness of PEFT depends heavily on the adaptation strategy, since minimal fine-tuning proved detrimental in this study.
