Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2×2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote accuracy over the general 4B baseline (53.3% vs. 46.4%, McNemar p < 10^-4). RAG over MedMCQA explanations does not produce a statistically significant gain in either model, and in the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16). At this scale and on this benchmark, domain knowledge encoded in weights dominates domain knowledge supplied in context. We release the full experiment code and JSONL traces to support replication.
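The evaluation protocol summarized above (majority vote over three repetitions per question, exact McNemar test on paired per-question correctness) can be sketched as follows. This is a minimal illustration, not the released pipeline: the JSONL record layout (`qid`, `gold`, `answers`) and the file names are hypothetical placeholders for the actual trace format.

```python
# Sketch of the evaluation protocol: majority vote over 3 sampled answers per
# question, accuracy per configuration, and an exact McNemar test comparing
# two configurations on paired per-question correctness. Record layout and
# file names below are illustrative assumptions, not the released format.
import json
from collections import Counter
from scipy.stats import binomtest

def majority_vote(answers):
    """Most frequent answer letter among the repetitions (ties broken arbitrarily)."""
    return Counter(answers).most_common(1)[0][0]

def load_correctness(path):
    """Map question id -> 1/0 correctness of the majority-vote answer.

    Assumes one JSON object per line, e.g. (hypothetical layout):
    {"qid": "q001", "gold": "B", "answers": ["B", "B", "C"]}
    """
    correct = {}
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            correct[rec["qid"]] = int(majority_vote(rec["answers"]) == rec["gold"])
    return correct

def mcnemar_exact(correct_a, correct_b):
    """Exact McNemar test on paired correctness dicts keyed by question id."""
    b = sum(1 for q in correct_a if correct_a[q] == 1 and correct_b[q] == 0)
    c = sum(1 for q in correct_a if correct_a[q] == 0 and correct_b[q] == 1)
    # Under H0 the discordant pairs split 50/50; exact two-sided binomial test.
    return binomtest(b, b + c, 0.5, alternative="two-sided").pvalue

if __name__ == "__main__":
    # Placeholder file names for two cells of the 2x2 design (no-RAG column).
    base = load_correctness("gemma3_4b_norag.jsonl")
    tuned = load_correctness("medgemma_4b_norag.jsonl")
    acc = lambda d: sum(d.values()) / len(d)
    print(f"baseline {acc(base):.3f}  domain-tuned {acc(tuned):.3f}  "
          f"McNemar p = {mcnemar_exact(tuned, base):.2e}")
```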