Large language models (LLMs) such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but their cost and opacity limit use at scale. We fine-tune three small LLMs, Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B, with QLoRA on SciFact and HealthVer, providing the first study comparing QLoRA-tuned models against GPT-4o and fine-tuned BioLinkBERT encoders. Mistral-7B with QLoRA surpasses both GPT-4o and GPT-5 (up to a 12% F1 gain) at a fraction of the cost, using just 1,008 training examples. We conduct extensive in-domain and cross-domain evaluation: models trained on SciFact are tested on HealthVer and vice versa, at matched training-set sizes to isolate the effect of dataset structure from that of data quantity. We identify a previously unreported structural artifact in SciFact that inflates in-domain scores, and show through bidirectional out-of-domain evaluation that training on structurally sound data enables robust cross-domain transfer. We plan to release all code and adapter checkpoints.