This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and to provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero- and few-shot prompting and parameter-efficient fine-tuning across a range of open- and closed-source models, evaluating their performance on veracity prediction and explanation generation both as isolated tasks and as a joint task. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria applied through human evaluation. Our automatic evaluation indicates that GPT-4 is the standout performer in the zero-shot scenario, but that in few-shot and parameter-efficient fine-tuning settings, open-source models not only close the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals further nuance and indicates potential problems with the gold explanations.