In this paper, we explore the challenges of establishing an end-to-end fact-checking pipeline in a real-world context, covering over 90 languages. Our real-world experimental benchmarks demonstrate that fine-tuning Transformer models specifically for fact-checking tasks, such as claim detection and veracity prediction, provides superior performance over large language models (LLMs) such as GPT-4, GPT-3.5-Turbo, and Mistral-7b. However, we illustrate that LLMs excel at generative tasks such as question decomposition for evidence retrieval. Through extensive evaluation, we show the efficacy of fine-tuned models for fact-checking in a multilingual setting and on complex claims that include numerical quantities.