This study compares the performance of (1) fine-tuned models and (2) extremely large language models on the task of check-worthy claim detection. For the purpose of the comparison, we composed a multilingual and multi-topical dataset comprising texts from various sources and in various styles. Building on this, we performed a benchmark analysis to determine the most general multilingual and multi-topical claim detector. We chose three state-of-the-art models for the check-worthy claim detection task and fine-tuned them. Furthermore, we selected three state-of-the-art extremely large language models and used them without any fine-tuning. We modified the models to adapt them to multilingual settings and, through extensive experimentation and evaluation, assessed the performance of all the models in terms of accuracy, recall, and F1-score in in-domain and cross-domain scenarios. Our results demonstrate that, despite the technological progress in natural language processing, the models fine-tuned for check-worthy claim detection still outperform the zero-shot approaches in cross-domain settings.