The increasing threat of disinformation calls for automating parts of the fact-checking pipeline. Identifying text segments that require fact-checking is the goal of two related tasks: claim detection (CD) and claim check-worthiness detection (CW). The latter incorporates complex, domain-specific criteria of worthiness and is often framed as a ranking task. Zero- and few-shot LLM prompting is an attractive option for both tasks, as it bypasses the need for labeled datasets and allows verbalized claim and worthiness criteria to be used directly in the prompt. We evaluate the LLMs' predictive and calibration accuracy on five CD/CW datasets from diverse domains, each using a different worthiness criterion. We investigate two key aspects: (1) how best to distill factuality and worthiness criteria into a prompt and (2) how much context to provide for each claim. To this end, we vary the level of prompt verbosity and the amount of contextual information provided to the model. Our results show that optimal prompt verbosity is domain-dependent, that adding context does not improve performance, and that confidence scores can be used directly to produce reliable check-worthiness rankings.
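The idea of turning model confidence scores into a check-worthiness ranking, and of checking how well those scores are calibrated, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the confidence values are taken to be the model's probability for the "check-worthy" class, and average precision and expected calibration error are common choices for ranking and calibration quality rather than metrics confirmed by the abstract.

```python
from typing import List, Sequence, Tuple


def rank_by_confidence(
    texts: Sequence[str], confidences: Sequence[float]
) -> List[Tuple[str, float]]:
    """Sort segments from most to least check-worthy by confidence score."""
    return sorted(zip(texts, confidences), key=lambda pair: pair[1], reverse=True)


def average_precision(labels_in_rank_order: Sequence[int]) -> float:
    """Average precision of a ranking, given gold labels (1/0) in rank order."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(labels_in_rank_order, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / hits if hits else 0.0


def expected_calibration_error(
    confidences: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned gap between mean positive-class confidence and positive rate."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, labels):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[index].append((conf, label))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_conf - positive_rate)
    return ece
```

A ranking produced by `rank_by_confidence` can be scored by looking up each ranked segment's gold label and passing the resulting label sequence to `average_precision`; a low `expected_calibration_error` on the same scores would suggest the confidences can be read as probabilities.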