We present Claim-Dissector, a novel latent variable model for fact-checking and analysis which, given a claim and a set of retrieved evidence, jointly learns to identify (i) the evidence relevant to the claim and (ii) the veracity of the claim. We propose to disentangle the per-evidence relevance probability and its contribution to the final veracity probability in an interpretable way: the final veracity probability is proportional to a linear ensemble of per-evidence relevance probabilities. In this way, the individual contribution of each piece of evidence to the final predicted probability can be identified. Within the per-evidence relevance probability, our model can further distinguish whether each relevant piece of evidence supports (S) or refutes (R) the claim. This makes it possible to quantify how much the S/R probability contributes to the final verdict and to detect disagreeing evidence. Despite its interpretable nature, our system achieves results on the FEVER dataset that are competitive with state-of-the-art two-stage pipelines, while using significantly fewer parameters. It also sets a new state of the art on the FAVIQ and RealFC datasets. Furthermore, our analysis shows that our model can learn fine-grained relevance cues while using coarse-grained supervision, and we demonstrate this in two ways. (i) We show that our model can achieve competitive sentence recall while using only paragraph-level relevance supervision. (ii) Moving to the finest granularity of relevance, we show that our model is capable of identifying relevance at the token level. To do this, we present a new benchmark, TLR-FEVER, focusing on token-level interpretability: humans annotate the tokens in relevant evidence that they considered essential when making their judgment. We then measure how similar these annotations are to the tokens our model focuses on.
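To illustrate the idea of a veracity probability that is proportional to a linear ensemble of per-evidence relevance probabilities, the following is a minimal sketch, not the model itself. It assumes uniform ensemble weights, only the two labels S and R, and a simple sum-to-one normalisation; the function name `veracity_from_evidence` and the example numbers are hypothetical and chosen for illustration only.

```python
from typing import Dict, List

LABELS = ("S", "R")  # supporting / refuting, as in the abstract


def veracity_from_evidence(relevance: List[Dict[str, float]]) -> Dict[str, object]:
    """Combine per-evidence S/R relevance probabilities into a verdict.

    relevance[i] maps "S" and "R" to the probability that evidence i
    supports or refutes the claim (illustrative inputs, not model outputs).
    """
    # Linear ensemble: an unweighted sum per label (assumption: uniform weights).
    totals = {c: sum(ev[c] for ev in relevance) for c in LABELS}
    norm = sum(totals.values())
    if norm == 0:
        raise ValueError("no evidence carries any S/R probability mass")

    # The verdict is proportional to the ensemble; here we normalise to sum to 1.
    verdict = {c: totals[c] / norm for c in LABELS}

    # Because the ensemble is linear, each evidence's share of the verdict
    # can be read off directly, and the shares add up to the verdict.
    contributions = [{c: ev[c] / norm for c in LABELS} for ev in relevance]
    return {"verdict": verdict, "contributions": contributions}


# Hypothetical example: one supporting, one refuting, one mostly irrelevant evidence.
example = [
    {"S": 0.80, "R": 0.10},
    {"S": 0.10, "R": 0.60},
    {"S": 0.05, "R": 0.05},
]
result = veracity_from_evidence(example)
print(result["verdict"])
print(result["contributions"])
```

In this sketch, disagreeing evidence shows up as one item with a large S share alongside another with a large R share, which is the kind of per-evidence reading the linear form is meant to allow.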