Pipeline and Dataset Generation for Automated Fact-checking in Almost Any Language

This article presents a pipeline for automated fact-checking that leverages publicly available language models and data. The objective is to assess the accuracy of textual claims using evidence from a ground-truth evidence corpus. The pipeline consists of two main modules: evidence retrieval and claim veracity evaluation. Our primary focus is on ease of deployment in languages that remain unexplored in the field of automated fact-checking. Unlike most similar pipelines, which work with evidence sentences, our pipeline processes data at the paragraph level, which simplifies the overall architecture and the data requirements. Given the high cost of annotating language-specific fact-checking training data, our solution builds on the Question Answering for Claim Generation (QACG) method, which we adapt and use to generate the data for all models of the pipeline. Our strategy enables the introduction of a new language through machine translation of only two fixed datasets of moderate size. Subsequently, any number of training samples can be generated from an evidence corpus in the target language. We provide open access to all data and fine-tuned models for the Czech, English, Polish, and Slovak pipelines, as well as to our codebase, which may be used to reproduce the results. We comprehensively evaluate the pipelines for all four languages, including human annotations and a per-sample difficulty assessment using Pointwise V-information. The presented experiments are based on full Wikipedia snapshots to promote reproducibility. To facilitate implementation and user interaction, we develop the FactSearch application, which features the proposed pipeline, and we report preliminary feedback on its performance.
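The two-module design described above (paragraph-level evidence retrieval followed by claim veracity evaluation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the authors' codebase: the names `Verdict`, `retrieve`, `evaluate_veracity`, and `check_claim` are hypothetical, the retriever is a toy token-overlap ranker, the classifier is a trivial rule, and the FEVER-style label strings are assumed. In the actual pipeline both stages are fine-tuned language models.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    label: str     # assumed FEVER-style label, e.g. "SUPPORTS" or "NOT ENOUGH INFO"
    evidence: str  # the paragraph the label is based on
    score: float   # retrieval score of that paragraph


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(claim: str, corpus: list[str], k: int = 1) -> list[tuple[float, str]]:
    """Toy stand-in for the retrieval module: rank whole paragraphs
    by Jaccard token overlap with the claim and return the top k."""
    claim_tokens = tokenize(claim)
    scored = []
    for paragraph in corpus:
        para_tokens = tokenize(paragraph)
        union = claim_tokens | para_tokens
        score = len(claim_tokens & para_tokens) / len(union) if union else 0.0
        scored.append((score, paragraph))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def evaluate_veracity(claim: str, paragraph: str) -> str:
    """Toy stand-in for the veracity module: a claim counts as supported
    only if every one of its tokens occurs in the paragraph.
    This rule never outputs a refuting label; a trained model would."""
    if tokenize(claim) <= tokenize(paragraph):
        return "SUPPORTS"
    return "NOT ENOUGH INFO"


def check_claim(
    claim: str,
    corpus: list[str],
    retriever: Callable[[str, list[str], int], list[tuple[float, str]]] = retrieve,
    classifier: Callable[[str, str], str] = evaluate_veracity,
) -> Verdict:
    """Run both stages: fetch the best paragraph, then classify the claim against it."""
    score, paragraph = retriever(claim, corpus, 1)[0]
    return Verdict(label=classifier(claim, paragraph), evidence=paragraph, score=score)


if __name__ == "__main__":
    corpus = [
        "Prague is the capital of the Czech Republic.",
        "Warsaw is the capital of Poland.",
    ]
    print(check_claim("Prague is the capital of the Czech Republic", corpus))
    print(check_claim("Bratislava is the capital of Slovakia", corpus))
```

Passing the retriever and the classifier as arguments keeps the two stages independent, so either toy function could be replaced by a real model without changing `check_claim`. Because each evidence unit is a whole paragraph, no sentence-selection step sits between the two stages.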

