Text classification methods have been widely investigated as a way to detect content of low credibility: fake news, social media bots, propaganda, etc. Quite accurate models (likely based on deep neural networks) help in moderating public electronic platforms and often cause content creators to face rejection of their submissions or removal of already published texts. Having an incentive to evade further detection, content creators try to come up with a slightly modified version of the text (known as an attack with an adversarial example) that exploits the weaknesses of classifiers and results in a different output. Here we introduce BODEGA: a benchmark for testing both victim models and attack methods on four misinformation detection tasks in an evaluation framework designed to simulate real use cases of content moderation. We also systematically test the robustness of popular text classifiers against available attack techniques and discover that, indeed, in some cases barely significant changes to the input text can mislead the models. We openly share the BODEGA code and data in the hope of enhancing the comparability and replicability of further research in this area.