Simplistic Collection and Labeling Practices Limit the Utility of Benchmark Datasets for Twitter Bot Detection

Accurate bot detection is necessary for the safety and integrity of online platforms. It is also crucial for research on the influence of bots in elections, the spread of misinformation, and financial market manipulation. Platforms deploy infrastructure to flag or remove automated accounts, but their tools and data are not publicly available. Thus, the public must rely on third-party bot detection. These tools employ machine learning and often achieve near-perfect classification performance on existing datasets, suggesting that bot detection is accurate, reliable, and fit for use in downstream applications. We provide evidence that this is not the case and show that high performance is attributable to limitations in dataset collection and labeling rather than to the sophistication of the tools. Specifically, we show that simple decision rules (shallow decision trees trained on a small number of features) achieve near-state-of-the-art performance on most available datasets, and that bot detection datasets, even when combined, do not generalize well to out-of-sample datasets. Our findings reveal that predictions depend heavily on each dataset's collection and labeling procedures rather than on fundamental differences between bots and humans. These results have important implications both for transparency in sampling and labeling procedures and for potential biases in research that uses existing bot detection tools for pre-processing.
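The claim about simple decision rules can be illustrated with a minimal sketch. This is not the authors' code or data: it fits a depth-1 decision tree (a single threshold on a single feature) in plain Python, and the feature names (`followers`, `tweets_per_day`) and both toy datasets are made up purely for illustration. The two toy datasets are constructed so that "bot" correlates with a different feature in each, mimicking datasets collected and labeled in different ways.

```python
from typing import List, Tuple

# One example: ([followers, tweets_per_day], label), label 1 = bot, 0 = human.
# Feature names and all values below are hypothetical toy data.
Row = Tuple[List[float], int]
Stump = Tuple[int, float, int]  # (feature index, threshold, label if value <= threshold)


def fit_stump(rows: List[Row]) -> Stump:
    """Exhaustively pick the single feature/threshold rule with the best training accuracy."""
    best_rule: Stump = (0, 0.0, 0)
    best_acc = -1.0
    for f in range(len(rows[0][0])):
        values = sorted({x[f] for x, _ in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            for below in (0, 1):
                acc = accuracy((f, t, below), rows)
                if acc > best_acc:
                    best_rule, best_acc = (f, t, below), acc
    return best_rule


def predict(stump: Stump, x: List[float]) -> int:
    f, t, below = stump
    return below if x[f] <= t else 1 - below


def accuracy(stump: Stump, rows: List[Row]) -> float:
    return sum(predict(stump, x) == y for x, y in rows) / len(rows)


# Toy dataset A: bots happen to be accounts with very few followers.
dataset_a: List[Row] = [
    ([5, 10], 1), ([8, 40], 1), ([12, 2], 1), ([3, 25], 1),
    ([300, 12], 0), ([450, 35], 0), ([800, 3], 0), ([150, 20], 0),
]
# Toy dataset B: bots happen to be high-volume accounts with many followers.
dataset_b: List[Row] = [
    ([900, 200], 1), ([1200, 180], 1), ([700, 220], 1), ([1000, 150], 1),
    ([40, 5], 0), ([60, 8], 0), ([500, 4], 0), ([20, 6], 0),
]

stump_a = fit_stump(dataset_a)
print("rule learned on A:", stump_a)               # (0, 81.0, 1)
print("accuracy on A:", accuracy(stump_a, dataset_a))  # 1.0
print("accuracy on B:", accuracy(stump_a, dataset_b))  # 0.125
```

On this toy data a one-threshold rule separates dataset A perfectly but transfers badly to dataset B, because it has learned how A was assembled rather than anything general about bots. The numbers carry no empirical weight; they only make the shape of the argument concrete.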

