Accurate bot detection is necessary for the safety and integrity of online platforms. It is also crucial for research on the influence of bots in elections, the spread of misinformation, and financial market manipulation. Platforms deploy infrastructure to flag or remove automated accounts, but their tools and data are not publicly available. Thus, the public must rely on third-party bot detection. These tools employ machine learning and often achieve near-perfect classification performance on existing datasets, suggesting that bot detection is accurate, reliable, and fit for use in downstream applications. We provide evidence that this is not the case and show that high performance is attributable to limitations in dataset collection and labeling rather than to the sophistication of the tools. Specifically, we show that simple decision rules -- shallow decision trees trained on a small number of features -- achieve near-state-of-the-art performance on most available datasets, and that bot detection datasets, even when combined, do not generalize well to out-of-sample datasets. Our findings reveal that predictions depend heavily on each dataset's collection and labeling procedures rather than on fundamental differences between bots and humans. These results have important implications both for transparency in sampling and labeling procedures and for potential biases in research that uses existing bot detection tools for pre-processing.
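To make the notion of a "simple decision rule" concrete, the following is a minimal sketch of a depth-1 decision tree (a decision stump) fitted on one dataset and then scored on another. It is an illustration only and not the authors' implementation. The two datasets are invented toy data, and the two features (follower count and posts per day) are hypothetical choices made for this example.

```python
from typing import List, Tuple

# One row: (feature vector, label), where label 1 = bot and 0 = human.
Row = Tuple[List[float], int]
# One rule: (feature index, threshold, label predicted when feature >= threshold).
Rule = Tuple[int, float, int]


def predict(rule: Rule, x: List[float]) -> int:
    """Apply a single-threshold rule to one feature vector."""
    j, thr, polarity = rule
    return polarity if x[j] >= thr else 1 - polarity


def accuracy(rule: Rule, rows: List[Row]) -> float:
    """Fraction of rows whose label the rule predicts correctly."""
    return sum(predict(rule, x) == y for x, y in rows) / len(rows)


def fit_stump(rows: List[Row]) -> Rule:
    """Exhaustively search for the single-threshold rule with the best training accuracy."""
    best_acc, best_rule = -1.0, (0, 0.0, 1)
    for j in range(len(rows[0][0])):
        for thr in sorted({x[j] for x, _ in rows}):
            for polarity in (1, 0):
                rule = (j, thr, polarity)
                acc = accuracy(rule, rows)
                if acc > best_acc:  # keep the first rule that reaches the best accuracy
                    best_acc, best_rule = acc, rule
    return best_rule


# Invented toy data. Features: [follower_count, posts_per_day] (hypothetical).
# In dataset A the bots happen to be high-volume posters.
dataset_a: List[Row] = [
    ([120, 2], 0), ([300, 5], 0), ([80, 1], 0),
    ([150, 90], 1), ([200, 120], 1), ([90, 75], 1),
]
# In dataset B the bots were (hypothetically) collected differently: few followers, ordinary posting volume.
dataset_b: List[Row] = [
    ([500, 3], 0), ([800, 4], 0), ([650, 2], 0),
    ([10, 3], 1), ([5, 1], 1), ([20, 4], 1),
]

rule = fit_stump(dataset_a)
acc_in = accuracy(rule, dataset_a)   # in-sample accuracy on dataset A
acc_out = accuracy(rule, dataset_b)  # out-of-sample accuracy on dataset B
print(rule, acc_in, acc_out)
```

On this toy data, the one-feature rule separates dataset A but carries no information about dataset B, because the feature that distinguishes bots in A reflects how A was assembled. This mirrors the kind of dataset dependence the abstract describes, under assumptions chosen purely for illustration.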