The development of stance detection and bot detection methods for social media users relies heavily on large-scale, high-quality benchmarks. However, existing benchmarks suffer from low annotation quality and generally contain incomplete user relationships, which hinders research on graph-based account detection. To address these issues, we propose the Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB is built on the largest raw dataset in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diverse relations. As user features, we use the 20 user property features with the greatest information gain together with features extracted from user tweets. In addition, we perform a thorough evaluation of MGTAB and other public datasets. Our experiments show that graph-based approaches are generally more effective than feature-based approaches and perform better when multiple relations are introduced. By analyzing the experimental results, we identify effective approaches for account detection and suggest potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
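The abstract mentions selecting the 20 user property features with the greatest information gain. As a rough illustration of that idea only, the sketch below ranks discrete features by information gain and keeps the top k. It is not the authors' pipeline: the function names, the feature names (`default_profile_image`, `verified`, `has_url`), and the toy labels are all hypothetical, and continuous properties would first need to be discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Return the names of the k features with the largest information gain."""
    gains = {name: information_gain(values, labels)
             for name, values in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]


# Toy data (hypothetical): four accounts, three binary property features.
labels = ["bot", "bot", "human", "human"]
columns = {
    "default_profile_image": [1, 1, 0, 0],  # separates the classes perfectly
    "verified": [0, 0, 0, 1],               # partially informative
    "has_url": [1, 0, 1, 0],                # carries no information here
}

selected = top_k_features(columns, labels, k=2)
print(selected)
```

With `k=20` and a table of real user properties, the same ranking step would yield a fixed-size property feature set of the kind described above.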