[Context] Researchers analyze underground forums to study abuse and cybercrime activities. Because the forums are large and identifying criminal discussions requires domain expertise, most approaches employ supervised machine learning techniques to automatically classify the posts of interest. [Goal] Human annotation is costly. How can we select the samples to annotate so that they account for the structure of the forum? [Method] We present a methodology to generate stratified samples based on the centrality properties of the population, and we evaluate the resulting classifier performance. [Result] We observe that a sample drawn from a uniform distribution of the post degree centrality metric maintains the same level of precision but significantly increases recall (+30%) compared to a sample whose distribution follows the population stratification. We also find that classifiers trained with similar samples disagree on the classification of criminal activities up to 33% of the time when deployed on the entire forum.
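To make the two sampling strategies compared above concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that a degree value has already been computed for every post, that strata are equal-width bins over that value, and that the function names (`assign_strata`, `proportional_sample`, `uniform_sample`) are hypothetical helpers introduced only for illustration.

```python
import random
from collections import defaultdict


def assign_strata(degrees, n_strata):
    """Group post ids into equal-width bins over their degree value.

    `degrees` maps post id -> degree. Equal-width binning is an assumption
    made for this sketch; other stratification rules are possible.
    """
    lo, hi = min(degrees.values()), max(degrees.values())
    width = (hi - lo) / n_strata or 1.0  # avoid zero width if all degrees match
    strata = defaultdict(list)
    for post_id, degree in degrees.items():
        idx = min(int((degree - lo) / width), n_strata - 1)
        strata[idx].append(post_id)
    return strata


def proportional_sample(strata, n, rng):
    """Draw about n posts so each stratum keeps its population share."""
    total = sum(len(posts) for posts in strata.values())
    sample = []
    for posts in strata.values():
        k = round(n * len(posts) / total)
        sample.extend(rng.sample(posts, min(k, len(posts))))
    return sample


def uniform_sample(strata, n, rng):
    """Draw about n posts with the same number taken from every stratum."""
    per_stratum = n // len(strata)
    sample = []
    for posts in strata.values():
        sample.extend(rng.sample(posts, min(per_stratum, len(posts))))
    return sample
```

On a skewed population where most posts have a low degree, the proportional sample is dominated by low-degree posts, while the uniform sample gives high-degree posts the same weight as every other stratum. Both functions may return slightly fewer or more than `n` posts because of rounding and small strata.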