Understanding the activities of Internet scanners is challenging: it often requires identifying relationships between sources, a task for which semantic annotations are scarce. This work investigates whether semantically meaningful pairwise relationships between sequences of network flow records can be estimated by contrastive learning, without pretraining and without annotations. To this end, we propose a transformer model that embeds minimally preprocessed sequences of network flow records, and we train it using contrastive learning. From the similarities obtained with this model, we state a correlation clustering problem and solve it locally. Experimentally, we show that learned similarities are higher on average for sequences originating from the same source than for sequences originating from different sources, and that this property generalizes to unseen sequences of unseen sources. Moreover, correlation clustering yields clusters consistent with scanner labels. The complete source code for the algorithms and for reproducing the experiments is publicly available.
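To make the clustering step concrete, the following is a minimal sketch of a local search for correlation clustering on a small similarity matrix. It is not the authors' implementation: the abstract does not specify which local algorithm is used, so a generic greedy node-move heuristic is swapped in here. The function name `greedy_move_clustering`, the toy similarity values, and the 0.5 threshold that turns similarities into signed weights are all assumptions made for illustration.

```python
def greedy_move_clustering(weights):
    """Greedy node-move local search for correlation clustering (illustrative).

    weights[i][j] > 0 rewards placing i and j in the same cluster,
    weights[i][j] < 0 penalizes it. The matrix is assumed to be symmetric.
    Returns one cluster label per node.
    """
    n = len(weights)
    labels = list(range(n))  # start with every node in its own cluster
    next_label = n           # fresh label for newly opened singleton clusters
    improved = True
    while improved:
        improved = False
        for i in range(n):
            # Total weight between node i and each cluster, excluding i itself.
            attraction = {}
            for j in range(n):
                if j != i:
                    attraction[labels[j]] = attraction.get(labels[j], 0.0) + weights[i][j]
            current = attraction.get(labels[i], 0.0)
            # Option None: isolate i in a new singleton cluster (value 0).
            best_label, best_value = None, 0.0
            for label, value in attraction.items():
                if value > best_value:
                    best_label, best_value = label, value
            # Move only if the sum of intra-cluster weights strictly increases.
            if best_value > current:
                if best_label is None:
                    labels[i] = next_label
                    next_label += 1
                else:
                    labels[i] = best_label
                improved = True
    return labels


# Toy similarities (made up for illustration): items 0-2 resemble each other,
# as do items 3-4. In the paper's setting these would come from the learned model.
similarities = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.3],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.2, 0.1, 0.3, 0.9, 1.0],
]
# Assumed conversion: similarities above 0.5 attract, those below repel.
weights = [[s - 0.5 for s in row] for row in similarities]
labels = greedy_move_clustering(weights)
print(labels)
```

On this toy input the first three items end up in one cluster and the last two in another. Because every accepted move strictly increases the total intra-cluster weight, the search stops at a local optimum, which need not be the global one.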