We propose a novel clustering pipeline to detect and characterize influence campaigns from documents. The approach clusters parts of documents, detects clusters that likely reflect an influence campaign, and then identifies documents linked to an influence campaign through their association with the high-influence clusters. Our approach outperforms both direct document-level classification and direct document-level clustering in predicting whether a document is part of an influence campaign. We propose several novel techniques to enhance the pipeline, including using an existing event factuality prediction system to obtain document parts and aggregating multiple clustering experiments to improve the performance of both cluster and document classification. Classifying documents on top of clustering not only accurately extracts the parts of documents that are relevant to influence campaigns, but also captures influence campaigns as a coordinated and holistic phenomenon. Our approach enables more fine-grained and interpretable characterizations of influence campaigns from documents.
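The three stages named above (cluster document parts, detect high-influence clusters, flag the associated documents) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the paper's implementation: naive sentence splitting stands in for the event factuality prediction system, greedy Jaccard clustering stands in for the clustering method, and a labeled-ratio rule stands in for the cluster classifier. The names `split_parts`, `cluster_parts`, `flag_documents`, `threshold`, and `min_ratio` are hypothetical.

```python
import re


def split_parts(text):
    """Split a document into parts (assumption: naive sentence splitting)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def jaccard(a, b):
    """Token-set overlap between two parts, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def cluster_parts(parts, threshold=0.5):
    """Greedy single-pass clustering: a part joins the first cluster whose
    seed part is similar enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of indices into `parts`
    for i, (_, text) in enumerate(parts):
        for members in clusters:
            if jaccard(text, parts[members[0]][1]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def flag_documents(docs, labeled, threshold=0.5, min_ratio=0.8):
    """Return ids of documents that have a part in a high-influence cluster.

    docs: {doc_id: text}; labeled: {doc_id: bool} for a labeled subset.
    Assumption: a cluster counts as high-influence when at least `min_ratio`
    of its labeled documents are known campaign documents.
    """
    parts = [(d, p) for d, text in docs.items() for p in split_parts(text)]
    flagged = set()
    for members in cluster_parts(parts, threshold):
        doc_ids = {parts[i][0] for i in members}
        known = [labeled[d] for d in doc_ids if d in labeled]
        if known and sum(known) / len(known) >= min_ratio:
            flagged |= doc_ids
    return flagged


# Toy data: "a" is a known campaign document, "c" is known to be benign,
# and "b" is unlabeled but shares a near-identical claim with "a".
docs = {
    "a": "The vaccine contains secret tracking chips. Weather was sunny today.",
    "b": "The vaccine contains secret tracking devices. I bought new shoes.",
    "c": "The city council approved the budget. Weather was sunny today.",
}
labeled = {"a": True, "c": False}
print(sorted(flag_documents(docs, labeled)))  # ['a', 'b']
```

In this toy run, the shared weather sentence links "a" and "c" but does not flag "c", because that cluster mixes campaign and benign documents. This is the sense in which working at the level of document parts, rather than whole documents, can isolate the campaign-relevant content.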