Fake detection in imbalance dataset by Semi-supervised learning with GAN

As social media continues to grow rapidly, harassment on these platforms has become more prevalent, which has drawn researchers' attention to fake account detection. Social media data often forms complex graphs with numerous nodes, which poses several challenges: the feature matrices contain a large number of irrelevant features, the data is highly dispersed, and the class distribution within the dataset is imbalanced. To overcome these challenges, our proposed method uses auto-encoders for feature extraction and combines semi-supervised learning with a GAN, an approach referred to as SGAN. By leveraging an unlabeled dataset, the unsupervised layer of SGAN compensates for the scarcity of labeled data and makes efficient use of the few labeled instances available. The dataset was divided into training and testing sets, with 100 labeled samples for training and 1,000 samples for testing, and the model was evaluated with multiple metrics, including the confusion matrix and the ROC curve. The novelty of our research lies in applying SGAN to the problem of imbalanced datasets in fake account detection. Because it relies on a small number of labeled instances and reduces the need for extensive computational power, our method offers a more efficient solution. Our method achieves 81% accuracy in detecting fake accounts using only 100 labeled samples, which demonstrates the potential of SGAN as a tool for handling minority classes and addressing big data challenges in fake account detection.
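
The abstract does not spell out how the SGAN discriminator combines labeled and unlabeled data, so the following is only a minimal sketch under one common assumption: the discriminator outputs K class logits (here K = 2, genuine and fake account), and the probability that an input is real data rather than generator output is taken as Z / (Z + 1) with Z the sum of the exponentiated logits. The function names are hypothetical and the formulation is not confirmed to be the one used in this work.

```python
import math

def log_sum_exp(logits):
    """Numerically stable log(sum(exp(l))) over the K class logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def supervised_loss(logits, label):
    """Cross-entropy over the K classes, used for labeled samples only."""
    return log_sum_exp(logits) - logits[label]

def prob_real(logits):
    """Assumed D(x) = Z / (Z + 1), where Z = sum(exp(logits))."""
    lse = log_sum_exp(logits)
    return math.exp(lse - softplus(lse))

def unsupervised_loss(logits, is_real):
    """-log D(x) for unlabeled real samples, -log(1 - D(x)) for generated ones."""
    lse = log_sum_exp(logits)
    return softplus(lse) - lse if is_real else softplus(lse)

if __name__ == "__main__":
    logits = [0.0, 0.0]  # hypothetical discriminator output for one account
    print(prob_real(logits))
    print(supervised_loss(logits, label=1))
    print(unsupervised_loss(logits, is_real=True))
    print(unsupervised_loss(logits, is_real=False))
```

In this sketch the labeled samples contribute only through `supervised_loss`, while unlabeled and generated samples contribute through `unsupervised_loss`, which is how a small labeled set can be supplemented by unlabeled data.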