Fake detection in imbalance dataset by Semi-supervised learning with GAN

As social media continues to grow rapidly, harassment on these platforms has also increased, which has drawn researchers' attention to fake account detection. Social media data often form complex graphs with many nodes, which poses several challenges: the matrices contain a large number of irrelevant features, the data are highly dispersed, and the class distribution in the dataset is imbalanced. To overcome these challenges, researchers have employed auto-encoders and a combination of semi-supervised learning with a GAN, referred to as SGAN. Our proposed method uses auto-encoders for feature extraction and incorporates SGAN. By leveraging an unlabeled dataset, the unsupervised layer of SGAN compensates for the scarcity of labeled data and makes efficient use of the few labeled instances available. We evaluated the method with several metrics, including the confusion matrix and the ROC curve. The dataset was divided into training and testing sets, with 100 labeled samples for training and 1,000 samples for testing. The novelty of our research lies in applying SGAN to the problem of imbalanced datasets in fake account detection. By making better use of a small number of labeled instances and reducing the need for extensive computational power, our method offers a more efficient solution. It achieves 81% accuracy in detecting fake accounts using only 100 labeled samples, which demonstrates the potential of SGAN for handling minority classes and addressing big data challenges in fake account detection.
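The abstract does not spell out the loss that SGAN optimizes, so the following is only a minimal sketch of one common semi-supervised GAN formulation, not necessarily the one used in this work. It assumes the discriminator outputs K class logits (here K = 2, genuine vs. fake account) and that the probability of an input being real data is Z / (Z + 1), where Z is the sum of the exponentiated logits. The function names, array shapes, and random inputs are all hypothetical and chosen for illustration.

```python
import numpy as np


def logsumexp(logits):
    """Numerically stable log(sum(exp(logits))) along the class axis."""
    m = logits.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def sgan_discriminator_loss(logits_labeled, labels, logits_unlabeled, logits_generated):
    """Return (supervised_loss, unsupervised_loss) for the discriminator.

    Assumption: D(x) = Z / (Z + 1) with Z = sum_k exp(logit_k), which equals
    sigmoid(logsumexp(logits)).
    """
    # Supervised part: ordinary cross-entropy on the few labeled samples.
    lse_lab = logsumexp(logits_labeled)
    picked = logits_labeled[np.arange(len(labels)), labels]
    supervised = np.mean(lse_lab - picked)

    # Unsupervised part: unlabeled real samples should score as real,
    # generator outputs should score as not real.
    lse_unl = logsumexp(logits_unlabeled)
    lse_gen = logsumexp(logits_generated)
    loss_real = np.mean(softplus(lse_unl) - lse_unl)  # -log D(x)
    loss_gen = np.mean(softplus(lse_gen))             # -log(1 - D(G(z)))
    return supervised, loss_real + loss_gen


# Hypothetical usage: 100 labeled samples (as in the abstract), plus
# arbitrary batches of unlabeled and generated samples.
rng = np.random.default_rng(0)
labeled_logits = rng.normal(size=(100, 2))
labels = rng.integers(0, 2, size=100)
unlabeled_logits = rng.normal(size=(256, 2))
generated_logits = rng.normal(size=(256, 2))
sup, unsup = sgan_discriminator_loss(
    labeled_logits, labels, unlabeled_logits, generated_logits
)
print(f"supervised loss: {sup:.4f}, unsupervised loss: {unsup:.4f}")
```

In this formulation the unlabeled data enter only through the unsupervised term, which is one way a model can benefit from many unlabeled accounts when only a small labeled set is available.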
