Fake detection in imbalance dataset by Semi-supervised learning with GAN

As social media continues to grow rapidly, harassment on these platforms has also become more prevalent, which has drawn growing research interest in fake account detection. Social media data often form complex graphs with numerous nodes, which poses several challenges: the data matrices contain a large number of irrelevant features, the data are highly dispersed, and the class distribution within the dataset is imbalanced. To address these challenges, researchers have employed auto-encoders and SGAN, a combination of semi-supervised learning with a GAN. Our proposed method uses auto-encoders for feature extraction and SGAN for classification. By leveraging an unlabeled dataset, the unsupervised layer of SGAN compensates for the limited availability of labeled data and makes efficient use of the few labeled instances. The dataset was divided into training and testing sets, with 100 labeled samples for training and 1,000 samples for testing, and the method was evaluated with multiple metrics, including the confusion matrix and the ROC curve. The novelty of our research lies in applying SGAN to the problem of imbalanced datasets in fake account detection. Because it needs only a small number of labeled instances and less computational power, our method offers a more efficient solution. It achieves 81% accuracy in detecting fake accounts using only 100 labeled samples, which demonstrates the potential of SGAN as a tool for handling minority classes and addressing big data challenges in fake account detection.
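To illustrate how an SGAN discriminator can learn from both labeled and unlabeled samples, the following is a minimal sketch of one widely used formulation (Salimans et al., 2016), in which a single set of class logits feeds both a supervised classification loss and an unsupervised real-versus-generated loss. The abstract does not state which SGAN variant was used, so this formulation, the function names, and the two-class toy logits are assumptions for illustration only, not the authors' implementation.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def class_probs(logits):
    """Supervised head: softmax over the K real classes (e.g. genuine, fake account)."""
    lse = log_sum_exp(logits)
    return [math.exp(v - lse) for v in logits]


def prob_real(logits):
    """Unsupervised head: D(x) = Z / (Z + 1) with Z = sum(exp(logits)).

    This equals sigmoid(log_sum_exp(logits)), so no extra output unit is needed.
    """
    return 1.0 / (1.0 + math.exp(-log_sum_exp(logits)))


def supervised_loss(logits, label):
    """Cross-entropy on a labeled sample."""
    return log_sum_exp(logits) - logits[label]


def unsupervised_loss(real_logits, generated_logits):
    """-log D(x) for an unlabeled real sample, plus -log(1 - D(G(z))) for a generated one."""
    return softplus(-log_sum_exp(real_logits)) + softplus(log_sum_exp(generated_logits))


# Hypothetical logits for K = 2 classes.
labeled_logits = [0.0, 0.0]
total = supervised_loss(labeled_logits, label=1) + unsupervised_loss([0.0, 0.0], [0.0, 0.0])
print(class_probs(labeled_logits), prob_real(labeled_logits), total)
```

In this formulation the unlabeled samples contribute only through the unsupervised term, which is what allows a small labeled set (such as the 100 samples mentioned above) to be supplemented by unlabeled data.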
