In this paper, we present a novel method for detecting fake and Large Language Model (LLM)-generated profiles in the LinkedIn Online Social Network immediately upon registration and before connections are established. Early identification of fake profiles is crucial to maintaining the platform's integrity, since it prevents imposters from acquiring the private and sensitive information of legitimate users and from building credibility for future phishing and scamming activities. This work uses the textual information provided in LinkedIn profiles and introduces the Section and Subsection Tag Embedding (SSTE) method to enhance the discriminative characteristics of these data for distinguishing legitimate profiles from those created by imposters, whether manually or with an LLM. Additionally, the lack of a large, publicly available LinkedIn dataset motivated us to collect 3600 LinkedIn profiles for our research. We will release our dataset publicly for research purposes. This is, to the best of our knowledge, the first large publicly available LinkedIn dataset for fake LinkedIn account detection. Within our framework, we assess static and contextualized word embeddings, including GloVe, Flair, BERT, and RoBERTa. We show that the proposed method can distinguish between legitimate and fake profiles with an accuracy of about 95% across all word embeddings. In addition, we show that SSTE achieves promising accuracy in identifying LLM-generated profiles even though no LLM-generated profiles were used during training, and that it reaches an accuracy of approximately 90% when only 20 LLM-generated profiles are added to the training set. This finding is significant because the proliferation of LLMs in the near future will make it extremely challenging to design a single system that can identify profiles created with various LLMs.