With the rise of Web 2.0 platforms such as online social media, people's private information, such as their location, occupation and even family details, is often inadvertently disclosed through online discussions. It is therefore important to detect such unwanted privacy disclosures so that both the affected people and the online platform can be alerted. In this paper, privacy disclosure detection is modeled as a multi-label text classification (MLTC) problem, and a new model is proposed for constructing an MLTC classifier that detects online privacy disclosures. The classifier takes an online post as input and outputs multiple labels, each reflecting a possible privacy disclosure. The proposed representation method combines three sources of information: the input text itself, the label-to-text correlation and the label-to-label correlation. A double-attention mechanism combines the first two sources, and a graph convolutional network (GCN) extracts the third, which is then used to help fuse the features extracted from the first two. Extensive experiments on a public dataset of privacy-disclosing posts on Twitter demonstrate that the proposed method significantly and consistently outperforms other state-of-the-art methods on all key performance indicators.
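As a rough illustration of how the three information sources could interact, the sketch below propagates label embeddings over a label co-occurrence graph with one standard GCN layer, lets each label attend over the token vectors of a post, and scores every label with an independent sigmoid. It is not the authors' architecture: the label names, the toy embeddings, the identity weight matrix, the co-occurrence graph and the 0.5 threshold are all assumptions made for the example, and because nothing is trained the resulting scores carry no meaning.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 used by a standard GCN layer."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(adj_norm, feats, weight):
    """One GCN layer: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def label_attention(label_vecs, token_vecs):
    """For each label, attend over the tokens and return a label-specific text vector."""
    dim = len(token_vecs[0])
    result = []
    for lv in label_vecs:
        scores = [sum(a * b for a, b in zip(lv, tv)) for tv in token_vecs]
        weights = softmax(scores)
        result.append([sum(w * tv[d] for w, tv in zip(weights, token_vecs)) for d in range(dim)])
    return result

def score_labels(label_vecs, text_vecs):
    """Independent sigmoid per label, as is usual for multi-label classification."""
    probs = []
    for lv, tv in zip(label_vecs, text_vecs):
        logit = sum(a * b for a, b in zip(lv, tv))
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs

# Hypothetical toy inputs (assumptions, not taken from the paper).
labels = ["location", "occupation", "family"]
co_occurrence = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]    # label-to-label graph
label_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # initial label embeddings
weight = [[1.0, 0.0], [0.0, 1.0]]                    # untrained identity weights
token_vecs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]    # token vectors of one post

adj_norm = normalize_adjacency(co_occurrence)
label_feats = gcn_layer(adj_norm, label_emb, weight)     # label-to-label information
text_per_label = label_attention(label_feats, token_vecs)  # label-to-text information
probs = score_labels(label_feats, text_per_label)
predicted = [name for name, p in zip(labels, probs) if p >= 0.5]
```

Scoring each label with its own sigmoid, rather than a softmax over all labels, is what allows a single post to be flagged for several kinds of disclosure at once.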