With the rise of Web 2.0 platforms such as online social media, people's private information, including their location, occupation and even family details, is often inadvertently disclosed through online discussions. It is therefore important to detect such unwanted privacy disclosures so that both the affected people and the online platform can be alerted. In this paper, privacy disclosure detection is modeled as a multi-label text classification (MLTC) problem, and a new model is proposed for constructing an MLTC classifier that detects online privacy disclosures. The classifier takes an online post as input and outputs multiple labels, each reflecting a possible privacy disclosure. The proposed representation method combines three sources of information: the input text itself, the label-to-text correlation and the label-to-label correlation. A double-attention mechanism combines the first two sources, and a graph convolutional network (GCN) extracts the third, which is then used to help fuse the features extracted from the first two. Extensive experiments on a public dataset of privacy-disclosing Twitter posts show that the proposed method significantly and consistently outperforms other state-of-the-art methods on all key performance indicators.
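To make the three information sources concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a standard GCN propagation rule with symmetric adjacency normalization for the label-to-label graph, a simple dot-product attention for the label-to-text correlation, and an independent sigmoid per label for the multi-label output. The self-attention branch of the double-attention mechanism and the fusion step are omitted. All function names, the toy label set, the embeddings and the 0.5 threshold are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2, as in a standard GCN layer."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, features, weight):
    """One propagation step over the label graph: ReLU(A_norm @ H @ W)."""
    h = matmul(matmul(normalize_adjacency(adj), features), weight)
    return [[max(0.0, v) for v in row] for row in h]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def label_attention(label_vecs, token_vecs):
    """For each label, attend over the tokens and return a label-specific text vector."""
    dim = len(token_vecs[0])
    out = []
    for lv in label_vecs:
        scores = [sum(a * b for a, b in zip(lv, tv)) for tv in token_vecs]
        weights = softmax(scores)
        out.append([sum(w * tv[d] for w, tv in zip(weights, token_vecs)) for d in range(dim)])
    return out


def predict_labels(label_text_vecs, label_vecs, threshold=0.5):
    """Score each label independently with a sigmoid; keep labels at or above the threshold."""
    probs = []
    for tv, lv in zip(label_text_vecs, label_vecs):
        logit = sum(a * b for a, b in zip(tv, lv))
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return [i for i, p in enumerate(probs) if p >= threshold], probs


# Toy example (all values hypothetical): three labels, e.g. location, occupation, family.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]            # label co-occurrence graph
label_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # initial label embeddings
weight = [[1.0, 0.0], [0.0, 1.0]]                  # GCN weight matrix (identity here)
tokens = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]      # token embeddings of one post

labels = gcn_layer(adj, label_emb, weight)          # label-to-label information
per_label_text = label_attention(labels, tokens)    # label-to-text information
predicted, probs = predict_labels(per_label_text, labels)
print(predicted, probs)
```

Because each label receives its own sigmoid score, a single post can trigger several labels at once, which is what distinguishes the multi-label setting from ordinary single-label classification.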