Synthetic data is increasingly used to support research without exposing sensitive user content. Social media data is among the dataset types that would benefit most from representative synthetic equivalents, which could bootstrap research and enable reproducibility through data sharing. However, recent studies show that (tabular) synthetic data is not inherently privacy-preserving, and much less is known about the privacy risks of synthetically generated unstructured text. This work evaluates the privacy of synthetic Instagram posts generated by three state-of-the-art large language models under two prompting strategies. We propose a methodology that quantifies privacy risk by framing re-identification as an authorship attribution attack. A RoBERTa-large classifier trained on real posts achieved 81\% authorship attribution accuracy on real data but only 16.5--29.7\% on synthetic posts, indicating reduced, though non-negligible, risk. Fidelity was assessed via textual traits, sentiment, topic overlap, and embedding similarity, confirming the expected trade-off: higher fidelity coincides with greater privacy leakage. This work provides a framework for evaluating privacy in synthetic text and demonstrates the privacy--fidelity tension in social media datasets.
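The core idea of the attack can be illustrated with a minimal sketch. This is not the paper's implementation: a simple character-trigram profile per author stands in for the fine-tuned RoBERTa-large classifier, and all posts, author names, and the synthetic examples are invented toy data. The sketch only shows the framing: train an attribution model on real labeled posts, then measure how often it re-identifies the source author of synthetic posts.

```python
# Hypothetical sketch of re-identification framed as authorship attribution.
# A character-trigram overlap score replaces the RoBERTa-large classifier
# used in the paper; all posts and authors below are invented toy data.
from collections import Counter

def trigrams(text):
    """Character trigrams of a post (a crude stylistic fingerprint)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def build_profiles(labeled_posts):
    """Aggregate trigram counts per author from real posts."""
    profiles = {}
    for text, author in labeled_posts:
        profiles.setdefault(author, Counter()).update(trigrams(text))
    return profiles

def attribute(text, profiles):
    """Predict the author whose profile overlaps most with the post."""
    grams = trigrams(text)
    def overlap(profile):
        return sum(min(c, profile[g]) for g, c in grams.items())
    return max(profiles, key=lambda a: overlap(profiles[a]))

# Real posts with known authors (toy stand-ins for Instagram captions).
real_posts = [
    ("sunset vibes at the beach again", "alice"),
    ("another beach sunset, pure bliss", "alice"),
    ("new personal record at the gym", "bob"),
    ("leg day done, gym life forever", "bob"),
]
# Synthetic posts, each generated from one author's real posts.
synthetic_posts = [
    ("golden sunset by the beach tonight", "alice"),
    ("early gym session, feeling strong", "bob"),
]

profiles = build_profiles(real_posts)
hits = sum(attribute(t, profiles) == a for t, a in synthetic_posts)
print(f"re-identification accuracy: {hits / len(synthetic_posts):.2f}")
```

In this framing, lower attack accuracy on synthetic posts corresponds to better privacy, while the fidelity metrics mentioned above (textual traits, sentiment, topic overlap, embedding similarity) pull in the opposite direction: the more faithfully the synthetic posts mimic an author's style, the easier the attribution becomes.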