Synthetic data is increasingly used to support research without exposing sensitive user content. Social media datasets would benefit greatly from representative synthetic equivalents, which could bootstrap research and enable reproducibility through data sharing. However, recent studies show that (tabular) synthetic data is not inherently privacy-preserving, and much less is known about the privacy risks of synthetically generated unstructured text. This work evaluates the privacy of synthetic Instagram posts generated by three state-of-the-art large language models under two prompting strategies. We propose a methodology that quantifies privacy by framing re-identification as an authorship attribution attack. A RoBERTa-large classifier trained on real posts achieved 81\% authorship-attribution accuracy on real data, but only 16.5--29.7\% on synthetic posts, indicating reduced, though non-negligible, risk. Fidelity was assessed via text traits, sentiment, topic overlap, and embedding similarity, confirming the expected trade-off: higher fidelity coincides with greater privacy leakage. This work provides a framework for evaluating privacy in synthetic text and demonstrates the privacy--fidelity tension in social media datasets.
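The attack framing above can be sketched in code. The snippet below is a minimal, hypothetical illustration of re-identification as authorship attribution: it substitutes a lightweight TF-IDF + logistic-regression classifier for the paper's fine-tuned RoBERTa-large, and uses invented toy posts. Attribution accuracy on the held-out "synthetic" posts serves as the empirical re-identification risk score.

```python
# Sketch: re-identification framed as authorship attribution.
# Stand-in classifier (TF-IDF + logistic regression) in place of the
# paper's fine-tuned RoBERTa-large; all posts below are invented toys.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy "real" posts, labeled by author.
train_posts = [
    "sunset vibes at the beach tonight", "beach run before the sunset",
    "new ramen spot downtown is amazing", "best ramen broth I ever had",
]
train_authors = ["a", "a", "b", "b"]

# Toy "synthetic" posts whose true authors the attacker tries to recover.
synth_posts = ["another beach sunset picture", "trying a downtown ramen place"]
synth_authors = ["a", "b"]

# Train the attribution attack on real posts.
attack = make_pipeline(TfidfVectorizer(), LogisticRegression())
attack.fit(train_posts, train_authors)

# Attribution accuracy on synthetic posts = empirical re-identification risk.
risk = attack.score(synth_posts, synth_authors)
print(f"re-identification accuracy: {risk:.2f}")
```

In the paper's setting the same accuracy metric is computed once on real held-out posts (81\%) and once on each model's synthetic posts (16.5--29.7\%); the gap between the two quantifies how much the generation step reduces, without eliminating, re-identification risk.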