Large Language Models (LLMs) raise concerns because they lower the cost of generating text that could be used for unethical or illegal purposes, especially on social media. This paper investigates the promise of such models to help enforce legal requirements on the disclosure of sponsored content online. We study the use of LLMs for generating synthetic Instagram captions with two objectives. The first objective (fidelity) is to produce realistic synthetic datasets; to this end, we implement content-level and network-level metrics to assess whether synthetic captions are realistic. The second objective (utility) is to create synthetic data that is useful for sponsored content detection; to this end, we evaluate how effective the generated synthetic data is for training classifiers to identify undisclosed advertisements on Instagram. Our investigations show that the objectives of fidelity and utility may conflict and that prompt engineering is a useful but insufficient strategy. Additionally, we find that while individual synthetic posts may appear realistic, collectively they lack diversity, topic connectivity, and realistic user interaction patterns.