Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics. However, access to these datasets is often restricted due to costs and platform regulations. As such, acquiring datasets that span multiple platforms, which is crucial for a comprehensive understanding of the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real datasets. We employ ChatGPT to generate synthetic data from two real datasets, each consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical findings suggest that using large language models to generate synthetic multi-platform social media data is promising, but further enhancements are necessary to improve the fidelity of the outputs.
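The abstract mentions comparing the lexical properties of synthetic and real posts. As a minimal illustrative sketch (not the paper's actual methodology), one simple lexical comparison is the Jaccard overlap between the vocabularies of the two corpora; the example posts below are hypothetical, not drawn from the paper's datasets.

```python
# Hedged sketch: comparing the lexical properties of real vs. synthetic
# posts via Jaccard overlap of their vocabularies. This is an illustrative
# metric, not necessarily the one used in the paper.
import re


def tokenize(text):
    """Lowercase a post and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def vocab_jaccard(real_posts, synthetic_posts):
    """Jaccard similarity (0..1) between the vocabularies of two corpora."""
    real_vocab = {t for p in real_posts for t in tokenize(p)}
    synth_vocab = {t for p in synthetic_posts for t in tokenize(p)}
    if not real_vocab and not synth_vocab:
        return 1.0
    return len(real_vocab & synth_vocab) / len(real_vocab | synth_vocab)


# Hypothetical example posts for illustration only.
real = [
    "Breaking news: the election results are in!",
    "I can't believe this hate speech is still online.",
]
synthetic = [
    "Breaking: election results just came in!",
    "Hate speech like this should not stay online.",
]

print(round(vocab_jaccard(real, synthetic), 3))  # → 0.364
```

A fuller comparison would pair a lexical measure like this with a semantic one, e.g. cosine similarity between sentence embeddings of real and synthetic posts.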