We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by using large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow, taking into account context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering multiple CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying along several dimensions: conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further examine its utility by evaluating it both as standalone training data and as an augmentation source for CB classification.