Large language models (LLMs) can strongly shape social discourse, yet datasets for investigating how LLM outputs vary under controlled social and contextual prompting remain scarce. Cognitive Digital Shadows (CDS) is a 190,000-record synthetic corpus supporting analyses of LLM-generated discourse. Each CDS record is generated by one of 19 LLMs, prompted to shadow either a human persona or an AI-assistant role. CDS contains LLM responses on four controversial societal topics: vaccines/healthcare, social media disinformation, the gender gap in science, and STEM stereotypes. Persona-conditioned records encode 17 sociodemographic and psychological attributes, providing data that link LLMs' prompts, language, stances and reasoning. Texts are validated for topic anchoring and can support emotional analyses via interpretable NLP (e.g., textual forma mentis networks). CDS is enriched by a pooling platform with user-friendly dashboards that enable interactive group-level comparisons of emotional and semantic framing across personas, topics and models. The CDS prompting framework supports future audits of LLMs' bias, social sensitivity and alignment.