While academic research typically treats large language models (LLMs) as generic text generators, they are distinct commercial products with unique interfaces and capabilities that fundamentally shape user behavior. Current datasets obscure this reality by collecting text-only data through uniform interfaces that fail to capture authentic chatbot usage. To address this limitation, we present ShareChat, a large-scale corpus of 142,808 conversations (660,293 turns) sourced directly from publicly shared URLs on ChatGPT, Perplexity, Grok, Gemini, and Claude. ShareChat distinguishes itself by preserving native platform affordances, such as citations and thinking traces, across a diverse collection that covers 101 languages and spans April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. To illustrate the dataset's breadth, we present three case studies: a completeness analysis of intent satisfaction, a citation study of model grounding, and a temporal analysis of engagement rhythms. This work provides the community with a timely resource for understanding authentic user-chatbot interactions in the wild. The dataset is publicly available via Hugging Face.