ShareChat: A Dataset of Chatbot Conversations in the Wild

While academic research typically treats Large Language Models (LLMs) as generic text generators, they are distinct commercial products with unique interfaces and capabilities that fundamentally shape user behavior. Current datasets obscure this reality by collecting text-only data through uniform interfaces that fail to capture authentic chatbot usage. To address this limitation, we present ShareChat, a large-scale corpus of 142,808 conversations (660,293 turns) sourced directly from publicly shared URLs on ChatGPT, Perplexity, Grok, Gemini, and Claude. ShareChat preserves native platform affordances, such as citations and thinking traces, across a diverse collection that covers 101 languages and spans April 2023 to October 2025. ShareChat also offers substantially longer context windows and greater interaction depth than prior datasets. To illustrate the dataset's breadth, we present three case studies: a completeness analysis of intent satisfaction, a citation study of model grounding, and a temporal analysis of engagement rhythms. This work provides the community with a timely resource for understanding authentic user-LLM chatbot interactions in the wild. The dataset is publicly available via Hugging Face.


