Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite their widespread use, there remains a lack of public datasets showcasing how these tools are used by a population of users in practice. To bridge this gap, we offered online users free access to ChatGPT in exchange for their affirmative, consensual opt-in to the anonymous collection of their chat transcripts and request headers. From this, we compiled WildChat, a corpus of 1 million user-ChatGPT conversations consisting of over 2.5 million interaction turns. We compare WildChat with other popular user-chatbot interaction datasets and find that our dataset offers the most diverse user prompts, contains the largest number of languages, and presents the richest variety of potentially toxic use cases for researchers to study. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses, alongside request headers. This augmentation allows for more detailed analysis of user behaviors across different geographical regions and temporal dimensions. Finally, because the dataset captures a broad range of use cases, we demonstrate its potential utility in fine-tuning instruction-following models. WildChat is released at https://wildchat.allen.ai under AI2 ImpACT Licenses.