Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important given their widespread use across applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. The dataset was collected in the wild from 210K unique IP addresses on our Vicuna demo and Chatbot Arena website. We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities. The dataset is publicly available at https://huggingface.co/datasets/lmsys/lmsys-chat-1m.