Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. The dataset was collected in the wild from 210K unique IP addresses through our Vicuna demo and Chatbot Arena website. We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities. The dataset is publicly available at https://huggingface.co/datasets/lmsys/lmsys-chat-1m.
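As a rough illustration of the kind of basic statistics mentioned above, the following Python sketch counts conversations per model and averages the number of turns. The field names `model` and `turn` are assumptions about the record layout, the helper `summarize` is hypothetical, and the three toy records are invented for illustration rather than taken from the dataset.

```python
from collections import Counter
from statistics import mean


def summarize(records):
    """Return conversation counts per model and the mean number of turns.

    Assumes each record is a mapping with a "model" string and an
    integer "turn" count (an assumed layout, not confirmed by the text).
    """
    per_model = Counter(r["model"] for r in records)
    avg_turns = mean(r["turn"] for r in records)
    return per_model, avg_turns


# Toy records invented for illustration; they are NOT real dataset rows.
toy = [
    {"model": "vicuna-13b", "turn": 1},
    {"model": "vicuna-13b", "turn": 3},
    {"model": "koala-13b", "turn": 2},
]

per_model, avg_turns = summarize(toy)
print(per_model.most_common(), avg_turns)

# To try this on the published data (assumes the `datasets` package,
# network access, and that any access terms on the dataset page have
# been accepted):
#   from datasets import load_dataset
#   ds = load_dataset("lmsys/lmsys-chat-1m", split="train")
#   per_model, avg_turns = summarize(ds)
```

The function makes two passes over its input, so for the full dataset a single loop that updates both statistics at once may be preferable.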