LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

Lianmin Zheng,Wei-Lin Chiang,Ying Sheng,Tianle Li,Siyuan Zhuang,Zhanghao Wu,Yonghao Zhuang,Zhuohan Li,Zi Lin,Eric. P Xing,Joseph E. Gonzalez,Ion Stoica,Hao Zhang

Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. This dataset is collected from 210K unique IP addresses in the wild on our Vicuna demo and Chatbot Arena website. We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities. The dataset is publicly available at https://huggingface.co/datasets/lmsys/lmsys-chat-1m.

翻译：研究人们在真实场景中如何与大规模语言模型（LLM）交互，因其在各应用领域的广泛使用而日益重要。本文提出LMSYS-Chat-1M——一个包含百万级真实对话的大规模数据集，涵盖与25个先进LLM的交互。该数据集通过我们的Vicuna演示和Chatbot Arena网站从野外环境收集，涉及21万个独立IP地址。我们概述了数据集内容，包括其整理流程、基本统计信息和主题分布，突出其多样性、原创性和规模优势。通过四个应用案例展示其多用途性：开发性能与GPT-4相当的内容审核模型、构建安全基准测试、训练性能与Vicuna相当的指令遵循模型，以及生成高难度基准测试问题。我们相信该数据集将成为理解和提升LLM能力的宝贵资源。数据集已公开于https://huggingface.co/datasets/lmsys/lmsys-chat-1m。

相关内容

数据集

关注 88

数据集，又称为资料集、数据集合或资料集合，是一种由数据所组成的集合。
Data set（或dataset）是一个数据的集合，通常以表格形式出现。每一列代表一个特定变量。每一行都对应于某一成员的数据集的问题。它列出的价值观为每一个变量，如身高和体重的一个物体或价值的随机数。每个数值被称为数据资料。对应于行数，该数据集的数据可能包括一个或多个成员。

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

专知会员服务

55+阅读 · 2020年3月8日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日