LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

Lianmin Zheng,Wei-Lin Chiang,Ying Sheng,Tianle Li,Siyuan Zhuang,Zhanghao Wu,Yonghao Zhuang,Zhuohan Li,Zi Lin,Eric P. Xing,Joseph E. Gonzalez,Ion Stoica,Hao Zhang

Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. This dataset is collected from 210K unique IP addresses in the wild on our Vicuna demo and Chatbot Arena website. We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities. The dataset is publicly available at https://huggingface.co/datasets/lmsys/lmsys-chat-1m.

翻译：研究人们在真实场景中与大型语言模型（LLMs）的交互方式因其在各类应用中的广泛使用而日益重要。本文提出LMSYS-Chat-1M，一个包含百万级与25个最先进LLMs真实对话的大规模数据集。该数据集从我们的Vicuna演示和Chatbot Arena网站的21万个独立IP地址收集而来。我们概述了数据集的内容，包括其整理过程、基本统计和主题分布，凸显其多样性、原创性和规模性。通过四个应用案例展示其多功能性：开发表现与GPT-4相当的内容审核模型、构建安全基准、训练表现与Vicuna相似的指令跟随模型，以及创建具有挑战性的基准测试问题。我们相信该数据集将成为理解和提升LLM能力的重要资源。数据集公开于https://huggingface.co/datasets/lmsys/lmsys-chat-1m。

相关内容

数据集

关注 88

数据集，又称为资料集、数据集合或资料集合，是一种由数据所组成的集合。
Data set（或dataset）是一个数据的集合，通常以表格形式出现。每一列代表一个特定变量。每一行都对应于某一成员的数据集的问题。它列出的价值观为每一个变量，如身高和体重的一个物体或价值的随机数。每个数值被称为数据资料。对应于行数，该数据集的数据可能包括一个或多个成员。

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

专知会员服务

55+阅读 · 2020年3月8日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日