Many of the recent breakthroughs in language modeling have resulted from scaling what is effectively the same model architecture to larger datasets. In this vein, recent work has highlighted performance gains from increasing training dataset size and quality, suggesting a need for novel sources of large-scale data. In this work, we introduce BeanCounter, a public dataset consisting of more than 159B tokens extracted from businesses' disclosures. We show that this data is indeed novel: less than 0.1% of BeanCounter appears in Common Crawl-based datasets, and it is an order of magnitude larger than datasets relying on similar sources. Given the data's provenance, we hypothesize that BeanCounter is comparatively more factual and less toxic than web-based datasets. Exploring this hypothesis, we find that many demographic identities occur with similar prevalence in BeanCounter but in significantly less toxic contexts relative to other datasets. To demonstrate the utility of BeanCounter, we continually pre-train two LLMs on it and compare them with their base models. We find an 18-33% reduction in toxic generation and improved performance within the finance domain for the continually pre-trained models. Collectively, our work suggests that BeanCounter is a novel source of low-toxicity, high-quality, domain-specific data with sufficient scale to train multi-billion-parameter LLMs.