Many of the recent breakthroughs in language modeling have resulted from scaling what is effectively the same model architecture to larger datasets. In this vein, recent work has highlighted performance gains from increasing training dataset size and quality, suggesting a need for novel sources of large-scale data. In this work, we introduce BeanCounter, a public dataset consisting of more than 159B tokens extracted from businesses' disclosures. We show that this data is indeed novel: less than 0.1% of BeanCounter appears in Common Crawl-based datasets, and it is an order of magnitude larger than datasets relying on similar sources. Given the data's provenance, we hypothesize that BeanCounter is comparatively more factual and less toxic than web-based datasets. Exploring this hypothesis, we find that many demographic identities occur with similar prevalence in BeanCounter but in significantly less toxic contexts relative to other datasets. To demonstrate the utility of BeanCounter, we continually pre-train two LLMs on it and compare them with their base models. We find an 18-33% reduction in toxic generation and improved performance within the finance domain for the continually pre-trained models. Collectively, our work suggests that BeanCounter is a novel source of low-toxicity, high-quality, domain-specific data with sufficient scale to train multi-billion-parameter LLMs.