Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. By training a series of models ranging from 122M to 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to $4.45\%$ over the previous state-of-the-art. The models are available at https://huggingface.co/ikim-uk-essen.