Large language models have demonstrated remarkable potential in various tasks; however, open-source models and data for specific domains remain scarce. Previous works have primarily focused on manually specifying resources and collecting high-quality data for specific domains, which consumes significant time and effort. To address this limitation, we propose an efficient data collection method, \textit{Query of CC}, based on large language models. This method bootstraps seed information through a large language model and retrieves related data from public corpora. It not only collects knowledge-related data for specific domains but also unearths data with potential reasoning procedures. Using this method, we have curated a high-quality dataset called~\textsc{Knowledge Pile}, encompassing four major domains, including STEM and the humanities, among others. Experimental results demonstrate that~\textsc{Knowledge Pile} significantly improves the performance of large language models on tests of mathematical and knowledge-related reasoning ability. To facilitate academic sharing, we open-source our dataset and code.