Large language models have demonstrated remarkable potential in various tasks; however, open-source models and data for specific domains remain scarce. Previous works have primarily relied on manually specifying resources and collecting high-quality data for specific domains, which consumes significant time and effort. To address this limitation, we propose $\textit{Query of CC}$, an efficient data collection method based on large language models. This method bootstraps seed information through a large language model and retrieves related data from public corpora. It not only collects knowledge-related data for specific domains but also unearths data with potential reasoning procedures. Using this method, we have curated a high-quality dataset called KNOWLEDGE PILE, which encompasses four major domains, including STEM and the humanities. Experimental results demonstrate that KNOWLEDGE PILE significantly improves the performance of large language models on tests of mathematical and knowledge-related reasoning ability. To facilitate academic sharing, we open-source our dataset and code to support the academic community.
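The bootstrap-then-retrieve idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: `toy_generate` is a hypothetical stand-in for the large language model call, the three-document `corpus` stands in for a public corpus, and the TF-IDF-style overlap score is a simple substitute for whatever retriever the method actually uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_seeds(seeds, generate):
    """Bootstrap queries: keep each seed and add the queries `generate` returns.

    `generate` is a hypothetical callable standing in for an LLM.
    """
    queries = []
    for seed in seeds:
        queries.append(seed)
        queries.extend(generate(seed))
    return queries


def retrieve(queries, corpus, top_k=2):
    """Return sorted indices of documents ranked in the top_k for any query.

    Scoring is a simple TF-IDF-weighted term overlap (an assumed stand-in
    retriever); documents with a zero score are never selected.
    """
    docs = [Counter(tokenize(d)) for d in corpus]
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for d in docs for tok in d)
    selected = set()
    for query in queries:
        terms = set(tokenize(query))
        scored = []
        for idx, doc in enumerate(docs):
            score = sum(
                doc[t] * math.log(1 + n_docs / doc_freq[t])
                for t in terms
                if t in doc
            )
            scored.append((score, idx))
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        selected.update(idx for score, idx in ranked[:top_k] if score > 0)
    return sorted(selected)


def toy_generate(seed):
    """Hypothetical LLM stub: produce a reasoning-style and a knowledge-style query."""
    return [f"how to solve {seed} problems", f"definition of {seed}"]


# Toy stand-in for a public corpus.
corpus = [
    "To solve a quadratic equation, apply the quadratic formula step by step.",
    "Our store sells shoes and jackets at low prices.",
    "The French Revolution began in 1789 and reshaped European politics.",
]
seeds = ["quadratic equation", "french revolution"]

queries = expand_seeds(seeds, toy_generate)
selected = retrieve(queries, corpus, top_k=2)
print(selected)
```

In this toy setting the two domain-relevant documents are selected and the unrelated shopping text is left out, since it shares no terms with any bootstrapped query. A real pipeline would replace the stub with an actual model call and the overlap score with a retriever that scales to a web-sized corpus.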