With advances in deep learning, general-purpose large models such as GPT-4 have demonstrated exceptional capabilities across many domains. Nevertheless, demand remains for high-quality, domain-specific outputs in areas such as healthcare, law, and finance. This paper first reviews existing large models for specialized domains and discusses their limitations. To address the needs of specific domains, we introduce ``MiChao-HuaFen 1.0'', a pre-training corpus tailored to the news and government sectors. The corpus was collected from publicly available internet data from 2022 and went through multiple rounds of cleaning and processing to ensure high quality and reliable provenance, and it is designed to support consistent, stable updates. It not only supports the pre-training of large models for Chinese vertical domains but also helps advance deep learning research and applications in related fields.