Large language models (LLMs) have demonstrated great potential in natural language processing tasks within the financial domain. In this work, we present a Chinese Financial Generative Pre-trained Transformer framework, named CFGPT, which includes a dataset~(CFData) for pre-training and supervised fine-tuning, a financial LLM~(CFLLM) to adeptly manage financial texts, and a deployment framework~(CFAPP) designed to navigate real-world financial applications. CFData comprises a pre-training dataset and a supervised fine-tuning dataset. The pre-training dataset collates Chinese financial data and analytics, alongside a smaller subset of general-purpose text, with 584M documents and 141B tokens in total. The supervised fine-tuning dataset is tailored to six distinct financial tasks, covering various facets of financial analysis and decision-making, with 1.5M instruction pairs and 1.5B tokens in total. CFLLM, which is based on InternLM-7B to balance model capability and size, is trained on CFData in two stages: continued pre-training and supervised fine-tuning. CFAPP is centered on LLMs and augmented with additional modules to ensure multifaceted functionality in real-world applications. Our code is released at https://github.com/TongjiFinLab/CFGPT.