The rapid advancement of Large Language Models (LLMs) has prompted extensive discussion of their potential to improve the returns of quantitative stock trading strategies. This discussion centers on using the strong text comprehension capabilities of LLMs to extract sentiment factors that support informed, high-frequency portfolio adjustments. To support the successful application of LLMs to the analysis of Chinese financial texts and the subsequent development of trading strategies for the Chinese stock market, we provide a rigorous and comprehensive benchmark, together with a standardized back-testing framework, for objectively assessing the efficacy of different types of LLMs in the specialized task of extracting sentiment factors from Chinese news text. To illustrate how the benchmark works, we evaluate three distinct models: 1) a generative LLM (ChatGPT), 2) a Chinese language-specific pre-trained LLM (Erlangshen-RoBERTa), and 3) a financial domain-specific fine-tuned LLM classifier (Chinese FinBERT). We apply them directly to the task of extracting sentiment factors from a large volume of Chinese news summaries. We then build quantitative trading strategies on the derived sentiment factors, run back-tests under realistic trading scenarios, and evaluate their performance with our benchmark. This comparative analysis raises the question of which element matters most for improving an LLM's performance at sentiment factor extraction. By evaluating the LLMs on the same benchmark, under the same standardized experimental procedures designed with expertise in quantitative trading, we take a first step toward answering that question.
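The pipeline described above (per-article sentiment scores, aggregated into a daily per-stock factor, then traded and back-tested) can be illustrated with a minimal sketch. This is not the benchmark's actual implementation: the function names, the equal-weight top-k rule, the flat turnover cost, and the input layout are all assumptions made for illustration, and the sentiment scores are assumed to have already been produced by some model.

```python
from collections import defaultdict
from statistics import mean


def daily_sentiment_factor(news_scores):
    """Average per-article sentiment into one factor value per (date, ticker).

    news_scores: iterable of (date, ticker, score) tuples, where score is a
    model-produced sentiment value (hypothetical layout, assumed in [-1, 1]).
    Returns {date: {ticker: mean_score}}.
    """
    buckets = defaultdict(list)
    for date, ticker, score in news_scores:
        buckets[(date, ticker)].append(score)
    factor = defaultdict(dict)
    for (date, ticker), scores in buckets.items():
        factor[date][ticker] = mean(scores)
    return dict(factor)


def backtest_long_top_k(factor, next_day_returns, k=1, cost_per_turnover=0.001):
    """Toy long-only back-test (assumed rule, not the paper's strategy).

    On each date, hold the k tickers with the highest factor value at equal
    weight and earn their next-day return. Using next-day returns keeps the
    signal strictly earlier than the return it trades on (no look-ahead).
    A flat cost is charged on the fraction of holdings that changed.
    Returns a list of (date, net_return).
    """
    results = []
    previous = set()
    for date in sorted(factor):
        # Rank by factor value (descending), breaking ties by ticker name.
        ranked = sorted(factor[date], key=lambda t: (-factor[date][t], t))
        holdings = ranked[:k]
        if not holdings:
            continue
        gross = mean(next_day_returns[date][t] for t in holdings)
        turnover = len(set(holdings) - previous) / len(holdings)
        results.append((date, gross - cost_per_turnover * turnover))
        previous = set(holdings)
    return results


# Made-up example data, for illustration only.
news = [
    ("2023-01-03", "A", 0.8),
    ("2023-01-03", "A", 0.4),
    ("2023-01-03", "B", -0.2),
    ("2023-01-04", "A", 0.1),
    ("2023-01-04", "B", 0.9),
]
returns = {
    "2023-01-03": {"A": 0.02, "B": -0.01},
    "2023-01-04": {"A": 0.00, "B": 0.03},
}
factor = daily_sentiment_factor(news)
pnl = backtest_long_top_k(factor, returns, k=1)
```

Because every model's scores pass through the same aggregation and the same trading rule, differences in the resulting return series can be attributed to the sentiment extraction step, which is the point of a standardized comparison.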