The application of large language models (LLMs) in domain-specific contexts, including finance, has expanded rapidly. Domain-specific LLMs are typically evaluated by their performance on downstream tasks relevant to the domain. In this work, we present a detailed analysis of fine-tuning LLMs for such tasks. Somewhat counterintuitively, we find that in domain-specific settings, fine-tuning exclusively on the target task is not always the most effective strategy. Instead, multi-task fine-tuning, in which models are trained on a cocktail of related tasks, can significantly enhance performance. We demonstrate how this approach enables a small model such as Phi-3-Mini to achieve state-of-the-art results, even surpassing the much larger GPT-4o model on financial benchmarks. Our study involves a large-scale experiment in which we train over 200 models using several widely adopted LLMs as baselines, and it empirically confirms the benefits of multi-task fine-tuning. Additionally, we explore the use of general instruction data as a form of regularization, showing that it helps mitigate performance degradation. We also investigate the inclusion of mathematical data, finding improvements in numerical reasoning that transfer effectively to financial tasks. Finally, we note that while fine-tuning for downstream tasks yields targeted improvements in task performance, it does not necessarily produce broader gains in domain knowledge or complex domain reasoning.