The application of large language models (LLMs) in domain-specific contexts, including finance, has expanded rapidly. Domain-specific LLMs are typically evaluated on their performance across downstream tasks relevant to the domain. In this work, we present a detailed analysis of fine-tuning LLMs for such tasks. Somewhat counterintuitively, we find that in domain-specific settings, fine-tuning exclusively on the target task is not always the most effective strategy. Instead, multi-task fine-tuning, in which models are trained on a mixture of related tasks, can significantly improve performance. We demonstrate how this approach enables a small model, such as Phi-3-Mini, to achieve state-of-the-art results, even surpassing the much larger GPT-4o on financial benchmarks. Our study comprises over 200 training experiments using several widely adopted LLMs as baselines, and it empirically confirms the benefits of multi-task fine-tuning. Additionally, we explore the use of general instruction data as a form of regularization, suggesting that it helps minimize performance degradation. We also investigate the inclusion of mathematical data, finding improvements in numerical reasoning that transfer effectively to financial tasks. Finally, we note that while fine-tuning for downstream tasks yields targeted improvements in task performance, it does not necessarily produce broader gains in domain knowledge or complex domain reasoning.