Despite Spanish's pivotal role in the global finance industry, Spanish financial natural language processing (NLP) and application studies lag well behind their English counterparts, especially in the era of large language models (LLMs). To bridge this gap, we present Toisón de Oro, the first bilingual framework that establishes instruction datasets, fine-tuned LLMs, and an evaluation benchmark for financial LLMs in Spanish jointly with English. We construct a rigorously curated bilingual instruction dataset of over 144K Spanish and English samples drawn from 15 datasets covering 7 tasks. Building on this dataset, we introduce FinMA-ES, an LLM designed for bilingual financial applications. We evaluate our model and existing LLMs on FLARE-ES, the first comprehensive bilingual evaluation benchmark, which comprises 21 datasets covering 9 tasks. The FLARE-ES results reveal a significant multilingual performance gap and bias in existing LLMs. FinMA-ES models surpass SOTA LLMs such as GPT-4 on Spanish financial tasks, owing to strategic instruction tuning and the use of data from diverse linguistic resources, which highlights the positive impact of cross-linguistic transfer. All our datasets, models, and benchmarks have been released.