The emergence of Large Language Models (LLMs), such as ChatGPT, has revolutionized general natural language processing (NLP) tasks. However, their expertise in the financial domain has not been comprehensively evaluated. To assess the ability of LLMs to solve financial NLP tasks, we present FinLMEval, a framework for Financial Language Model Evaluation, comprising nine datasets designed to evaluate the performance of language models. This study compares the performance of encoder-only language models and decoder-only language models. Our findings reveal that while some decoder-only LLMs demonstrate notable performance across most financial tasks via zero-shot prompting, they generally lag behind fine-tuned expert models, especially when dealing with proprietary datasets. We hope this study provides foundational evaluations for continuing efforts to build more advanced LLMs in the financial domain.