Large language models (LLMs) usually rely on extensive training datasets. In the financial domain, creating numerical reasoning datasets that mix tables and long text often incurs substantial manual annotation costs. To address limited data resources and reduce annotation cost, we introduce FinLLMs, a method that uses large language models to generate financial question-answering data from common financial formulas. First, we compile a list of common financial formulas and construct a graph based on the variables these formulas employ. We then augment the formula set by traversing this graph and merging manually annotated formulas that share variables into new formulas. Finally, building on the collected formula set, we use GPT-3.5 to generate financial question-answering data that encompasses both tabular information and long textual content. Our experiments demonstrate that synthetic data generated by FinLLMs effectively enhances the performance of several large-scale numerical reasoning models in the financial domain, and that it outperforms two established financial question-answering benchmark datasets.
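The formula-graph step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three seed formulas, the function names (`variables`, `build_graph`, `merge`), and the choice to merge by substituting one formula into another are all assumptions made for illustration. The abstract only states that formulas sharing variables are merged by traversing a graph.

```python
import re

# Hypothetical seed formulas (assumed for illustration):
# output variable -> expression over input variables.
FORMULAS = {
    "gross_profit": "revenue - cost_of_goods_sold",
    "gross_margin": "gross_profit / revenue",
    "net_income": "gross_profit - operating_expenses - taxes",
}


def variables(expr):
    """Return the set of variable names that appear in an expression string."""
    return set(re.findall(r"[A-Za-z_]\w*", expr))


def build_graph(formulas):
    """Link formula A to formula B when A's output is one of B's inputs."""
    return {
        out: sorted(
            other
            for other, expr in formulas.items()
            if other != out and out in variables(expr)
        )
        for out in formulas
    }


def merge(formulas, graph):
    """Walk the graph edges and substitute each producer into its consumers."""
    merged = {}
    for producer, consumers in graph.items():
        for consumer in consumers:
            pattern = r"\b" + re.escape(producer) + r"\b"
            replacement = "(" + formulas[producer] + ")"
            # A callable replacement avoids any escape handling in re.sub.
            merged[consumer + "_via_" + producer] = re.sub(
                pattern, lambda _m: replacement, formulas[consumer]
            )
    return merged


graph = build_graph(FORMULAS)
merged = merge(FORMULAS, graph)
for name, expr in merged.items():
    print(name, "=", expr)
```

In this sketch, each merged expression is a candidate for a new multi-step question, since answering it requires values for the inputs of both original formulas. A fuller pipeline would presumably pass such formulas, together with generated table and text context, to the LLM, but that part is not sketched here.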