Accurate evaluation of financial question answering (QA) systems requires a comprehensive dataset covering diverse question types and contexts. However, current financial QA datasets are limited in both scope and question complexity. This work introduces FinTextQA, a novel dataset for long-form question answering (LFQA) in finance. FinTextQA comprises 1,262 high-quality, source-attributed QA pairs extracted and selected from finance textbooks and government agency websites. Moreover, we developed a Retrieval-Augmented Generation (RAG)-based LFQA system comprising an embedder, a retriever, a reranker, and a generator. A multi-faceted evaluation approach, including human ranking, automatic metrics, and GPT-4 scoring, was employed to benchmark the performance of different LFQA system configurations under conditions of heightened noise. The results indicate that: (1) among all compared generators, Baichuan2-7B competes closely with GPT-3.5-turbo on accuracy score; (2) the most effective system configuration on our dataset uses Ada2 as the embedder, Automated Merged Retrieval as the retriever, Bge-Reranker-Base as the reranker, and Baichuan2-7B as the generator; (3) models become less susceptible to noise once the context length reaches a specific threshold.
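The four-stage pipeline named in the abstract (embedder, retriever, reranker, generator) can be illustrated with a minimal sketch. This is not the authors' implementation: every function name below is hypothetical, the bag-of-words embedder and term-overlap reranker are toy stand-ins for models such as Ada2 and Bge-Reranker-Base, and the generator only builds the prompt that a language model would receive.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedder: bag-of-words term counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=3):
    """Retriever: keep the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def rerank(question, candidates, keep=2):
    """Toy reranker: reorder candidates by shared-term count (assumed heuristic)."""
    q_terms = set(question.lower().split())
    return sorted(
        candidates,
        key=lambda c: len(q_terms & set(c.lower().split())),
        reverse=True,
    )[:keep]


def generate(question, contexts):
    """Placeholder generator: returns the prompt an LLM would be asked to complete."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{joined}\nQuestion: {question}\nAnswer:"


def answer(question, chunks):
    """Run the full chain: retrieve, rerank, then generate."""
    return generate(question, rerank(question, retrieve(question, chunks)))


if __name__ == "__main__":
    # Illustrative corpus; the third chunk plays the role of injected noise.
    corpus = [
        "a bond is a fixed income instrument issued by governments or companies",
        "equity represents ownership in a company",
        "the weather today is sunny",
        "bond prices fall when interest rates rise",
    ]
    print(answer("what is a bond", corpus))
```

Swapping any single stage (for example, a different reranker) while holding the others fixed mirrors how different system configurations could be compared, and adding irrelevant chunks to the corpus is one simple way to simulate noisier retrieval conditions.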