As large language models (LLMs) extend natural language processing to long inputs, rigorous and systematic analyses are needed to understand their abilities and behavior. A salient application is summarization, due to both its ubiquity and its controversy (e.g., some researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports are not only long but also make extensive use of numbers and tables. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Cohere. We find that GPT-3.5 and Cohere fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the generated summaries and identify a position bias in LLMs. This position bias disappears for Claude after shuffling the input, which suggests that Claude recognizes important information regardless of its position. We also conduct a comprehensive investigation of numeric data usage in LLM-generated summaries and offer a taxonomy of numeric hallucinations. We employ prompt engineering to improve GPT-4's use of numbers, with limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs compared to GPT-4. The generated summaries and evaluation code are available at https://github.com/ChicagoHAI/characterizing-multimodal-long-form-summarization.