As large language models (LLMs) expand the power of natural language processing to handle long inputs, rigorous and systematic analyses are necessary to understand their abilities and behavior. A salient application is summarization, due to its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports not only are long but also use numbers and tables extensively. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Command. We find that GPT-3.5 and Command fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summaries and identify a position bias in LLMs. For Claude, this position bias disappears after the input is shuffled, which suggests that Claude is able to recognize important information. We also conduct a comprehensive investigation of the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucination. We employ prompt engineering to improve GPT-4's use of numbers, with limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs compared to GPT-4.