As large language models (LLMs) extend the reach of natural language processing to long inputs, rigorous and systematic analyses are needed to understand their abilities and behavior. Summarization is a salient application, given its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports are not only long but also make extensive use of numbers and tables. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Command. We find that GPT-3.5 and Command fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summaries and identify a position bias in LLMs. This position bias disappears after shuffling Claude's input, which suggests that Claude can recognize important information regardless of where it appears. We also conduct a comprehensive investigation of the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucinations. We employ prompt engineering to improve GPT-4's use of numbers, with only limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs relative to GPT-4.