As we embark on a new era of LLMs, it becomes increasingly crucial to understand their capabilities, limitations, and differences. To make progress in this direction, we aim to build a deeper understanding of the gaps between massive LLMs (e.g., ChatGPT) and smaller yet effective open-source LLMs and their distilled counterparts. To this end, we focus on long-form question answering (LFQA) because it has several practical and impactful applications (e.g., troubleshooting and customer service) yet remains understudied and challenging for LLMs. We propose a method for generating questions from abstractive summaries and show that generating follow-up questions from summaries of long documents creates a challenging setting in which LLMs must reason and infer over long contexts. Our experimental results confirm that: (1) generating questions from abstractive summaries poses a challenging setup for LLMs and reveals performance gaps between LLMs like ChatGPT and open-source LLMs (Alpaca, Llama); and (2) open-source LLMs exhibit decreased reliance on context for questions generated from the original document, but their generation capabilities drop significantly on questions generated from summaries, especially for longer contexts (>1024 tokens).