We study the ability of large language models (LLMs) to generate comprehensive and accurate book summaries solely from their internal knowledge, without recourse to the original text. Employing a diverse set of books and multiple LLM architectures, we examine whether these models can synthesize meaningful narratives that align with established human interpretations. Evaluation is performed with an LLM-as-a-judge paradigm: each AI-generated summary is compared against a high-quality, human-written summary via a cross-model assessment, in which all participating LLMs evaluate not only their own outputs but also those produced by others. This methodology enables the identification of potential biases, such as the tendency of models to favor their own summarization style over that of others. In addition, alignment between the human-crafted and LLM-generated summaries is quantified using ROUGE and BERTScore metrics, assessing the degree of lexical and semantic correspondence. The results reveal nuanced variations in content representation and stylistic preferences among the models, highlighting both strengths and limitations inherent in relying on internal knowledge for summarization tasks. These findings contribute to a deeper understanding of how LLMs internally encode factual information and of the dynamics of cross-model evaluation, with implications for the development of more robust natural language generative systems.
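The lexical-overlap component of the alignment measurement can be illustrated with a simplified sketch. This is not the evaluation pipeline from the study (which uses the full ROUGE suite and BERTScore); it is a minimal pure-Python ROUGE-1 F1 computation, shown only to make the unigram-overlap idea concrete:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between a candidate
    summary and a reference summary (whitespace tokenization)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each shared token counts at most min(cand, ref) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

In practice one would use an established implementation (e.g. a maintained ROUGE package) with proper tokenization and stemming; this sketch only captures the core precision/recall-over-unigrams structure that ROUGE-1 formalizes.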
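The self-preference bias mentioned above can be checked directly once the cross-model assessment produces a judge-by-author score matrix. The following is a hypothetical sketch (the model names and score scale are illustrative, not from the study): for each judge, it compares the score given to its own summary with the mean score it gives to the other models' summaries.

```python
def self_preference(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Given scores[judge][author] from a cross-model evaluation,
    return each judge's self-score minus its mean score for others.
    A positive value suggests the judge favors its own summaries."""
    bias = {}
    for judge, row in scores.items():
        other_scores = [s for author, s in row.items() if author != judge]
        bias[judge] = row[judge] - sum(other_scores) / len(other_scores)
    return bias

# Illustrative matrix: judge "model_a" rates its own summary 9/10
# but gives "model_b" only 7/10, hinting at self-preference.
example = {
    "model_a": {"model_a": 9.0, "model_b": 7.0},
    "model_b": {"model_a": 8.0, "model_b": 8.0},
}
```

A judge with a near-zero value rates its own output no differently from others'; consistently positive values across judges would be the bias signature the cross-model design is meant to surface.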