In this work, we investigate the controllability of large language models (LLMs) on scientific summarization tasks. We identify key stylistic and content-coverage factors that characterize different types of summaries, such as paper reviews, abstracts, and lay summaries. By controlling stylistic features, we find that non-fine-tuned LLMs outperform humans on the MuP review-generation task, both in similarity to reference summaries and in human preference. We also show that keyword-based classifier-free guidance (CFG) improves the controllability of LLMs while achieving lexical overlap comparable to strong fine-tuned baselines on arXiv and PubMed. However, our results also indicate that LLMs cannot consistently generate long summaries of more than eight sentences, and that they exhibit limited capacity to produce highly abstractive lay summaries. Although LLMs demonstrate strong generic summarization competence, sophisticated content control without costly fine-tuning remains an open problem for domain-specific applications.
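The classifier-free guidance mentioned above can be sketched at the logit level: each decoding step runs the model twice, once with the keyword-augmented prompt and once without, and interpolates the two logit vectors. The snippet below is a minimal illustration with toy logits, not the paper's implementation; the guidance weight `gamma` and the example values are assumptions for demonstration.

```python
import math

def cfg_logits(cond_logits, uncond_logits, gamma=1.5):
    """Classifier-free guidance at the logit level:
    l = uncond + gamma * (cond - uncond).
    gamma = 1 recovers the conditional model; gamma > 1 strengthens
    adherence to the conditioning text (e.g. keyword prompts)."""
    return [u + gamma * (c - u) for c, u in zip(cond_logits, uncond_logits)]

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy vocabulary of three tokens; token 2 is favored when the
# keyword-conditioned prompt is present.
cond = [0.1, 0.2, 2.0]
uncond = [0.1, 0.2, 0.5]
guided = cfg_logits(cond, uncond, gamma=2.0)
probs = softmax(guided)
```

Here guidance amplifies the gap that conditioning introduces, so the keyword-favored token receives a higher sampling probability than under the conditional distribution alone.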