While large language models (LLMs) have become excellent writing assistants, they still struggle with quotation generation: they either hallucinate when providing factual quotations or fail to provide quotes that exceed human expectations. To bridge this gap, we systematically study how to evaluate and improve LLMs' performance on the quotation generation task. We first establish a holistic, automatic evaluation system for quotation generation, consisting of five criteria, each paired with a corresponding automatic metric. To improve LLMs' quotation generation abilities, we construct a bilingual knowledge base that is broad in scope and rich in dimensions, containing up to 32,022 quotes. Moreover, guided by our criteria, we further design a quotation-specific metric to rerank the quotations retrieved from the knowledge base. Extensive experiments show that our metrics correlate strongly with human preferences. Existing LLMs struggle to generate desired quotes, but our quotation knowledge base and reranking metric help narrow this gap. Our dataset and code are publicly available at https://github.com/GraceXiaoo/QUILL.