Attribution makes LLM output more credible by attaching citations to generated sentences, so that users can trace each claim back to its original source and verify the reliability of the output. However, existing instruction-tuned attributed LLMs often fail to interpret the contextual semantics of citation symbols (e.g., [i]) during text generation. This shortcoming stems from insufficient awareness of the context surrounding citation markers, which in turn leads to disjointed references and poor integration of retrieved knowledge into the generated content. To address this issue, we propose a novel \textbf{C}ontextual-aware \textbf{C}itation generation framework (\textbf{C$^2$}-\textbf{Cite}) that explicitly models the semantic relationships between citation markers and the content they reference. Specifically, the framework adopts a contextual citation alignment mechanism: it first encodes the retrieved document contexts into the symbol representations of citations, then aligns the marker numbers by decoding information from a citation router function. This mechanism turns citation markers from generic placeholders into active knowledge pointers that link to the referenced source information. Experimental results on the ALCE benchmark across three datasets validate our framework C$^2$-Cite++: it outperforms the SOTA baseline by an average of 5.8\% in citation quality and 17.4\% in response correctness. The implementation is publicly available at https://github.com/BAI-LAB/c2cite
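To make the two steps of the contextual citation alignment mechanism concrete, the toy sketch below pools each retrieved document's token embeddings into a representation for its citation marker, then lets a router score a decoder hidden state against those representations to choose the marker number. It is only an illustration under strong assumptions, not the authors' implementation: mean pooling, dot-product scoring, and the names `encode_citation_symbols` and `route_citation` are all hypothetical choices that do not come from the abstract.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed pooling choice)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def encode_citation_symbols(doc_token_embeddings):
    """Step 1 (hypothetical): give marker [i] a representation built from
    the context of the i-th retrieved document. Markers are 1-indexed."""
    return {i + 1: mean_pool(tokens) for i, tokens in enumerate(doc_token_embeddings)}


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_citation(hidden_state, marker_reprs):
    """Step 2 (hypothetical router): score the hidden state against every
    marker representation and return the best marker plus the distribution."""
    markers = sorted(marker_reprs)
    scores = [
        sum(h * r for h, r in zip(hidden_state, marker_reprs[m]))
        for m in markers
    ]
    probs = softmax(scores)
    best = max(range(len(markers)), key=lambda k: probs[k])
    return markers[best], probs


# Two toy "documents", each a list of 2-dimensional token embeddings.
docs = [
    [[1.0, 0.0], [0.8, 0.2]],  # document 1
    [[0.0, 1.0], [0.2, 0.8]],  # document 2
]
marker_reprs = encode_citation_symbols(docs)

# A hidden state that resembles document 2 should be routed to marker [2].
marker, probs = route_citation([0.1, 0.9], marker_reprs)
print(f"[{marker}]")
```

In this sketch the marker is no longer an arbitrary symbol: its representation is derived from the document it points to, which is the intuition behind treating markers as knowledge pointers rather than generic placeholders.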