Aligning Large Language Model Behavior with Human Citation Preferences

Most services built on large language models (LLMs) add citations to their output to enhance credibility. Recent research has paid increasing attention to the question of which reference documents to link to outputs. However, how LLMs recognize cite-worthiness and how this process should be controlled remain underexplored. In this study, we focus on what kinds of content LLMs currently tend to cite and how well that behavior aligns with human preferences. We construct a dataset to characterize the relationship between human citation preferences and LLM behavior. Web-derived texts are categorized into eight citation-motivation types, and pairwise citation preferences are exhaustively evaluated across all type combinations to capture fine-grained contrasts. Our results show that humans most frequently seek citations for medical text, and stronger models display a similar tendency. We also find that current models are as much as $27\%$ more likely than humans to add citations to text that is explicitly marked as needing citations on sources such as Wikipedia, and this overemphasis reduces alignment accuracy. Conversely, models systematically underselect numeric sentences ($-22.6\%$ relative to humans) and sentences containing personal names ($-20.1\%$), categories for which humans typically demand citations. Furthermore, experiments with Direct Preference Optimization demonstrate that model behavior can be calibrated to better match human citation preferences. We expect this study to provide a foundation for more fine-grained investigations into LLM citation preferences.
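The pairwise setup described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `type_pairs` and `agreement_rate` are hypothetical, types are represented by plain indices because the abstract does not name all eight, and the sketch assumes that "all type combinations" means unordered pairs of distinct types.

```python
from itertools import combinations

NUM_TYPES = 8  # the abstract states eight citation-motivation types


def type_pairs(num_types=NUM_TYPES):
    """Enumerate all unordered pairs of distinct type indices.

    Assumption: same-type pairs are excluded; the abstract does not say.
    """
    return list(combinations(range(num_types), 2))


def agreement_rate(human_choices, model_choices):
    """Fraction of pairwise items where the model's pick matches the human pick.

    Each list holds one choice per item (for example "a" or "b"),
    aligned by position. Both names are hypothetical.
    """
    if len(human_choices) != len(model_choices):
        raise ValueError("choice lists must have the same length")
    if not human_choices:
        return 0.0
    matches = sum(h == m for h, m in zip(human_choices, model_choices))
    return matches / len(human_choices)
```

Under these assumptions, eight types give 28 distinct type pairs, and the agreement rate can be computed per pair to see where model and human preferences diverge.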
