Large language models are increasingly used to curate bibliographies, raising the question: are their reference lists distinguishable from human ones? We build paired citation graphs, ground truth and GPT-4o-generated (from parametric knowledge), for 10,000 focal papers ($\approx$ 275k references) from SciSciNet, and add a field-matched random baseline that preserves out-degree and field distributions while breaking latent structure. We compare (i) structure-only node features (degree/closeness/eigenvector centrality, clustering, edge count) with (ii) 3072-D title/abstract embeddings, using a random forest (RF) on graph-level aggregates and graph neural networks (GNNs) with node features. Structure alone barely separates GPT from ground truth (RF accuracy $\approx$ 0.60) despite cleanly rejecting the random baseline ($\approx$ 0.89--0.92). By contrast, embeddings sharply increase separability: an RF on aggregated embeddings reaches $\approx$ 0.83, and GNNs with embedding node features achieve 93\% test accuracy on GPT vs.\ ground truth. We show the robustness of these findings by replicating the pipeline with Claude Sonnet 4.5 and with multiple embedding models (OpenAI and SPECTER): RF separability for ground truth vs.\ Claude is $\approx 0.77$, again with clean rejection of the random baseline. Thus, LLM bibliographies generated purely from parametric knowledge closely mimic human citation topology but leave detectable semantic fingerprints; detection and debiasing should target content signals rather than global graph structure.
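The structure-only branch of the pipeline can be sketched as follows. This is an illustrative toy, not the authors' code: graph-level aggregates (mean/std of degree, closeness centrality, and clustering, plus edge count) are computed per graph and fed to a random forest. The two toy graph generators below (preferential attachment vs.\ density-matched Erd\H{o}s--R\'enyi) stand in for the real ground-truth and random-baseline citation graphs and are assumptions for demonstration only.

```python
# Sketch: graph-level structural aggregates + Random Forest classification.
# Toy graphs stand in for real citation graphs; eigenvector centrality is
# omitted here for brevity and numerical stability.
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def graph_features(G):
    """Aggregate node-level structural features into one graph-level vector."""
    deg = [d for _, d in G.degree()]
    clo = list(nx.closeness_centrality(G).values())
    clu = list(nx.clustering(G).values())
    feats = []
    for vals in (deg, clo, clu):
        feats += [float(np.mean(vals)), float(np.std(vals))]
    feats.append(float(G.number_of_edges()))
    return np.array(feats)

# Build paired toy graphs: structured (Barabasi-Albert) vs. a random
# baseline with the same number of nodes and edges (Erdos-Renyi).
X, y = [], []
for i in range(40):
    G1 = nx.barabasi_albert_graph(60, 3, seed=i)
    G2 = nx.gnm_random_graph(60, G1.number_of_edges(), seed=i)
    X += [graph_features(G1), graph_features(G2)]
    y += [0, 1]

X, y = np.array(X), np.array(y)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:60], y[:60])          # first 30 pairs for training
acc = clf.score(X[60:], y[60:])  # last 10 pairs held out
print(f"held-out accuracy: {acc:.2f}")
```

On these toy generators the structural gap is large (heavy-tailed vs.\ near-uniform degrees), so the RF separates them easily; the abstract's point is that the analogous gap between GPT-generated and ground-truth citation graphs is far smaller ($\approx$ 0.60) than between either and the random baseline.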