Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features changes PubMed test accuracy by -17.0 +/- 0.3 pp and Cora by -4.3 +/- 0.6 pp (CiteSeer: -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses into a gain on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets, Delta_sig correlates with the concatenation cost more strongly than homophily does at the point estimate (r^2 = 0.38 vs. 0.06), although with N = 9 the bootstrap CIs overlap. The bootstrap-best change-point is tau = 13.8 pp, and the rule "Delta_sig <= tau predicts non-positive concat cost" classifies 7/9 datasets correctly. Since 60% of bootstrap samples place tau in [5, 30] pp, we treat Delta_sig as an interpretive lens rather than a precision filter. A dimension-controlled ablation on PubMed places the LLM-feature drop between same-source PCA (-2.3 pp) and same-dim Gaussian noise (-37.3 pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations fit a power law |Delta_concat| proportional to (sqrt(d_l/n))^1.31 with r^2 = 0.97; the low-Delta_sig, small-n corner is exactly where the headline -17 pp PubMed deficit appears.
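As a minimal sketch of how a scaling relation of the form |Delta_concat| proportional to (sqrt(d_l/n))^alpha could be fitted, the snippet below runs ordinary least squares in log-log space. It is an illustration under stated assumptions, not the fitting procedure used for the reported result: the function name `fit_power_law`, the grids of d_l and n values, and the prefactor are hypothetical, and the data are synthetic and noise-free, generated from a known exponent only to check that the fit recovers it. The abstract does not define d_l or n further, so they are treated here simply as two positive quantities.

```python
import math


def fit_power_law(x, y):
    """Fit y = c * x**alpha by least squares on (log x, log y).

    Returns (alpha, c, r2), where r2 is the squared Pearson
    correlation between log x and log y.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    alpha = sxy / sxx            # slope in log-log space = exponent
    log_c = my - alpha * mx      # intercept in log-log space = log prefactor
    r2 = (sxy * sxy) / (sxx * syy)
    return alpha, math.exp(log_c), r2


# Synthetic placeholder grid (NOT the paper's configurations):
# three hypothetical feature dimensions times three hypothetical sample sizes.
d_l_values = [64, 128, 384]
n_values = [60, 600, 6000]
configs = [(d, n) for d in d_l_values for n in n_values]

# Predictor from the abstract's functional form: sqrt(d_l / n).
x = [math.sqrt(d / n) for d, n in configs]

# Noise-free synthetic magnitudes generated from a known exponent,
# used only to sanity-check the fitting routine.
TRUE_ALPHA, TRUE_C = 1.31, 10.0
y = [TRUE_C * v ** TRUE_ALPHA for v in x]

alpha, c, r2 = fit_power_law(x, y)
print(f"alpha={alpha:.3f}, c={c:.3f}, r2={r2:.3f}")
```

With measured accuracy differences in place of the synthetic `y`, the same routine would return an estimated exponent and r^2; with only nine points, a bootstrap over configurations would be a natural way to gauge the uncertainty of the exponent.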