LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features changes test accuracy by -17.0 +/- 0.3 pp on PubMed and by -4.3 +/- 0.6 pp on Cora (CiteSeer: -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses to a gain on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets, Delta_sig correlates with the concatenation cost more strongly than homophily does at the point estimate (r^2 = 0.38 vs. 0.06; N = 9), although the bootstrap CIs overlap. The bootstrap-best change-point is tau = 13.8 pp, and the rule "Delta_sig <= tau predicts non-positive concat cost" classifies 7/9 datasets correctly. Because the change-point is only loosely determined (60% of bootstrap samples place tau in [5, 30] pp), we treat Delta_sig as an interpretive lens rather than a precision filter. A dimension-controlled ablation on PubMed places the LLM-feature drop between same-source PCA (-2.3 pp) and same-dim Gaussian noise (-37.3 pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations fit a power law |Delta_concat| proportional to (sqrt(d_l/n))^1.31 with r^2 = 0.97; the low-Delta_sig, small-n corner is exactly where the headline -17 pp PubMed deficit appears.
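As an illustration of how a scaling fit of the form |Delta_concat| proportional to (sqrt(d_l/n))^alpha could be estimated, the following is a minimal sketch that assumes an ordinary least-squares fit in log-log space; the abstract does not state the fitting procedure actually used. The function name `fit_power_law` and every number in the arrays are hypothetical, noise-free placeholders rather than the reported PubMed configurations, and `d_l` and `n` simply mirror the symbols in the abstract.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = c * x**alpha by least squares on (log x, log y).

    Returns (alpha, c, r2), where r2 is computed in log-log space.
    """
    log_x = np.log(np.asarray(x, dtype=float))
    log_y = np.log(np.asarray(y, dtype=float))
    # A degree-1 polynomial fit returns [slope, intercept].
    alpha, log_c = np.polyfit(log_x, log_y, deg=1)
    pred = alpha * log_x + log_c
    ss_res = np.sum((log_y - pred) ** 2)
    ss_tot = np.sum((log_y - log_y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return float(alpha), float(np.exp(log_c)), float(r2)


# Hypothetical placeholder configurations (NOT the paper's data):
# nine (d_l, n) pairs and a synthetic drop generated from an exact power law.
d_l = np.array([64, 128, 256, 384, 768, 64, 128, 256, 384], dtype=float)
n = np.array([60, 60, 60, 60, 60, 600, 600, 600, 600], dtype=float)
x = np.sqrt(d_l / n)
abs_delta_concat = 2.0 * x ** 1.31  # synthetic magnitudes, exponent chosen to match the abstract

alpha, c, r2 = fit_power_law(x, abs_delta_concat)
print(f"alpha={alpha:.2f}, c={c:.2f}, r2={r2:.3f}")
```

Because the placeholder data are generated without noise, the fit recovers the exponent essentially exactly. With real measurements the log-space r^2 would fall below 1, and a fit in linear space could give a somewhat different exponent.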
