Generative pre-training via discrete diffusion provides dense reconstruction supervision across all feature fields simultaneously, mitigating the representation collapse that data sparsity causes in CTR prediction. However, all existing generative CTR methods share a fundamental limitation: the reconstruction objective assigns equal training weight to every feature field, ignoring how widely reconstruction difficulty varies across high-cardinality ID fields, sparse categorical attributes, numerical values, and behavioral sequences. As a result, easy fields dominate the training gradients while the hardest, yet most informative, fields remain chronically underfit, a problem we term the generative difficulty imbalance. We propose HeteGenCTR, which resolves this imbalance through per-field learnable difficulty parameters trained jointly with the denoising network. This single learned signal drives two coordinated components without introducing additional hyperparameters: a self-balancing loss that automatically reallocates gradient budget toward harder fields and has a provably stable equilibrium, and a difficulty-guided attention mechanism that suppresses the influence of already-converged easy fields while amplifying cross-field information flow toward hard fields. Because both components share the same learned signal, they remain mutually consistent throughout training. Experiments on five CTR benchmarks and a seven-day online A/B test demonstrate consistent, statistically significant improvements over state-of-the-art baselines, with disproportionate gains for cold-start and long-tail users.
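To make the idea of a self-balancing loss concrete, the following is a minimal sketch of one way per-field difficulty parameters could reweight per-field reconstruction losses. It is not the HeteGenCTR formulation, which the abstract does not specify. Everything here is an assumption made for illustration: the softmax weighting, the rule that each difficulty parameter tracks the log of its field's loss, the learning rate, the function names, and the toy loss values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def field_weights(difficulty):
    """Map per-field difficulty scores to loss weights whose mean is 1."""
    n = len(difficulty)
    return [n * p for p in softmax(difficulty)]

def balanced_loss(field_losses, difficulty):
    """Weighted sum of per-field losses; harder fields get larger weights."""
    weights = field_weights(difficulty)
    return sum(w * l for w, l in zip(weights, field_losses))

def update_difficulty(difficulty, field_losses, lr=0.1):
    """One gradient step on 0.5 * (d_f - log L_f)^2 for each field.

    Assumed update rule (hypothetical): each difficulty parameter tracks
    the log of its field's current loss. Losses must be strictly positive.
    """
    return [d - lr * (d - math.log(l)) for d, l in zip(difficulty, field_losses)]

# Toy example with three hypothetical fields: easy, medium, hard.
field_losses = [0.05, 0.4, 2.0]   # made-up per-field reconstruction losses
difficulty = [0.0, 0.0, 0.0]      # start with equal weighting

for _ in range(200):
    difficulty = update_difficulty(difficulty, field_losses)

weights = field_weights(difficulty)
total = balanced_loss(field_losses, difficulty)
```

With the losses held fixed, this update is a simple contraction toward d_f = log L_f, so the weights settle at values proportional to each field's loss. That illustrates, in the simplest possible setting, what an equilibrium of such a scheme looks like; it says nothing about the stability result claimed for the actual method, where the losses change as the denoising network trains.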