Modern industrial recommender systems use a deep ranking model to score N candidates against the same user and context features. Standard implementations broadcast context features early in the forward pass, redundantly computing context-only operations N times per request. We present a rank-aware decomposition applicable to the dominant interaction mechanisms in modern recommender architectures: Factorization Machine (FM) pairwise products, Deep Cross Network (DCNv2) cross layers, self-attention, and fully connected (FC) projection layers. It is built on a single algebraic principle: any linear or bilinear operation over a rank-partitioned input admits an exact block decomposition that moves context-only computation from once-per-candidate to once-per-request while remaining identity-equivalent to the original model. Closed-form analysis and controlled ablation verify that savings scale quadratically with the number of context features. Applied to a production DLRM-style ranker without any architectural change, the decomposition increases per-pod throughput by 87.5% (a 47% reduction in peak pod count) at identical model predictions. The identity-equivalent decomposition applies only at the first layer of cross networks and self-attention, since each layer mixes ranks in its output. To extend savings across depth, we further introduce rDCN, an architectural variant of DCNv2 that maintains rank discipline across depth and matches DCNv2 accuracy within training noise at 67% fewer total FLOPs, and we sketch an analogous architectural variant for self-attention.
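The block-decomposition principle can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes that "rank-partitioned" means splitting the input into a context-only block shared by all candidates and a per-candidate block, and every name and size (`ctx`, `items`, `fm_decomposed`, `fc_decomposed`, `N`, `C`, `I`, `D`, `H`) is hypothetical. The FM example uses the standard sum-of-squares identity for pairwise dot products, and the FC example splits the weight matrix by rows.

```python
import numpy as np

# Hypothetical sizes: candidates, context fields, item fields, embedding dim, hidden units.
N, C, I, D, H = 8, 6, 3, 4, 5
rng = np.random.default_rng(0)
ctx = rng.normal(size=(C, D))        # context embeddings, shared by all N candidates
items = rng.normal(size=(N, I, D))   # per-candidate embeddings
W = rng.normal(size=((C + I) * D, H))  # FC weights over the flattened [context; item] input


def broadcast_concat(ctx, items):
    # Baseline input: copy the context block once per candidate, then concatenate.
    tiled = np.broadcast_to(ctx, (items.shape[0],) + ctx.shape)
    return np.concatenate([tiled, items], axis=1)  # shape (N, C + I, D)


def fm_naive(ctx, items):
    # Sum of all pairwise dot products, computed per candidate over C + I fields.
    x = broadcast_concat(ctx, items)
    s = x.sum(axis=1)
    return 0.5 * ((s * s).sum(axis=-1) - (x * x).sum(axis=(1, 2)))


def fm_decomposed(ctx, items):
    # Context-context block: computed once per request.
    s_c = ctx.sum(axis=0)
    cc = 0.5 * (s_c @ s_c - (ctx * ctx).sum())
    # Item-item and context-item blocks: computed once per candidate.
    s_i = items.sum(axis=1)
    ii = 0.5 * ((s_i * s_i).sum(axis=-1) - (items * items).sum(axis=(1, 2)))
    ci = s_i @ s_c
    return cc + ci + ii


def fc_naive(ctx, items, W):
    # Linear layer applied to the broadcast, flattened input.
    x = broadcast_concat(ctx, items).reshape(items.shape[0], -1)
    return x @ W


def fc_decomposed(ctx, items, W):
    # Split W by rows: the context rows are applied once, the item rows per candidate.
    n_ctx = ctx.size
    ctx_part = ctx.reshape(-1) @ W[:n_ctx]                       # shape (H,), once per request
    item_part = items.reshape(items.shape[0], -1) @ W[n_ctx:]    # shape (N, H)
    return ctx_part + item_part
```

In both cases the decomposed function returns the same values as the broadcast baseline up to floating-point rounding, which is the sense in which the decomposition leaves model predictions unchanged.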