This paper explores effective numerical feature embedding for Click-Through Rate prediction in streaming environments. Conventional static binning methods rely on offline statistics of numerical distributions; however, this inherently two-stage process often triggers semantic drift during bin boundary updates. While neural embedding methods enable end-to-end learning, they often discard explicit distributional information. Integrating such information end-to-end is challenging because streaming features often violate the i.i.d. assumption, precluding unbiased estimation of the population distribution via the expectation of order statistics. Furthermore, the critical context dependency of numerical distributions is often neglected. To this end, we propose DAES, an end-to-end framework designed to tackle numerical feature embedding in streaming training scenarios by integrating distributional information with an adaptive modulation mechanism. Specifically, we introduce an efficient reservoir-sampling-based distribution estimation method and two field-aware distribution modulation strategies to capture streaming distributions and field-dependent semantics. DAES significantly outperforms existing approaches as demonstrated by extensive offline and online experiments and has been fully deployed on a leading short-video platform with hundreds of millions of daily active users.
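To make the reservoir-sampling idea concrete, the sketch below keeps a fixed-size uniform sample of a streaming numerical feature and uses it to look up an empirical CDF value. It is a generic illustration using the classic Algorithm R, not the DAES estimator itself, whose details are not given in this abstract. The class name `ReservoirCDF` and its `capacity` and `seed` parameters are hypothetical names chosen for the example.

```python
import bisect
import random


class ReservoirCDF:
    """Fixed-size uniform sample of a stream (Algorithm R) with a CDF lookup.

    Hypothetical helper for illustration only; not the method from the paper.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.rng = random.Random(seed)  # seeded for reproducibility
        self.sample = []                # the reservoir
        self.seen = 0                   # number of stream items observed

    def update(self, value):
        """Observe one value from the stream."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            # Fill phase: keep everything until the reservoir is full.
            self.sample.append(value)
        else:
            # Replace a random slot with probability capacity / seen,
            # which keeps every observed item equally likely to be retained.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = value

    def cdf(self, value):
        """Fraction of the reservoir that is <= value."""
        if not self.sample:
            return 0.0
        ordered = sorted(self.sample)
        return bisect.bisect_right(ordered, value) / len(ordered)


if __name__ == "__main__":
    estimator = ReservoirCDF(capacity=500)
    for x in range(10_000):
        estimator.update(float(x))
    print(estimator.cdf(5_000.0))
```

A uniform reservoir weights the entire history equally, so it approximates the distribution of everything seen so far rather than the most recent traffic. A streaming deployment that needs to follow distribution shift would likely need some form of recency weighting, which this sketch deliberately leaves out.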