Factorization machine (FM) variants are widely used in large-scale real-time content recommendation systems, since they offer an excellent balance between model accuracy and low computational cost for training and inference. These systems are trained on tabular data with both numerical and categorical columns. Incorporating numerical columns poses a challenge: they are typically handled using a scalar transformation or binning, either of which can be learned or chosen a priori. In this work, we provide a systematic and theoretically justified way to incorporate numerical features into FM variants by encoding each one as a vector of function values for a set of functions of one's choice. We view factorization machines as approximators of segmentized functions, namely, functions from a field's value to the real numbers, assuming the remaining fields are assigned some given constants, which we refer to as the segment. From this perspective, we show that our technique yields a model that learns segmentized functions of the numerical feature that lie in the span of the chosen set of functions, with spanning coefficients that vary between segments. Hence, to improve model accuracy we advocate the use of functions known to have strong approximation power, and propose the B-spline basis because of its well-known approximation power, availability in software libraries, and efficiency. Our technique preserves fast training and inference, and requires only a small modification to the computational graph of an FM model. It is therefore easy to incorporate into an existing system to improve its performance. Finally, we back our claims with a set of experiments: a synthetic experiment, a performance evaluation on several datasets, and an A/B test on a real online advertising system, which shows improved performance.
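To make the encoding concrete, the following is a minimal sketch of the idea described above, not the authors' implementation. It assumes a plain second-order FM, a single numerical feature already scaled to [0, 1], and a clamped cubic B-spline basis evaluated with the Cox-de Boor recursion. All names (`bspline_basis`, `fm_score`, the `params` layout) are hypothetical and chosen for illustration only.

```python
import numpy as np

def bspline_basis(x, n_basis=8, degree=3):
    """Values of n_basis clamped B-spline basis functions on [0, 1] at scalar x."""
    # Clamped knot vector (assumption): boundary knots are repeated so that
    # the basis functions are non-negative and sum to 1 on [0, 1).
    inner = np.linspace(0.0, 1.0, n_basis - degree + 1)
    t = np.concatenate([np.zeros(degree), inner, np.ones(degree)])
    # Clip into [0, 1) so that x = 1 falls in the last knot interval.
    x = min(max(float(x), 0.0), np.nextafter(1.0, 0.0))
    # Degree-0 basis: indicator of the knot interval containing x.
    b = np.array([1.0 if t[i] <= x < t[i + 1] else 0.0
                  for i in range(len(t) - 1)])
    # Cox-de Boor recursion, treating 0/0 terms as 0.
    for k in range(1, degree + 1):
        nxt = np.zeros(len(t) - 1 - k)
        for i in range(len(nxt)):
            left_den = t[i + k] - t[i]
            right_den = t[i + k + 1] - t[i + 1]
            left = (x - t[i]) / left_den * b[i] if left_den > 0 else 0.0
            right = ((t[i + k + 1] - x) / right_den * b[i + 1]
                     if right_den > 0 else 0.0)
            nxt[i] = left + right
        b = nxt
    return b

def fm_score(cat_ids, z, params):
    """Second-order FM score with one basis-encoded numerical field z."""
    # Categorical fields: one embedding row and one linear weight per value.
    vecs = [params["cat_emb"][f][i] for f, i in enumerate(cat_ids)]
    linear = sum(params["cat_w"][f][i] for f, i in enumerate(cat_ids))
    # Numerical field: its embedding and linear term are basis-weighted sums.
    n_basis = params["num_emb"].shape[0]
    basis = bspline_basis(z, n_basis=n_basis, degree=params["degree"])
    vecs.append(basis @ params["num_emb"])
    linear += basis @ params["num_w"]
    # Pairwise interactions: 0.5 * (||sum_i v_i||^2 - sum_i ||v_i||^2).
    V = np.stack(vecs)
    s = V.sum(axis=0)
    pairwise = 0.5 * (s @ s - (V * V).sum())
    return params["bias"] + linear + pairwise

# Randomly initialised toy parameters: two categorical fields, one numerical.
rng = np.random.default_rng(0)
dim, n_basis = 4, 8
params = {
    "bias": 0.1,
    "degree": 3,
    "cat_emb": [rng.normal(size=(3, dim)), rng.normal(size=(4, dim))],
    "cat_w": [rng.normal(size=3), rng.normal(size=4)],
    "num_emb": rng.normal(size=(n_basis, dim)),
    "num_w": rng.normal(size=n_basis),
}
score = fm_score([1, 2], 0.37, params)
```

In this sketch, fixing the categorical values (the segment) makes the score a constant plus a linear combination of the basis functions of `z`, with coefficients `num_w + num_emb @ (sum of the segment's embeddings)`. This mirrors the abstract's statement that the spanning coefficients vary between segments, and the only change to the FM graph is the two basis-weighted sums for the numerical field.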