Learning high-quality feature embeddings efficiently and effectively is critical for the performance of web-scale machine learning systems. A typical model ingests hundreds of features with vocabularies on the order of millions to billions of tokens. The standard approach is to represent each feature value as a d-dimensional embedding, which introduces hundreds of billions of parameters for extremely high-cardinality features. This bottleneck has led to substantial progress in alternative embedding algorithms. Many of these methods, however, assume that each feature uses an independent embedding table. This work introduces a simple yet highly effective framework, Feature Multiplexing, in which a single representation space is shared across many different categorical features. Our theoretical and empirical analysis reveals that multiplexed embeddings can be decomposed into components from each constituent feature, allowing models to distinguish between features. We show that multiplexed representations lead to Pareto-optimal parameter-accuracy tradeoffs on three public benchmark datasets. Further, we propose a highly practical approach called Unified Embedding with three major benefits: simplified feature configuration, strong adaptation to dynamic data distributions, and compatibility with modern hardware. Unified Embedding gives significant improvements in offline and online metrics compared to highly competitive baselines across five web-scale search, ads, and recommender systems, where it serves billions of users across the world in industry-leading products.
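The idea of one representation space shared across many categorical features can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how feature values are mapped into the shared space, so the salted hashing used here is an assumption, and the class name `SharedEmbedding`, the feature names, the vocabulary sizes, and the table size are all hypothetical. The sketch also compares parameter counts for independent tables versus one shared table.

```python
import hashlib

import numpy as np


class SharedEmbedding:
    """Hypothetical sketch: one embedding table shared by several features."""

    def __init__(self, num_rows, dim, feature_names, seed=0):
        rng = np.random.default_rng(seed)
        # A single (num_rows x dim) table serves every feature.
        self.table = rng.normal(scale=0.01, size=(num_rows, dim))
        self.num_rows = num_rows
        self.feature_names = list(feature_names)
        # Assumption: a per-feature salt, so the same raw value in two
        # different features usually lands in different rows.
        self.salts = {name: i for i, name in enumerate(self.feature_names)}

    def row_index(self, feature, value):
        key = f"{self.salts[feature]}:{value}".encode("utf-8")
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.num_rows

    def lookup(self, example):
        # Concatenate one embedding per feature, in a fixed feature order.
        rows = [self.row_index(f, example[f]) for f in self.feature_names]
        return np.concatenate([self.table[r] for r in rows])


# Hypothetical vocabulary sizes, chosen only to show the arithmetic.
vocab_sizes = {"query": 1_000_000_000, "country": 250}
dim = 4
independent_params = sum(vocab_sizes.values()) * dim  # one table per feature

emb = SharedEmbedding(num_rows=1000, dim=dim, feature_names=vocab_sizes)
shared_params = emb.table.size  # num_rows * dim, independent of vocabulary size

vec = emb.lookup({"query": "running shoes", "country": "US"})
print(vec.shape, independent_params, shared_params)
```

In this sketch the shared table's size is a free budget rather than a function of the vocabularies, which is what makes a parameter-accuracy tradeoff possible in the first place. Collisions between values, within and across features, are expected; how well a downstream model copes with them is the question the abstract's analysis addresses.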