Learnable embedding vectors are among the most important applications in machine learning and are widely used in various database-related domains. However, the high dimensionality of sparse data in recommendation tasks and the huge corpus volume in retrieval-related tasks make the embedding table consume a large amount of memory, which poses a great challenge to the training and deployment of models. Recent research has proposed various methods that compress the embeddings at the cost of a slight decrease in model quality or the introduction of other overheads. Nevertheless, the relative performance of these methods remains unclear: existing experimental comparisons cover only a subset of them and focus on limited metrics. In this paper, we perform a comprehensive comparative analysis and experimental evaluation of embedding compression. We introduce a new taxonomy that categorizes these techniques by their characteristics and methodologies, and we develop a modular benchmarking framework that integrates 14 representative methods. Under a uniform test environment, our benchmark fairly evaluates each approach, presents its strengths and weaknesses under different memory budgets, and recommends the best method for each use case. In addition to providing useful guidelines, our study uncovers the limitations of current methods and suggests potential directions for future research.