Learnable embedding vectors are among the most important applications of machine learning and are widely used in database-related domains. However, the high dimensionality of sparse data in recommendation tasks and the huge corpus volume in retrieval-related tasks make the embedding table consume a large amount of memory, which poses a great challenge to model training and deployment. Recent research has proposed various methods that compress embeddings at the cost of a slight decrease in model quality or the introduction of other overheads. Nevertheless, the relative performance of these methods remains unclear: existing experimental comparisons cover only a subset of them and focus on limited metrics. In this paper, we perform a comprehensive comparative analysis and experimental evaluation of embedding compression. We introduce a new taxonomy that categorizes these techniques by their characteristics and methodologies, and we develop a modular benchmarking framework that integrates 14 representative methods. Under a uniform test environment, our benchmark fairly evaluates each approach, presents its strengths and weaknesses under different memory budgets, and recommends the best method for each use case. In addition to providing useful guidelines, our study uncovers the limitations of current methods and suggests potential directions for future research.