Many modern machine learning (ML) methods rely on embedding models to learn vector representations (embeddings) for a set of entities (embedding tables). As increasingly diverse ML applications adopt embedding models, and as embedding tables continue to grow in size and number, there has been a surge in the ad-hoc development of specialized frameworks for training large embedding models for specific tasks. Although the scalability issues that arise across embedding model training tasks are similar, each of these frameworks independently reinvents and customizes storage components for its specific task, leading to substantial duplicated engineering effort in both development and deployment. This paper presents MLKV, an efficient, extensible, and reusable data storage framework designed to address the scalability challenges in embedding model training, specifically data stalls and staleness. MLKV augments disk-based key-value storage by democratizing optimizations that were previously exclusive to individual specialized frameworks, and it provides easy-to-use interfaces for embedding model training tasks. Extensive experiments on open-source workloads, as well as on applications in eBay's payment transaction risk detection and seller payment risk detection, show that MLKV outperforms offloading strategies built on top of industrial-strength key-value stores by 1.6-12.6x. MLKV is open-source at https://github.com/llm-db/MLKV.