Many modern machine learning (ML) methods rely on embedding models to learn vector representations (embeddings) for a set of entities (embedding tables). As increasingly diverse ML applications adopt embedding models, and as embedding tables continue to grow in size and number, there has been a surge in the ad-hoc development of specialized frameworks for training large embedding models for specific tasks. Although the scalability issues that arise across embedding model training tasks are similar, each of these frameworks independently reinvents and customizes storage components for its specific task, leading to substantial duplicated engineering effort in both development and deployment. This paper presents MLKV, an efficient, extensible, and reusable data storage framework designed to address the scalability challenges in embedding model training, specifically data stalls and staleness. MLKV augments disk-based key-value storage by democratizing optimizations that were previously exclusive to individual specialized frameworks, and it provides easy-to-use interfaces for embedding model training tasks. Extensive experiments on open-source workloads, as well as on applications in eBay's payment transaction risk detection and seller payment risk detection, show that MLKV outperforms offloading strategies built on top of industrial-strength key-value stores by 1.6-12.6x. MLKV is open-source at https://github.com/llm-db/MLKV.