Over the past two decades, columnar storage has revolutionized data warehousing and analytics. However, the rapid growth of machine learning poses new challenges for this domain. This paper presents Bullion, a columnar storage system tailored for machine learning workloads. Bullion addresses the complexities of data compliance, optimizes the encoding of long-sequence sparse features, efficiently manages wide-table projections, and introduces feature quantization in storage. By aligning with the evolving requirements of ML applications, Bullion extends columnar storage to scenarios ranging from advertising and recommendation systems to the expanding realm of generative AI. Preliminary experimental results and theoretical analysis indicate that Bullion handles the unique demands of machine learning workloads better than existing columnar storage solutions: it significantly reduces I/O costs for deletion compliance, achieves substantial storage savings with its optimized encoding scheme for sparse features, and drastically improves metadata parsing speed for wide-table projections. These advancements position Bullion as a critical component of future machine learning infrastructure, enabling organizations to efficiently manage and process the massive volumes of data required for training and inference in modern AI applications.