The rapid growth of artificial intelligence (AI) and machine learning (ML) applications has created a need for efficient storage of vector and tensor data. This paper presents a novel approach to tensor storage in a Lakehouse architecture built on Delta Lake. The approach applies the multidimensional array storage strategy of array databases, together with sparse encoding methods, to Delta Lake tables. Experiments show notable improvements in both space and time efficiency compared with traditional tensor serialization. These results offer guidance for designing and implementing optimized vector and tensor storage in data-intensive applications, and contribute to efficient data management practices for AI and ML in cloud-native environments.
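To make the idea of sparse encoding concrete, the following is a minimal sketch of coordinate-list (COO) encoding, in which only the nonzero entries of a tensor are kept as (coordinates, value) rows. It is an illustration under stated assumptions, not the encoding scheme evaluated in this paper: the function names `shape_of`, `to_coo_rows`, and `from_coo_rows` and the row layout with `coords` and `value` fields are hypothetical, and the step of writing the rows to a Delta Lake table is deliberately omitted.

```python
from itertools import product

def shape_of(tensor):
    """Return the shape of a regular, non-empty nested-list tensor."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0]
    return tuple(shape)

def to_coo_rows(tensor):
    """Encode a dense tensor as table-like rows, keeping only nonzero entries.

    The row layout ({"coords": [...], "value": ...}) is a hypothetical schema
    chosen for illustration.
    """
    shape = shape_of(tensor)
    rows = []
    for coords in product(*(range(n) for n in shape)):
        value = tensor
        for i in coords:
            value = value[i]
        if value != 0:
            rows.append({"coords": list(coords), "value": value})
    return shape, rows

def from_coo_rows(shape, rows):
    """Rebuild the dense nested-list tensor from its shape and COO rows."""
    def zeros(dims):
        if len(dims) == 1:
            return [0] * dims[0]
        return [zeros(dims[1:]) for _ in range(dims[0])]

    dense = zeros(shape)
    for row in rows:
        cell = dense
        for i in row["coords"][:-1]:
            cell = cell[i]
        cell[row["coords"][-1]] = row["value"]
    return dense

# A 2 x 2 x 3 tensor with two nonzero entries out of twelve.
tensor = [[[0, 0, 3], [0, 0, 0]], [[0, 7, 0], [0, 0, 0]]]
shape, rows = to_coo_rows(tensor)
restored = from_coo_rows(shape, rows)
```

In this sketch the twelve-element tensor reduces to two rows plus its shape, and the shape must be stored alongside the rows so that the dense form can be rebuilt. The space saving depends on how sparse the tensor is; for dense tensors the per-entry coordinates add overhead rather than removing it.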