The exponential growth of artificial intelligence (AI) and machine learning (ML) applications has created a need for efficient storage solutions for vector and tensor data. This paper presents a novel approach to tensor storage in a Lakehouse architecture using Delta Lake. The approach applies the multidimensional array storage strategy of array databases, together with sparse encoding methods, to Delta Lake tables. Experiments show notable improvements in both space and time efficiency compared with traditional serialization of tensors. These results provide insights for the development and implementation of optimized vector and tensor storage solutions in data-intensive applications, and contribute to efficient data management practices for AI and ML in cloud-native environments.
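To make the idea of sparse encoding concrete, the following is a minimal sketch, not the paper's implementation. It assumes a coordinate (COO) style encoding in which each non-zero element of a tensor becomes one row holding its index tuple and its value. The function names `encode_coo` and `decode_coo` are hypothetical, and the sketch uses plain Python lists rather than any Delta Lake API.

```python
import itertools


def encode_coo(flat_values, shape):
    """Return one (index_tuple, value) row per non-zero element.

    flat_values holds the tensor in row-major order (last index varies fastest).
    Hypothetical helper for illustration only.
    """
    coords = itertools.product(*(range(n) for n in shape))
    return [(idx, v) for idx, v in zip(coords, flat_values) if v != 0]


def decode_coo(rows, shape):
    """Rebuild the dense row-major list from the sparse rows."""
    size = 1
    for n in shape:
        size *= n
    flat = [0.0] * size
    for idx, v in rows:
        # Convert the multidimensional index back to a row-major offset.
        offset = 0
        for i, n in zip(idx, shape):
            offset = offset * n + i
        flat[offset] = v
    return flat


# A 4 x 5 x 6 tensor (120 elements) with only three non-zero entries.
shape = (4, 5, 6)
flat = [0.0] * 120
flat[7] = 1.5
flat[59] = -2.0
flat[119] = 3.25

rows = encode_coo(flat, shape)
restored = decode_coo(rows, shape)

print(len(flat), "dense elements ->", len(rows), "sparse rows")
print(rows)
```

In a table-based layout, each such row could be stored as a record, for example with columns for a tensor identifier, the index, and the value. That column layout is an assumption made here for illustration. Only the non-zero elements are stored, which is where the space saving for sparse tensors comes from.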