Storing and streaming high-dimensional data for foundation model training has become a critical requirement with the rise of foundation models beyond natural language. In this paper we introduce TensorBank, a petabyte-scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries. We use Hierarchical Statistical Indices (HSI) for query acceleration. Our architecture allows tensors to be addressed directly at block level using HTTP range reads. Once in GPU memory, data can be transformed using PyTorch transforms. We provide a generic PyTorch dataset type with a corresponding dataset factory that translates relational queries and requested transformations into a dataset instance. Because the HSI contain statistics on block contents at different hierarchical resolution levels, irrelevant blocks can be skipped without reading them. This is an opinionated architecture powered by open standards and making heavy use of open-source technology. Although hardened for production use on geospatial-temporal data, this architecture generalizes to other use cases such as computer vision, computational neuroscience, biological sequence analysis, and more.
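The block-skipping idea described in the abstract can be illustrated with a minimal sketch. It is not TensorBank's implementation: it assumes that each index node stores only the minimum and maximum of the values beneath it, that blocks have a fixed byte size, and that a query is a simple value interval. All names (`IndexNode`, `build_index`, `matching_blocks`, `range_header`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class IndexNode:
    """Hypothetical index node: min/max statistics for a span of blocks."""
    lo: float
    hi: float
    first_block: int
    last_block: int  # inclusive
    children: List["IndexNode"] = field(default_factory=list)


def build_index(block_stats: List[Tuple[float, float]], fanout: int = 2) -> IndexNode:
    """Build a min/max tree bottom-up from per-block (min, max) pairs."""
    if not block_stats:
        raise ValueError("need at least one block")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    # Leaves: one node per block.
    level = [IndexNode(lo, hi, i, i) for i, (lo, hi) in enumerate(block_stats)]
    # Each pass produces the next, coarser resolution level.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append(IndexNode(
                min(n.lo for n in group),
                max(n.hi for n in group),
                group[0].first_block,
                group[-1].last_block,
                group,
            ))
        level = parents
    return level[0]


def matching_blocks(node: IndexNode, q_lo: float, q_hi: float) -> List[int]:
    """Return ids of blocks whose [min, max] overlaps [q_lo, q_hi].

    A subtree is skipped entirely when its statistics rule out a match,
    so the blocks beneath it are never read.
    """
    if node.hi < q_lo or node.lo > q_hi:
        return []
    if not node.children:
        return [node.first_block]
    out: List[int] = []
    for child in node.children:
        out.extend(matching_blocks(child, q_lo, q_hi))
    return out


def range_header(block_id: int, block_size: int) -> str:
    """HTTP Range header value for one fixed-size block (end byte is inclusive)."""
    start = block_id * block_size
    return f"bytes={start}-{start + block_size - 1}"


# Four blocks with assumed (min, max) statistics.
index = build_index([(0, 10), (5, 20), (30, 40), (35, 50)])
blocks = matching_blocks(index, 32, 36)
headers = [range_header(b, 1024) for b in blocks]
```

In this example the query interval [32, 36] prunes the subtree covering blocks 0 and 1 at the parent level, so only blocks 2 and 3 are requested, with the header values `bytes=2048-3071` and `bytes=3072-4095`. A real deployment would pass such headers to an HTTP client reading from the object store; that step is omitted here.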