Storing and streaming high-dimensional data for foundation model training has become a critical requirement with the rise of foundation models beyond natural language. In this paper we introduce TensorBank, a petabyte-scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries. We use Hierarchical Statistical Indices (HSI) for query acceleration. Our architecture allows tensors to be addressed directly at the block level using HTTP range reads. Once in GPU memory, data can be transformed using PyTorch transforms. We provide a generic PyTorch dataset type with a corresponding dataset factory that translates relational queries and requested transformations into dataset instances. By making use of the HSI, irrelevant blocks can be skipped without being read, as these indices contain statistics on their content at different hierarchical resolution levels. This is an opinionated architecture powered by open standards that makes heavy use of open-source technology. Although hardened for production use with geospatial-temporal data, this architecture generalizes to other use cases such as computer vision, computational neuroscience, biological sequence analysis, and more.
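The block-skipping idea described above can be sketched as follows. This is a minimal illustration, not the TensorBank API: all names (`BlockStats`, `plan_range_reads`) are hypothetical, and the index here is a flat list of per-block min/max statistics rather than a full hierarchical index. It shows how an HSI-style index lets a reader plan HTTP range reads that touch only blocks whose statistics overlap a query predicate, skipping all other blocks without reading them.

```python
# Hedged sketch (hypothetical names, not the TensorBank API): prune
# block-level range reads against per-block statistics, analogous to
# how an HSI skips irrelevant blocks in COS.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BlockStats:
    offset: int      # byte offset of the block in the object (for a range read)
    length: int      # byte length of the block
    min_val: float   # block-level statistics kept in the index
    max_val: float

def plan_range_reads(index: List[BlockStats],
                     lo: float, hi: float) -> List[Tuple[int, int]]:
    """Return (offset, length) range reads for blocks whose statistics
    overlap the query interval [lo, hi]; all other blocks are skipped
    without being read."""
    return [(b.offset, b.length) for b in index
            if b.max_val >= lo and b.min_val <= hi]

# Toy index over four 1 KiB blocks; a query for values in [5, 9]
# should touch only the two overlapping blocks.
index = [
    BlockStats(0,    1024,  0.0,  4.0),
    BlockStats(1024, 1024,  3.0,  8.0),
    BlockStats(2048, 1024,  8.5, 12.0),
    BlockStats(3072, 1024, 20.0, 30.0),
]
reads = plan_range_reads(index, 5.0, 9.0)
# reads -> [(1024, 1024), (2048, 1024)]: blocks 1 and 2 only
```

In a real deployment, each `(offset, length)` pair would become an HTTP range read against the object store, and a dataset factory could wrap the resulting blocks in a PyTorch dataset that applies the requested transforms.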