Lakestream: A Consistent and Brokerless Data Plane for Large Foundation Model Training

Modern Large Foundation Model (LFM) training has transformed the data pipeline from a static ingestion layer into a dynamic component that must co-evolve with the training process. Existing systems are ill-equipped: colocated dataloaders offer no failure isolation, while message queue-based disaggregated dataloaders operate on a record/offset abstraction that cannot express the batch-level semantics required by distributed training. We present Lakestream, a brokerless, object-store-native training data plane with three key properties. First, it introduces the Transactional Global Batch (TGB), which builds on lakehouse-style ACID storage semantics and extends them with training-specific consistency, including atomic all-rank batch visibility, a globally ordered step sequence, checkpoint-aligned lifecycle management, and end-to-end exactly-once recovery. Second, it realizes recovery and retention directly in the storage layer, by inlining producer state in the manifest and tying reclamation to distributed checkpoint state. Third, its Decentralized Adaptive Commit (DAC) algorithm sustains stable ingestion throughput as the manifest grows, without any inter-producer communication. Evaluations on large-scale multimodal pre-training and SFT workloads using 64 GPUs show that Lakestream outperforms colocated dataloader throughput while providing full failure isolation, outperforms Apache Kafka in ingestion throughput, and achieves lower consumer read latency than Kafka.
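The abstract does not describe how a Transactional Global Batch is committed, so the following is only a minimal sketch of one way atomic all-rank visibility, a globally ordered step sequence, and producer state inlined in the manifest could fit together without a broker. Everything in it is an assumption made for illustration: the in-memory store with a conditional write, the manifest layout, and the names `InMemoryObjectStore`, `commit_global_batch`, and `read_shard` are hypothetical and are not taken from Lakestream. It also does not attempt to model the Decentralized Adaptive Commit algorithm.

```python
from typing import Dict, Optional, Tuple

MANIFEST_KEY = "manifest"  # hypothetical object key, not from the paper


class InMemoryObjectStore:
    """Toy stand-in for an object store that supports conditional writes."""

    def __init__(self) -> None:
        self._objects: Dict[str, Tuple[int, dict]] = {}

    def get(self, key: str) -> Tuple[int, Optional[dict]]:
        # Version 0 means the key does not exist yet.
        return self._objects.get(key, (0, None))

    def put_if_version(self, key: str, value: dict, expected: int) -> bool:
        # Compare-and-swap: succeed only if nobody wrote since we read.
        current, _ = self.get(key)
        if current != expected:
            return False
        self._objects[key] = (current + 1, value)
        return True


def commit_global_batch(store, producer_id, seq, rank_shards, world_size,
                        max_retries=8):
    """Publish one global batch; return its step index, or None if it was
    already committed (so a producer retry after a crash is a no-op)."""
    if set(rank_shards) != set(range(world_size)):
        raise ValueError("a global batch needs exactly one shard per rank")
    for _ in range(max_retries):
        version, manifest = store.get(MANIFEST_KEY)
        manifest = manifest or {"steps": [], "producers": {}}
        # Producer state lives in the manifest itself (assumed layout).
        if manifest["producers"].get(producer_id, -1) >= seq:
            return None
        new_manifest = {
            # Appending one entry makes all ranks' shards visible together.
            "steps": manifest["steps"] + [dict(rank_shards)],
            "producers": {**manifest["producers"], producer_id: seq},
        }
        if store.put_if_version(MANIFEST_KEY, new_manifest, version):
            return len(new_manifest["steps"]) - 1
        # Lost the race to another producer: re-read and try again.
    raise RuntimeError("commit did not succeed within max_retries")


def read_shard(store, step, rank):
    """Return the shard key for (step, rank), or None if not yet committed."""
    _, manifest = store.get(MANIFEST_KEY)
    if manifest is None or step >= len(manifest["steps"]):
        return None
    return manifest["steps"][step][rank]
```

In this sketch a step becomes readable by every rank at the same instant because the only write that exposes it is the single conditional update of the manifest, and producers coordinate only through that object rather than with each other.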

