Machine learning training pipelines consume data in batches. A single training step may require thousands of samples drawn from shards distributed across a storage cluster. Issuing thousands of individual GET requests incurs per-request overhead that often dominates data transfer time. To address this, we introduce GetBatch, a new object store API that elevates batch retrieval to a first-class storage operation, replacing many independent GET operations with a single deterministic, fault-tolerant streaming execution. Compared to individual GET requests, GetBatch achieves up to 15x higher throughput for small objects and, in a production training workload, reduces P95 batch retrieval latency by 2x and P99 per-object tail latency by 3.7x.