Data pre-processing pipelines are the bread and butter of any successful AI project. We introduce a novel programming model for pipelines in a data lakehouse, allowing users to interact declaratively with assets in object storage. Motivated by real-world industry usage patterns, we exploit these new abstractions with a columnar and differential cache to maximize iteration speed for data scientists, who spend most of their time in pre-processing: adding or removing features, restricting or relaxing time windows, and wrangling current or older datasets. We show how the new cache works transparently across programming languages, schemas, and time windows, and provide preliminary evidence of its efficiency on standard data workloads.