Modern data analytics workloads combine relational data processing with machine learning (ML). Most DBMSs handle these workloads by offloading the ML operations to external, specialized ML systems. While both DBMSs and ML systems go to great lengths to optimize performance for their specific workloads, significant performance is lost when they are used in combination, due to data movement across system boundaries, conversions between incompatible internal data formats, and the lack of cross-system optimizations. A key idea for removing these bottlenecks is to integrate existing data manipulation systems with ML systems by building a common intermediate layer (IR). Although this idea has been explored before (Weld, Delite), previous attempts require significant re-engineering of existing systems and still fall short of best-of-breed performance on individual tasks (e.g., SQL, deep learning). Specifically, they re-implement existing systems using a generic set of operators, and they fail to match best-of-breed individual performance because compiler analysis cannot recover high-level optimizations from this generic IR. We present Flern, the first intermediate-layer integration of DB and ML systems that is best-of-breed on each side individually: it is competitive with the best compiled query engines, such as HyPer, on comprehensive relational benchmarks (TPC-H), and competitive with TensorFlow and PyTorch on state-of-the-art ML models (e.g., DeepSpeech, SqueezeNet, Transformers). It also represents a new state of the art for integration. A key realization is to architect intermediate layers around generative programming capabilities, which preserve high-level contextual information for cross-system optimizations and enable the construction of a variety of complex structures and cross-system optimizations with minimal effort.