OpenMLDB: A Real-Time Relational Data Feature Computation System for Online ML

Efficient and consistent feature computation is crucial for a wide range of online ML applications. Typically, feature computation is divided into two distinct phases: an offline stage for model training and an online stage for model serving. These phases often rely on execution engines with different interface languages and function implementations, which causes significant inconsistencies between the features computed offline and online. Moreover, many online ML features involve complex time-series computations (e.g., functions over varied-length table windows) that differ from standard streaming and analytical queries. Existing data processing systems (e.g., Spark, Flink, DuckDB) often incur multi-second latencies for these computations, making them unsuitable for real-time online ML applications that demand timely feature updates. This paper presents OpenMLDB, a feature computation system deployed in 4Paradigm's SageOne platform and in over 100 real-world scenarios. Technically, OpenMLDB first employs a unified query plan generator to produce consistent computation results across the offline and online stages, significantly reducing feature deployment overhead. Second, OpenMLDB provides an online execution engine that resolves the performance bottlenecks caused by long window computations (via pre-aggregation) and multi-table window unions (via data self-adjusting). It also provides a high-performance offline execution engine with window parallel optimization and time-aware data skew resolution. Third, OpenMLDB features a compact data format and stream-focused indexing to make efficient use of memory and accelerate data access. Evaluations on test and real workloads show significant performance improvements and resource savings compared to the baseline systems. The open community of OpenMLDB now has over 150 contributors and has gained 1.6k stars on GitHub.
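To illustrate the general idea behind pre-aggregation for long windows, the following is a minimal sketch and not OpenMLDB's actual implementation. It assumes integer timestamps, rows arriving in timestamp order, a fixed bucket size, and a sum aggregate; the class name `PreAggregatedSum` and its methods are hypothetical. Partial sums are maintained per time bucket at insert time, so a long window query only scans raw rows at the two window edges and combines stored partial sums for the buckets that lie fully inside the window.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class PreAggregatedSum:
    """Hypothetical helper: raw rows plus per-bucket partial sums."""

    def __init__(self, bucket_size):
        self.bucket_size = bucket_size
        self.timestamps = []                   # event times, assumed sorted
        self.values = []                       # values aligned with timestamps
        self.bucket_sums = defaultdict(float)  # bucket id -> partial sum

    def insert(self, ts, value):
        # Assumption: rows arrive in non-decreasing timestamp order.
        self.timestamps.append(ts)
        self.values.append(value)
        # Bucket b covers timestamps [b * size, (b + 1) * size - 1].
        self.bucket_sums[ts // self.bucket_size] += value

    def _raw_sum(self, start, end):
        # Scan raw rows with start <= ts <= end (empty if start > end).
        lo = bisect_left(self.timestamps, start)
        hi = bisect_right(self.timestamps, end)
        return sum(self.values[lo:hi])

    def window_sum(self, start, end):
        """Sum of values with start <= ts <= end."""
        size = self.bucket_size
        first_full = -(-start // size)       # first bucket starting at or after start
        last_full = (end + 1) // size - 1    # last bucket ending at or before end
        if first_full > last_full:
            # The window contains no complete bucket: fall back to raw rows.
            return self._raw_sum(start, end)
        total = sum(self.bucket_sums.get(b, 0.0)
                    for b in range(first_full, last_full + 1))
        total += self._raw_sum(start, first_full * size - 1)     # head edge
        total += self._raw_sum((last_full + 1) * size, end)      # tail edge
        return total


if __name__ == "__main__":
    agg = PreAggregatedSum(bucket_size=10)
    for ts in range(100):
        agg.insert(ts, float(ts))
    brute_force = sum(float(ts) for ts in range(5, 95))
    print(agg.window_sum(5, 94) == brute_force)
```

In this sketch the per-query cost of scanning raw rows is bounded by roughly two buckets' worth of data regardless of window length; the trade-off is the extra memory and insert-time work for the partial sums. A production system would also need to handle out-of-order data, deletions, and aggregates that are not simply additive.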