Mojo is an emerging programming language built on MLIR (Multi-Level Intermediate Representation) that supports JIT (Just-in-Time) compilation. It enables transparent hardware-specific optimizations (e.g., for CPUs and GPUs) while letting users express their logic in a user-friendly, Python-like syntax. Mojo has demonstrated strong performance on tensor operations; however, its capabilities for the relational operations common in data science workflows (e.g., filtering, join, and group-by aggregation) remain unexplored. To date, no dataframe implementation exists in the Mojo ecosystem. In this paper, we introduce MojoFrame, the first Mojo-native dataframe library, which supports core relational operations and user-defined functions (UDFs). MojoFrame is built on top of Mojo's tensor to achieve fast operations on numeric columns, and it uses a cardinality-aware approach to integrate non-numeric columns for flexible data representation. To achieve high efficiency, MojoFrame takes approaches that differ significantly from those of existing libraries. We show that MojoFrame supports all operations required by the TPC-H queries and by a selection of TPC-DS queries with promising performance, achieving up to a 4.60x speedup over existing dataframe libraries in other programming languages. Nevertheless, optimization opportunities remain for MojoFrame (and for the Mojo language), particularly in in-memory data representation and dictionary operations.
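To make the idea of a cardinality-aware treatment of non-numeric columns concrete, the following is a minimal sketch in Python, not MojoFrame's actual implementation. It assumes one plausible reading of the term: a string column with few distinct values is dictionary-encoded into integer codes (which could then live alongside numeric data in a tensor), while a high-cardinality column is kept as raw values. The names `encode_column` and `filter_equals` and the threshold `CARDINALITY_THRESHOLD` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical cutoff (distinct values / rows); the abstract does not state one.
CARDINALITY_THRESHOLD = 0.5

Encoded = Tuple[str, list, List[str]]


def encode_column(values: List[str],
                  threshold: float = CARDINALITY_THRESHOLD) -> Encoded:
    """Dictionary-encode a column if its cardinality ratio is low enough.

    Returns ("dict", codes, dictionary) for low-cardinality columns and
    ("raw", values, []) otherwise.
    """
    if not values:
        return ("raw", [], [])

    # Assign each distinct value the next free integer code, in first-seen order.
    code_of: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        codes.append(code_of.setdefault(v, len(code_of)))

    if len(code_of) / len(values) <= threshold:
        # dictionary[i] is the string whose code is i.
        dictionary = sorted(code_of, key=code_of.get)
        return ("dict", codes, dictionary)

    # High cardinality: encoding would save little, so keep the raw strings.
    return ("raw", list(values), [])


def filter_equals(encoded: Encoded, target: str) -> List[int]:
    """Return the row indices whose value equals `target`."""
    kind, data, dictionary = encoded
    if kind == "dict":
        if target not in dictionary:
            return []
        # One string lookup, then integer comparisons over the codes.
        code = dictionary.index(target)
        return [i for i, c in enumerate(data) if c == code]
    return [i for i, v in enumerate(data) if v == target]


if __name__ == "__main__":
    flags = encode_column(["A", "B", "A", "A"])
    print(flags[0], filter_equals(flags, "A"))
```

Under this assumption, a filter on a low-cardinality column reduces to integer comparisons, which is the kind of operation a tensor-backed representation handles well; the actual encoding and threshold used by MojoFrame may differ.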