Data lakes spend a significant fraction of query execution time scanning data from remote storage. Decoding alone accounts for 46% of runtime when running TPC-H directly on Parquet files. To address this bottleneck, we propose a vision for a cloud data processing SmartNIC that sits on the network datapath of compute nodes and offloads decoding and pushed-down operators, effectively hiding the cost of querying raw files. Our experimental estimates with DuckDB suggest that, by operating directly on pre-filtered data as delivered by a SmartNIC, significantly smaller CPUs can still match the query throughput of traditional setups.