Data silos create barriers to accessing and using data dispersed across networks. Sharing data directly tends to suffer from long download times, single points of failure, and untraceable data usage. In this paper, we present Minerva, a peer-to-peer cross-cluster data query system based on the InterPlanetary File System (IPFS). Minerva uses distributed hash table (DHT) lookups to pinpoint the locations that store content chunks. We theoretically model the DHT query delay and introduce a fat Merkle tree structure together with DHT caching to reduce it. We design query plans for read and write operations on top of Apache Drill that enable collaborative queries across decentralized workers. We conduct comprehensive experiments on Minerva, and the results show that Minerva achieves a query speedup of up to $2.08 \times$ over the original IPFS data query and completes data analysis queries in Internet-like environments with an average latency of $0.615$ seconds. With collaborative query, Minerva achieves a speedup of up to $1.39 \times$ over centralized query with raw data shipment.