Current graph systems can easily process billion-scale data, but their performance degrades dramatically once the data grows beyond the hundred-billion scale. Time series data is typically very large, so computation on time series graphs remains challenging. In this work, we introduce SharkGraph, a time series graph system built on a distributed file system (DFS) that uses a novel storage structure, the Time Series Graph Data File (TGF). By iterating graph computations over file streams, SharkGraph can execute batch graph queries, simulation, data mining, or clustering algorithms on industrial graphs with more than a hundred billion edges. Well-defined experiments show that SharkGraph performs well on large-scale graph processing, supports time traversal of graphs, and can recover the graph state at any position in the timeline. By repeating experiments reported for existing distributed systems such as GraphX, we demonstrate that SharkGraph can easily handle hundred-billion-scale data, whereas GraphX encountered many problems, such as memory issues and skewed distribution during graph traversal. Compared with other graph systems, SharkGraph uses less memory and processes the same graph more efficiently.
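To make the idea of iterating a graph computation over a file stream concrete, the following is a minimal sketch and not SharkGraph's implementation. It assumes a hypothetical comma-separated edge record `src,dst,timestamp`, which is not the TGF layout, and the names `stream_edges`, `connected_components`, and `open_demo_stream` are illustrative only. Each pass re-reads the edge stream from the start and keeps only per-vertex labels in memory. Filtering on the timestamp recovers the graph state as of a chosen point in the timeline.

```python
import io
from typing import Callable, Dict, Iterator, TextIO, Tuple

# Hypothetical record layout (NOT the TGF format): (src, dst, timestamp).
Edge = Tuple[int, int, int]


def stream_edges(handle: TextIO, as_of: int) -> Iterator[Edge]:
    """Yield edges whose timestamp is <= as_of, reading one line at a time."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        src, dst, ts = (int(field) for field in line.split(","))
        if ts <= as_of:
            yield src, dst, ts


def connected_components(open_stream: Callable[[], TextIO], as_of: int) -> Dict[int, int]:
    """Min-label propagation over an edge stream, treating edges as undirected.

    Every pass reopens the stream and scans it once; only the vertex labels
    are held in memory. The loop stops after a pass that changes no label.
    """
    labels: Dict[int, int] = {}
    changed = True
    while changed:
        changed = False
        with open_stream() as handle:
            for src, dst, _ in stream_edges(handle, as_of):
                a = labels.setdefault(src, src)
                b = labels.setdefault(dst, dst)
                low = min(a, b)
                if a != low:
                    labels[src] = low
                    changed = True
                if b != low:
                    labels[dst] = low
                    changed = True
    return labels


# In-memory stand-in for a file on a distributed file system (assumption).
EDGE_DATA = "1,2,10\n2,3,20\n4,5,30\n3,4,40\n"


def open_demo_stream() -> TextIO:
    return io.StringIO(EDGE_DATA)


if __name__ == "__main__":
    # State of the graph at time 30: the edge (3, 4) does not exist yet.
    print(connected_components(open_demo_stream, as_of=30))
    # State at time 40: the two components have merged.
    print(connected_components(open_demo_stream, as_of=40))
```

In this sketch memory grows with the number of vertices and not with the number of edges, which is the property a stream-based design relies on. A production system would additionally partition the stream across workers and avoid rescanning edges outside the requested time range.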