Blockchain systems suffer from high storage costs because every node must store and maintain the entire blockchain data. Our investigation of Ethereum's storage shows that most of the cost comes from the index, i.e., the Merkle Patricia Trie (MPT). To support provenance queries, MPT persists index nodes during data updates, which incurs substantial storage overhead. A natural first idea for reducing the storage size is to adopt the emerging learned index technique, which has been shown to offer a smaller index size and more efficient query performance. However, applying it directly to blockchain storage results in even higher overhead, owing to the requirement of persisting index nodes and the large node size of learned indexes. To tackle this, we propose COLE, a novel column-based learned storage for blockchain systems. Following the column-based database design, COLE stores each state's historical values contiguously and indexes them with learned models to facilitate efficient data retrieval and provenance queries. We develop a series of write-optimized strategies to realize COLE in disk environments. Extensive experiments validate the performance of COLE. Compared with MPT, COLE reduces the storage size by up to 94% while improving the system throughput by $1.4\times$ to $5.4\times$.
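To make the column-based idea concrete, the following is a minimal in-memory sketch, not COLE's actual implementation. The abstract does not specify the key layout or the model type, so everything here is an assumption made for illustration: the class name `LearnedColumn`, the compound integer key `address * max_height + block_height`, and the use of a single least-squares line with a measured maximum error as the "learned model". Sorting by this compound key places each state's versions next to each other, so both a point lookup and a provenance (history) query reduce to a model prediction followed by a search inside a small error window.

```python
import bisect


class LearnedColumn:
    """Toy column: versions of each state stored contiguously, located by a linear model."""

    def __init__(self, entries, max_height=1 << 20):
        # entries: iterable of (address, block_height, value); all names are hypothetical.
        self.max_height = max_height
        rows = sorted((self._compound(a, h), v) for a, h, v in entries)
        if not rows:
            raise ValueError("at least one entry is required")
        self.keys = [k for k, _ in rows]
        self.values = [v for _, v in rows]
        self._fit()

    def _compound(self, address, height):
        # Assumed key layout: versions of one address are adjacent, ordered by height.
        assert 0 <= height < self.max_height
        return address * self.max_height + height

    def _predict(self, key):
        return self.slope * key + self.intercept

    def _fit(self):
        # Least-squares line mapping key -> position, plus its maximum observed error.
        n = len(self.keys)
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        self.err = max(abs(self._predict(k) - p) for p, k in enumerate(self.keys))

    def _upper_bound(self, q):
        # Index of the first key > q, searching only the model's error window.
        n = len(self.keys)
        guess = self._predict(q)
        lo = max(0, min(n, int(guess - self.err) - 1))
        hi = max(0, min(n, int(guess + self.err) + 3))
        return bisect.bisect_right(self.keys, q, lo, hi)

    def get(self, address, height):
        """Latest value of `address` written at or before `height`, else None."""
        i = self._upper_bound(self._compound(address, height)) - 1
        if i >= 0 and self.keys[i] >= self._compound(address, 0):
            return self.values[i]
        return None

    def history(self, address, lo_height, hi_height):
        """Provenance query: (height, value) pairs of `address` in [lo_height, hi_height]."""
        start = self._upper_bound(self._compound(address, lo_height) - 1)
        end = self._upper_bound(self._compound(address, hi_height))
        return [(self.keys[i] % self.max_height, self.values[i]) for i in range(start, end)]


col = LearnedColumn([(1, 1, 10), (2, 1, 5), (1, 3, 20), (2, 4, 7), (1, 7, 30)])
print(col.get(1, 5))         # value of state 1 as of block 5
print(col.history(1, 2, 7))  # all versions of state 1 in blocks 2..7
```

Because the fitted line is non-decreasing over sorted keys, the true position of any query key lies within the measured error of the prediction, so the bounded search returns the same result as a search over the whole column. A real system would need multiple models, on-disk layout, and update handling, none of which this sketch attempts.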