Blockchain systems suffer from high storage costs because every node must store and maintain the entire blockchain data. After investigating Ethereum's storage, we find that most of the storage cost comes from the index, namely the Merkle Patricia Trie (MPT), which is used to guarantee data integrity and support provenance queries. To reduce the index storage overhead, a natural first idea is to use the emerging learned index technique, which has been shown to yield smaller indexes and more efficient queries. However, applying it directly to blockchain storage results in even higher overhead, owing to the blockchain's persistence requirement and the learned index's large node size. Moreover, existing learned indexes are designed for in-memory databases, whereas blockchain systems require disk-based storage and feature frequent data updates. To address these challenges, we propose COLE, a novel column-based learned storage for blockchain systems. We follow the column-based database design to store each state's historical values contiguously, and index them with learned models to facilitate efficient data retrieval and provenance queries. We develop a series of write-optimized strategies to realize COLE in disk environments. Extensive experiments validate the performance of COLE: compared with MPT, it reduces the storage size by up to 94% while improving the system throughput by 1.4× to 5.4×.
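To make the column-based idea concrete, the following is a minimal in-memory sketch, not COLE's actual implementation. It assumes a compound key that packs a state identifier with a block height, so that all versions of one state sit contiguously in sorted order, and it indexes those keys with a simple error-bounded piecewise-linear model. The names `compound_key`, `build_segments`, `ColumnStore`, `get_at`, `history`, and the error bound `EPSILON` are hypothetical, introduced only for illustration. The disk layout, the write-optimized strategies, and the integrity proofs mentioned above are all omitted.

```python
from bisect import bisect_left, bisect_right

EPSILON = 4  # assumed maximum model error, in positions (illustrative value)


def compound_key(state_id: int, block_height: int) -> int:
    """Pack (state_id, block_height) so versions of one state sort together."""
    return (state_id << 32) | block_height


def build_segments(keys, epsilon=EPSILON):
    """Greedily fit linear segments (first_key, slope, first_pos).

    Each segment keeps a range of feasible slopes and is extended while some
    slope still predicts every covered key within `epsilon` positions.
    Assumes `keys` is strictly increasing.
    """
    segments, i, n = [], 0, len(keys)
    while i < n:
        k0, p0 = keys[i], i
        lo, hi = 0.0, float("inf")
        j = i + 1
        while j < n:
            dx = keys[j] - k0
            new_lo = max(lo, (j - p0 - epsilon) / dx)
            new_hi = min(hi, (j - p0 + epsilon) / dx)
            if new_lo > new_hi:  # no single slope fits: close the segment
                break
            lo, hi = new_lo, new_hi
            j += 1
        slope = lo if hi == float("inf") else (lo + hi) / 2
        segments.append((k0, slope, p0))
        i = j
    return segments


class ColumnStore:
    """Sorted column of (compound key, value) pairs with a learned index."""

    def __init__(self, writes):
        # writes: iterable of (state_id, block_height, value), assumed to have
        # unique (state_id, block_height) pairs.
        rows = sorted((compound_key(s, h), v) for s, h, v in writes)
        self.keys = [k for k, _ in rows]
        self.values = [v for _, v in rows]
        self.segments = build_segments(self.keys)
        self.first_keys = [seg[0] for seg in self.segments]

    def _lower_bound(self, key):
        """Position of the first stored key >= key, via model plus local search."""
        j = bisect_right(self.first_keys, key) - 1
        if j < 0:
            return 0
        first_key, slope, first_pos = self.segments[j]
        end_pos = (self.segments[j + 1][2] if j + 1 < len(self.segments)
                   else len(self.keys))
        pred = round(first_pos + slope * (key - first_key))
        pred = min(max(pred, first_pos), end_pos)
        lo = max(first_pos, pred - EPSILON - 1)
        hi = min(end_pos, pred + EPSILON + 1)
        return bisect_left(self.keys, key, lo, hi)

    def get_at(self, state_id, block_height):
        """Latest value of a state written at or before `block_height`."""
        pos = self._lower_bound(compound_key(state_id, block_height + 1)) - 1
        if pos >= 0 and self.keys[pos] >> 32 == state_id:
            return self.values[pos]
        return None

    def history(self, state_id, h_lo, h_hi):
        """Provenance query: (height, value) pairs with h_lo <= height <= h_hi."""
        a = self._lower_bound(compound_key(state_id, h_lo))
        b = self._lower_bound(compound_key(state_id, h_hi + 1))
        return [(self.keys[i] & 0xFFFFFFFF, self.values[i]) for i in range(a, b)]


demo = ColumnStore([(1, 10, "a"), (1, 25, "b"), (2, 5, "x"),
                    (1, 40, "c"), (2, 30, "y")])
print(demo.get_at(1, 30))       # latest version of state 1 as of block 30
print(demo.history(1, 10, 30))  # versions of state 1 in blocks 10..30
```

Because versions of the same state are adjacent in the sorted column, a provenance query reduces to two index lookups and one contiguous scan. In this sketch the model only narrows the search to a small window around its prediction, so the index stores one short tuple per segment rather than one entry per key.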