Parser-based log compression, which separates static templates from dynamic variables, is a promising approach to exploit the unique structure of log data. However, its performance on complex production logs is often unsatisfactory. This performance gap coincides with a known degradation in the accuracy of its core log parsing component on such data, motivating our investigation into a foundational yet unverified question: does higher parsing accuracy necessarily lead to a better compression ratio? To answer this, we conduct the first empirical study quantifying this relationship and find that a higher parsing accuracy does not guarantee a better compression ratio. Instead, our findings reveal that compression ratio is dictated by achieving effective pattern-based grouping and encoding, i.e., the partitioning of tokens into low-entropy, highly compressible groups. Guided by this insight, we design DeLog, a novel log compressor that implements a Pattern Signature Synthesis mechanism to achieve efficient pattern-based grouping. On 16 public and 10 production datasets, DeLog achieves state-of-the-art compression ratio and speed.
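To make the notion of pattern-based grouping concrete, the following is a minimal Python sketch and not DeLog's Pattern Signature Synthesis mechanism, which is not described here. It assumes a hypothetical `signature` function that collapses digit runs to `D` and letter runs to `L`, and it assumes log lines whose tokens are separated by single spaces. Tokens that share a signature are stored together, so each group holds similar-looking values that a general-purpose compressor could then encode separately.

```python
import re
from collections import defaultdict


def signature(token: str) -> str:
    """Hypothetical signature: digit runs become 'D', letter runs become 'L'.

    Punctuation is kept, so '10.0.0.1' maps to 'D.D.D.D'.
    """
    return re.sub(
        r"[0-9]+|[A-Za-z]+",
        lambda m: "D" if m.group()[0].isdigit() else "L",
        token,
    )


def group_by_signature(lines):
    """Partition tokens into per-signature groups.

    Returns the groups plus a per-line layout of signatures, which is
    what allows the original lines to be rebuilt later.
    """
    groups = defaultdict(list)
    layout = []
    for line in lines:
        sigs = []
        for tok in line.split(" "):
            sig = signature(tok)
            groups[sig].append(tok)
            sigs.append(sig)
        layout.append(sigs)
    return dict(groups), layout


def rebuild(groups, layout):
    """Reverse group_by_signature by consuming each group in order."""
    cursors = {sig: iter(toks) for sig, toks in groups.items()}
    return [" ".join(next(cursors[sig]) for sig in sigs) for sigs in layout]


# Illustrative input only; these lines are made up for the sketch.
lines = [
    "INFO conn 10.0.0.1 took 12ms",
    "WARN conn 10.0.0.2 took 7ms",
]
groups, layout = group_by_signature(lines)
```

In this sketch the two IP addresses land in the same `D.D.D.D` group and the two durations in the same `DL` group, with no template extraction step at all, which is one way to picture why grouping quality and parsing accuracy can diverge.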