Parser-based log compression, which separates static templates from dynamic variables, is a promising way to exploit the unique structure of log data. However, its performance on complex production logs is often unsatisfactory. This performance gap coincides with a known drop in the accuracy of its core log-parsing component on such data, which motivates our investigation of a foundational yet unverified question: does higher parsing accuracy necessarily lead to a better compression ratio? To answer it, we conduct the first empirical study that quantifies this relationship and find that higher parsing accuracy does not guarantee a better compression ratio. Instead, our findings show that the compression ratio is determined by how effectively tokens are grouped and encoded by pattern, i.e., by partitioning tokens into low-entropy, highly compressible groups. Guided by this insight, we design DeLog, a novel log compressor that implements a Pattern Signature Synthesis mechanism to achieve efficient pattern-based grouping. On 16 public and 10 production datasets, DeLog achieves state-of-the-art compression ratio and speed.