The prevailing "parse-then-compress" paradigm in log compression limits compression effectiveness by treating log parsing and compression as isolated objectives. While parsers prioritize semantic accuracy (i.e., event identification), they often obscure deep correlations between static templates and dynamic variables that are critical for storage efficiency. In this paper, we investigate this misalignment through a comprehensive empirical study and propose LogPrism, a framework that bridges the gap via unified redundancy encoding. Rather than relying on a rigid pre-parsing step, LogPrism dynamically integrates structural extraction with variable encoding by constructing a Unified Redundancy Tree (URT). This hierarchical approach mines "structure+variable" co-occurrence patterns, capturing deep contextual redundancies while accelerating processing through pre-emptive pattern encoding. Extensive experiments on 16 benchmark datasets confirm that LogPrism establishes a new state of the art. It achieves the highest compression ratio on 13 datasets, surpassing leading baselines by margins of 4.7% to 80.9%, while delivering superior throughput of 29.87 MB/s (1.68$\times$ to 43.04$\times$ faster than competitors). Moreover, when configured in single-archive mode to maximize global pattern discovery, LogPrism outperforms the best baseline by 19.39% in compression ratio while maintaining a 2.62$\times$ speed advantage.
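To make the idea of "structure+variable" co-occurrence concrete, the following toy sketch counts token prefixes over raw log lines and replaces the longest frequent prefix of each line with a pattern ID. It is an illustration only and is not LogPrism's URT construction: the function names (`mine_prefix_patterns`, `encode`, `decode`), the whitespace tokenization, and the `min_count` threshold are all assumptions introduced here. The point it shows is that a frequent prefix can absorb a recurring variable value (such as an IP address) together with the static template tokens, which a parser that separates templates from variables would store apart.

```python
from collections import Counter


def mine_prefix_patterns(lines, min_count=2):
    """Count every token prefix; keep those seen at least min_count times.

    Returns a dict mapping each frequent prefix (a tuple of tokens) to an ID.
    Hypothetical helper for illustration, not the paper's algorithm.
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(1, len(tokens) + 1):
            counts[tuple(tokens[:i])] += 1
    frequent = sorted(p for p, c in counts.items() if c >= min_count)
    return {prefix: pid for pid, prefix in enumerate(frequent)}


def encode(line, patterns):
    """Replace the longest frequent prefix with its ID; keep the rest verbatim."""
    tokens = line.split()
    for i in range(len(tokens), 0, -1):
        pid = patterns.get(tuple(tokens[:i]))
        if pid is not None:
            return pid, tokens[i:]
    return None, tokens  # no frequent prefix matched


def decode(encoded, patterns):
    """Invert encode(); assumes tokens were separated by single spaces."""
    pid, rest = encoded
    inverse = {v: k for k, v in patterns.items()}
    prefix = list(inverse[pid]) if pid is not None else []
    return " ".join(prefix + rest)


logs = [
    "Connection from 10.0.0.5 closed",
    "Connection from 10.0.0.5 opened",
    "Connection from 10.0.0.9 closed",
]
patterns = mine_prefix_patterns(logs)
encoded = [encode(line, patterns) for line in logs]
```

In this sketch the first two lines share the prefix `Connection from 10.0.0.5`, so the recurring address is folded into the pattern and only the final token remains as a residual. The third line falls back to the shorter prefix `Connection from` and keeps its address as a residual.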