Log parsing is a critical step in the operation of software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribution shifts between training and deployment data, and the brittleness of rule-based approaches. This study systematically evaluates how sequence modelling architecture, representation choice, sequence length, and training data availability influence automated log parsing performance and computational cost. We conduct a controlled empirical study comparing four sequence modelling architectures: Transformer, Mamba state-space, unidirectional LSTM (mono-LSTM), and bidirectional LSTM (bi-LSTM) models. In total, 396 models are trained across multiple dataset configurations and evaluated using relative Levenshtein edit distance (lower is better) with statistical significance testing. The Transformer achieves the lowest mean relative edit distance (0.111), followed by Mamba (0.145), mono-LSTM (0.186), and bi-LSTM (0.265). Mamba provides competitive accuracy at substantially lower computational cost. Character-level tokenization generally improves performance, sequence length has negligible practical impact on Transformer accuracy, and both Mamba and Transformer demonstrate stronger sample efficiency than the recurrent models. Overall, Transformers reduce parsing error by 23.4%, while Mamba is a strong alternative under data or compute constraints. These results also clarify the roles of representation choice, sequence length, and sample efficiency, providing practical guidance for researchers and practitioners.
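The evaluation metric above, relative Levenshtein edit distance, can be sketched as follows. This is a minimal illustrative implementation, not the study's exact code; in particular, normalizing by the length of the longer string is an assumption about how "relative" is defined.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming (Wagner-Fischer) edit distance."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion from a
                curr[j - 1] + 1,             # insertion into a
                prev[j - 1] + (ca != cb),    # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]

def relative_edit_distance(pred: str, target: str) -> float:
    """Edit distance normalized into [0, 1].

    Dividing by the longer string's length is an assumed normalizer;
    lower values indicate a closer match, as in the reported results.
    """
    if not pred and not target:
        return 0.0
    return levenshtein(pred, target) / max(len(pred), len(target))
```

For example, `relative_edit_distance("kitten", "sitting")` yields 3/7, since three edits separate the strings and the longer one has seven characters.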