ECLIPSE: Semantic Entropy-LCS for Cross-Lingual Industrial Log Parsing

Log parsing, a vital task for interpreting the vast and complex data produced within software architectures, faces significant challenges in the transition from academic benchmarks to the industrial domain. Existing log parsers, while highly effective on standardized public datasets, struggle to maintain performance and efficiency when confronted with the sheer scale and diversity of real-world industrial logs. The challenges are twofold: 1) Massive log templates: the performance and efficiency of most existing parsers degrade significantly as logs grow in quantity and vary in length; 2) Complex and changeable semantics: traditional template-matching algorithms cannot accurately match the templates of complicated industrial logs because they cannot exploit cross-language logs with similar semantics. To address these issues, we propose ECLIPSE (Enhanced Cross-Lingual Industrial log Parsing with Semantic Entropy-LCS), which exploits cross-language logs to parse industrial logs robustly. On the one hand, it integrates two efficient data-driven template-matching algorithms with Faiss indexing. On the other hand, it draws on the semantic understanding ability of Large Language Models (LLMs) to extract the semantics of log keywords accurately and to reduce the retrieval space. We also introduce ECLIPSE-Bench, a Chinese and English cross-platform industrial log parsing benchmark, to evaluate mainstream parsers in industrial scenarios. Experimental results on public benchmarks and the proprietary ECLIPSE-Bench dataset demonstrate the performance and robustness of ECLIPSE. Notably, ECLIPSE achieves state-of-the-art performance against strong baselines on diverse datasets while retaining a significant advantage in processing efficiency.
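
To make the LCS-based template matching mentioned above concrete, the following is a minimal sketch of a generic token-level longest-common-subsequence merge. It is not the ECLIPSE algorithm itself: the abstract does not specify how Semantic Entropy-LCS works, so the entropy weighting, the LLM keyword extraction, and the Faiss retrieval step are all omitted. The names `lcs_tokens`, `merge_template`, and `WILDCARD` are hypothetical and introduced only for illustration.

```python
from typing import List

WILDCARD = "<*>"  # hypothetical placeholder for a variable field


def lcs_tokens(a: List[str], b: List[str]) -> List[str]:
    """Return one longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table forward to recover the subsequence.
    out: List[str] = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def merge_template(a: List[str], b: List[str]) -> List[str]:
    """Keep the shared tokens; mark every differing stretch with one wildcard."""
    template: List[str] = []
    i = j = 0
    for tok in lcs_tokens(a, b):
        gap = False
        # Skip tokens in either log that are not part of the common subsequence.
        while a[i] != tok:
            i += 1
            gap = True
        while b[j] != tok:
            j += 1
            gap = True
        if gap:
            template.append(WILDCARD)
        template.append(tok)
        i += 1
        j += 1
    # Trailing tokens that differ collapse into a final wildcard.
    if i < len(a) or j < len(b):
        template.append(WILDCARD)
    return template


log_a = "Connection from 10.0.0.1 closed".split()
log_b = "Connection from 10.0.0.2 closed".split()
print(" ".join(merge_template(log_a, log_b)))  # Connection from <*> closed
```

Because the merge compares tokens by exact equality, it cannot relate a Chinese log to an English log with the same meaning. That limitation is the gap the semantic component described in the abstract is intended to close.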
