LogSieve: Task-Aware CI Log Reduction for Sustainable LLM-Based Analysis

arXiv preprint. Accepted for presentation at Mining Software Repositories (MSR'26), co-located with ICSE 2026. The final version will appear in the ACM Digital Library as part of the MSR'26 conference proceedings.

Logs are essential for understanding Continuous Integration (CI) behavior, particularly for diagnosing build failures and performance regressions. Yet their growing volume and verbosity make both manual inspection and automated analysis increasingly expensive, time-consuming, and environmentally costly. While prior work has explored log compression, anomaly detection, and LLM-based log analysis, most efforts target structured system logs rather than the unstructured, noisy, and verbose logs typical of CI workflows. We present LogSieve, a lightweight, semantics-preserving log reduction technique that is aware of root cause analysis (RCA): it filters low-information lines while retaining content relevant to downstream reasoning. Evaluated on CI logs from 20 open-source Android projects using GitHub Actions, LogSieve achieves an average 42% reduction in lines and 40% reduction in tokens with minimal semantic loss. This pre-inference reduction lowers computational cost and can proportionally reduce energy use (and associated emissions) by decreasing the volume of data processed during LLM inference. Compared with structure-first baselines (LogZip and random-line removal), LogSieve preserves much higher semantic and categorical fidelity (Cosine = 0.93, GPTScore = 0.93, 80% exact-match accuracy). Embedding-based classifiers automate relevance detection with near-human accuracy (97%), enabling scalable and sustainable integration of semantics-aware filtering into CI workflows. LogSieve thus bridges log management and LLM reasoning, offering a practical path toward greener and more interpretable CI automation.
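To make the reported quantities concrete, the following is a minimal sketch of pre-inference line filtering and of how line reduction, token reduction, and a cosine score could be computed. It is not LogSieve itself: the regex noise patterns, the whitespace tokenizer, the bag-of-words cosine, and all function names are assumptions chosen for illustration (the abstract describes embedding-based relevance classifiers, which are replaced here by simple rules), and the sample log lines are invented.

```python
import math
import re
from collections import Counter

# Hypothetical noise patterns; the paper's actual relevance criteria
# are not stated in the abstract.
NOISE_PATTERNS = [
    re.compile(r"^\s*$"),                        # blank lines
    re.compile(r"^\s*Download(ing)?\b", re.I),   # download chatter
    re.compile(r"^\s*\d{1,3}%"),                 # progress percentages
]


def is_low_information(line: str) -> bool:
    """Return True if the line matches any assumed noise pattern."""
    return any(p.search(line) for p in NOISE_PATTERNS)


def reduce_log(lines: list[str]) -> list[str]:
    """Keep only lines that are not flagged as low-information."""
    return [ln for ln in lines if not is_low_information(ln)]


def count_tokens(lines: list[str]) -> int:
    """Whitespace token count, a rough stand-in for an LLM tokenizer."""
    return sum(len(ln.split()) for ln in lines)


def reduction(before: int, after: int) -> float:
    """Fraction removed, e.g. 0.42 for a 42% reduction."""
    return 0.0 if before == 0 else 1.0 - after / before


def bow_cosine(a_lines: list[str], b_lines: list[str]) -> float:
    """Bag-of-words cosine similarity between two logs (not embeddings)."""
    a = Counter(" ".join(a_lines).lower().split())
    b = Counter(" ".join(b_lines).lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Invented sample lines, for illustration only.
raw = [
    "Downloading dependency-a 1.0",
    "50% done",
    "",
    "compile step FAILED",
    "error: unresolved reference foo",
]
kept = reduce_log(raw)
line_reduction = reduction(len(raw), len(kept))
token_reduction = reduction(count_tokens(raw), count_tokens(kept))
similarity = bow_cosine(raw, kept)
```

In this toy setup the filter runs before any LLM call, so only `kept` would be sent for analysis. A real pipeline would likely swap the regex rules for a learned relevance classifier and measure tokens with the target model's tokenizer.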
