LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

CI failure logs are large (median 5k lines, max 200k in this corpus) and noisy. Coding agents that debug them depend on an upstream tool to reduce the log to a manageable context, yet there has been no public empirical comparison of which reductions preserve enough evidence for downstream LLM diagnosis. We introduce LogDx-CI, a benchmark that compares 11 context-reduction tools (raw, tail, grep, three RTK modes, two real LLM map-reduce summarizers, three hybrid routers) on 35 real GitHub Actions failure cases, scored by 3 LLM debugger families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini) plus a Sonnet 4.6 tool-using agent. We report three load-bearing findings. (1)~Hybrid grep+tail routers dominate the cost-quality Pareto frontier: the top two methods score 0.670 and 0.666 at $\sim$\$0.03 per case, quality comparable to standalone grep with $4.5\times$ fewer tokens. (2)~In the agent-loop regime, the quality range across reduction tools shrinks roughly $7\times$ (single-shot spread 0.42 $\to$ agent-loop spread 0.059), because the agent rescues weak contexts via follow-up tool calls. Cost differences persist, however: weak contexts force the agent to issue 2--4$\times$ more tool calls to recover. (3)~A cross-family LLM-summary pair (gpt-5-mini summarizer feeding a Claude Haiku debugger) beats the same-family pair by $+0.071$ averaged across four diagnoser variants, contradicting the self-call-bias hypothesis on this task. The gpt-5-mini summarizer is also the top agent-loop method (score 0.749), at 0.37 tool calls per case and roughly $10\times$ lower reducer cost than the Haiku summarizer (\$0.18 vs \$1.75 per case). All data, code, per-case bundles, and reproducibility infrastructure are public.

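
To make the idea of a grep+tail reduction concrete, the following is a minimal sketch, not the benchmark's own router. It assumes that such a reducer keeps lines near error-like matches together with the last lines of the log. The function name `grep_tail_reduce`, the regular expression, and the defaults `tail_lines` and `context` are hypothetical choices made for illustration; the routers evaluated in LogDx-CI may select or combine evidence differently.

```python
import re

# Hypothetical failure pattern (an assumption for this sketch, not from LogDx-CI).
ERROR_PATTERN = re.compile(r"error|failed|exception|traceback|fatal", re.IGNORECASE)


def grep_tail_reduce(log_text, tail_lines=50, context=2, pattern=ERROR_PATTERN):
    """Keep lines near pattern matches plus the last `tail_lines` lines."""
    lines = log_text.splitlines()
    keep = set()

    # grep part: every matching line plus `context` lines on each side.
    for i, line in enumerate(lines):
        if pattern.search(line):
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))

    # tail part: the final lines of the log.
    keep.update(range(max(0, len(lines) - tail_lines), len(lines)))

    # Emit kept lines in order, marking each gap so the reader sees what was cut.
    out = []
    prev = -1
    for i in sorted(keep):
        if i > prev + 1:
            out.append(f"[... {i - prev - 1} lines omitted ...]")
        out.append(lines[i])
        prev = i
    return "\n".join(out)
```

Calling `grep_tail_reduce(log_text, tail_lines=5, context=1)` on a 100-line log with a single error line returns that line, one neighbor on each side, the last five lines, and two omission markers. The markers are a design choice of this sketch: they let a downstream debugger see that context is missing and how much.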