LogRouter: Adaptive Two-Level LLM Routing for Log Question Answering in Big Data Systems

Production log analytics in self-hosted, resource-constrained environments requires natural-language access to massive log streams without the cost of sending every query through a large language model. We present LogRouter, an end-to-end log question-answering system deployed on TUBITAK BILGEM's national big data platform. It combines a PySpark-based Drain3 ingestion pipeline, GPU-accelerated embeddings, and dual-index storage in Apache Druid and PostgreSQL with pgvector. A two-level cost-aware router handles each query: Level 1 dispatches it along one of four execution paths (direct response, Druid keyword search, template lookup with SQL generation, or pgvector semantic retrieval), and Level 2 selects either a 14B-class or a 32B-class generator for the semantic path. A dedicated coder LLM handles text-to-SQL generation. We evaluate the system on four LogHub datasets (Linux, Apache, Windows, and Mac; 70 questions in total) under both an online full-pipeline configuration and an offline configuration that isolates the generator. The router reaches 88.4% mean accuracy across datasets and 94.7% on Linux, while the full pipeline attains a mean ROUGE-1 of 0.373, a BERTScore of 0.879, a RAGAS Faithfulness of 0.779, and an end-to-end latency of 18.6 s. In a like-for-like offline comparison, the routed system reduces mean latency by 55% versus a Fixed-32B baseline (46.3 s vs. 102.1 s) while keeping Answer Correctness within 5.8 points of that baseline and exceeding a Fixed-14B baseline on RAGAS Faithfulness on every dataset. Cost-aware dispatching is therefore a practical mechanism for production log QA: routing recovers most of the quality of an always-32B configuration at less than half the latency, and the Level-1 keyword vocabulary makes the routing decision with high precision without a learned classifier.
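As a rough illustration of the two-level dispatch described above, the following is a minimal sketch and not the deployed implementation. The keyword lists, the complexity cues, the length threshold, and all function and class names are hypothetical; only the four path names and the 14B-class/32B-class split come from the abstract.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Path(Enum):
    DIRECT = "direct response"
    KEYWORD = "Druid keyword search"
    TEMPLATE_SQL = "template lookup with SQL generation"
    SEMANTIC = "pgvector semantic retrieval"


# Hypothetical Level-1 keyword vocabulary (assumption: the real terms are not given).
L1_VOCAB = {
    Path.DIRECT: ("hello", "thanks", "who are you"),
    Path.KEYWORD: ("find", "grep", "containing", "show lines"),
    Path.TEMPLATE_SQL: ("how many", "count", "top", "most frequent", "per hour"),
}

# Hypothetical Level-2 cues that suggest a query needs the larger generator.
COMPLEX_CUES = ("why", "root cause", "explain", "compare", "correlate")
LONG_QUERY_WORDS = 25  # assumed threshold, not taken from the paper


@dataclass(frozen=True)
class Route:
    path: Path
    generator: Optional[str] = None  # set only for the semantic path


def _has_phrase(text: str, phrase: str) -> bool:
    # Whole-word match so that "top" does not fire on "stop".
    return re.search(r"\b" + re.escape(phrase) + r"\b", text) is not None


def route_level1(query: str) -> Path:
    """Pick an execution path by keyword lookup; fall back to semantic retrieval."""
    q = query.lower()
    for path, phrases in L1_VOCAB.items():
        if any(_has_phrase(q, p) for p in phrases):
            return path
    return Path.SEMANTIC


def route_level2(query: str) -> str:
    """Choose a generator size for the semantic path with a simple heuristic."""
    q = query.lower()
    is_complex = any(_has_phrase(q, c) for c in COMPLEX_CUES)
    is_long = len(q.split()) > LONG_QUERY_WORDS
    return "32B-class" if (is_complex or is_long) else "14B-class"


def route(query: str) -> Route:
    path = route_level1(query)
    if path is Path.SEMANTIC:
        return Route(path, route_level2(query))
    return Route(path)
```

Under these assumptions, cheap paths are tried first and a generator is selected only when a query falls through to semantic retrieval, which is where the latency savings over an always-32B configuration would come from.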
