Configuration Validation with Large Language Models

Misconfigurations are a major cause of software failures. Existing configuration validation techniques rely on manually written rules or test cases, which are expensive to implement and maintain and are hard to make comprehensive. Leveraging machine learning (ML) and natural language processing (NLP) for configuration validation is considered a promising direction, but it faces challenges such as the need for not only large-scale configuration data but also system-specific features and models that are hard to generalize. Recent advances in Large Language Models (LLMs) show promise in addressing some of the long-standing limitations of ML/NLP-based configuration validation techniques. In this paper, we present an exploratory analysis of the feasibility and effectiveness of using LLMs such as GPT and Codex for configuration validation. Specifically, we take a first step toward empirically evaluating LLMs as configuration validators without additional fine-tuning or code generation. We develop a generic LLM-based validation framework, named Ciri, which integrates different LLMs. Ciri uses effective prompt engineering with few-shot learning based on both valid configuration and misconfiguration data. Ciri also validates and aggregates the outputs of LLMs to generate validation results, coping with the known hallucination and nondeterminism of LLMs. We evaluate the validation effectiveness of Ciri on five popular LLMs using configuration data from six mature, widely deployed open-source systems. Our analysis (1) confirms the potential of using LLMs for configuration validation, (2) explores the design space of LLM-based validators like Ciri, especially in terms of prompt engineering with few-shot learning, and (3) reveals open challenges such as ineffectiveness in detecting certain types of misconfigurations and biases toward popular configuration parameters.
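
To make the two mechanisms named in the abstract more concrete (few-shot prompting from valid and misconfigured examples, and validating and aggregating LLM outputs), the following is a minimal sketch and not Ciri's actual implementation. The JSON answer format, the majority-vote aggregation, the `llm` callable, and all function names are assumptions made for illustration.

```python
import json
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical few-shot example: (parameter, value, is_misconfiguration, reason).
Shot = Tuple[str, str, bool, str]


def build_prompt(shots: List[Shot], param: str, value: str) -> str:
    """Assemble a few-shot prompt from labeled shots plus the config under test."""
    lines = [
        "Decide whether each configuration is a misconfiguration.",
        'Answer with JSON: {"misconfig": true|false, "reason": "..."}',
        "",
    ]
    # Each shot shows the model one labeled valid config or misconfiguration.
    for p, v, bad, reason in shots:
        lines.append(f"Config: {p}={v}")
        lines.append("Answer: " + json.dumps({"misconfig": bad, "reason": reason}))
        lines.append("")
    lines.append(f"Config: {param}={value}")
    lines.append("Answer:")
    return "\n".join(lines)


def parse_answer(raw: str) -> Optional[bool]:
    """Return the verdict, or None if the output is not in the assumed format."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # free-form or truncated output is discarded
    if not isinstance(data, dict):
        return None
    verdict = data.get("misconfig")
    return verdict if isinstance(verdict, bool) else None


def validate(llm: Callable[[str], str], prompt: str, n_queries: int = 5) -> Optional[bool]:
    """Query the model several times and majority-vote over well-formed answers.

    Repeated queries are one (assumed) way to dampen nondeterminism; dropping
    malformed answers is one (assumed) way to filter unusable output.
    Ties resolve to the verdict that was seen first.
    """
    verdicts = [parse_answer(llm(prompt)) for _ in range(n_queries)]
    votes = Counter(v for v in verdicts if v is not None)
    if not votes:
        return None  # every response was malformed
    return votes.most_common(1)[0][0]
```

Here `llm` stands in for any function that sends a prompt to a model and returns its text response. In this sketch a result of `True` means the configuration is flagged as a misconfiguration, `False` means it is accepted, and `None` means no usable answer was obtained.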
