Background/Context: Large Language Models (LLMs) demonstrate strong performance on low-dimensional software engineering optimization tasks ($\le$11 features) but consistently underperform on high-dimensional problems where Bayesian methods dominate. A fundamental gap exists in understanding how systematic integration of domain knowledge (whether from humans or automated reasoning) can bridge this divide. Objective/Aim: We compare human versus artificial intelligence strategies for generating domain knowledge. We systematically evaluate four distinct architectures to determine whether structured knowledge integration enables LLMs to generate effective warm starts for high-dimensional optimization. Method: We evaluate four approaches on MOOT datasets stratified by dimensionality: (1) Human-in-the-Loop Domain Knowledge Prompting (H-DKP), utilizing asynchronous expert feedback loops; (2) Adaptive Multi-Stage Prompting (AMP), implementing sequential constraint identification and validation; (3) Dimension-Aware Progressive Refinement (DAPR), conducting optimization in progressively expanding feature subspaces; and (4) Hybrid Knowledge-Model Approach (HKMA), synthesizing statistical scouting (TPE) with RAG-enhanced prompting. Performance is quantified via Chebyshev distance to optimal solutions and ranked using Scott-Knott clustering against an established baseline for LLM-generated warm starts. All human studies conducted as part of this work will comply with the policies of our local Institutional Review Board.
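The Chebyshev distance metric used above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each objective is min-max normalized to [0, 1] and compared against a known ideal point, with the distance taken as the worst (maximum) per-objective deviation. The helper name and normalization bounds are hypothetical.

```python
def chebyshev_distance(objectives, ideal, lo, hi):
    """Max per-objective deviation from the ideal point, after min-max
    normalization of each objective to [0, 1] using bounds (lo, hi).
    All argument names here are illustrative, not from the paper."""
    normalized = [
        (v - l) / (h - l) if h > l else 0.0
        for v, l, h in zip(objectives, lo, hi)
    ]
    # Chebyshev (L-infinity) distance: the single worst deviation dominates
    return max(abs(n - i) for n, i in zip(normalized, ideal))


# A candidate scoring 5 on a [0, 10] objective (ideal 0) and 2 on a
# [0, 2] objective (ideal 1, i.e. raw value 2) has deviations 0.5 and 0.0,
# so its Chebyshev distance is 0.5.
print(chebyshev_distance([5, 2], ideal=[0, 1], lo=[0, 0], hi=[10, 2]))
```

Lower values are better: a candidate is only as good as its worst objective, which is why this metric is a common aggregate score when ranking warm-start candidates across multiple optimization goals.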