Automated requirements assessment traditionally relies on universal patterns as proxies for defectiveness, implemented through rule-based heuristics or machine learning classifiers trained on large annotated datasets. However, what constitutes a "defect" is inherently context-dependent and varies across projects, domains, and stakeholder interpretations. In this paper, we propose a Human-LLM Collaboration (HLC) approach that treats defect prediction as an adaptive process rather than a static classification task. HLC leverages LLM Chain-of-Thought reasoning in a feedback loop: users validate predictions alongside their explanations, and these validated examples adaptively guide future predictions through few-shot learning. We evaluate this approach on the weak-word smell using the QuRE benchmark of 1,266 annotated Mercedes-Benz requirements. Our results show that HLC adapts effectively as validated examples are provided, achieving rapid performance gains from as few as 20 examples. Incorporating validated explanations, not just labels, enables HLC to substantially outperform both standard few-shot prompting and fine-tuned BERT models while maintaining high recall. These results highlight how the in-context and Chain-of-Thought learning capabilities of LLMs enable adaptive classification approaches that move beyond one-size-fits-all models, creating opportunities for tools that learn continuously from stakeholder feedback.
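To fix ideas, the feedback loop can be pictured as a prompt that grows with user-validated examples. The following is a minimal sketch under our own illustrative naming (ValidatedExample, build_prompt, classify, add_feedback are not artifacts of the paper), with the LLM abstracted as a plain callable rather than any specific API:

```python
# Minimal sketch of an adaptive few-shot loop for the weak-word smell.
# All names here are illustrative; the LLM call is passed in as a callable.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ValidatedExample:
    requirement: str   # requirement text
    label: str         # e.g. "weak-word smell" or "no smell", as confirmed by the user
    explanation: str   # the Chain-of-Thought rationale the user validated


def build_prompt(examples: List[ValidatedExample], requirement: str) -> str:
    """Assemble a few-shot prompt from validated (requirement, reasoning, label) triples."""
    shots = "\n\n".join(
        f"Requirement: {e.requirement}\nReasoning: {e.explanation}\nLabel: {e.label}"
        for e in examples
    )
    return (
        "Decide whether each requirement contains the weak-word smell.\n\n"
        f"{shots}\n\nRequirement: {requirement}\nReasoning:"
    )


def classify(requirement: str,
             validated: List[ValidatedExample],
             llm: Callable[[str], str]) -> str:
    """One prediction pass using the current pool of validated examples."""
    return llm(build_prompt(validated, requirement))


def add_feedback(validated: List[ValidatedExample],
                 requirement: str, label: str, explanation: str) -> None:
    """User confirms or corrects a prediction and its explanation; the triple
    is added to the pool and guides subsequent predictions."""
    validated.append(ValidatedExample(requirement, label, explanation))
```

The point of the sketch is that each validated item carries its explanation alongside its label, so the reasoning itself becomes part of the few-shot context, consistent with the abstract's observation that validated explanations, not just labels, drive the performance gains.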