In-context learning (ICL), in which a language model learns a task at inference time, shows increasing promise in natural language inference tasks. In ICL, a model user constructs a prompt that describes a task with a natural language instruction and zero or more examples, called demonstrations. The prompt is then input to the language model to generate a completion. In this paper, we apply ICL to the design and evaluation of satisfaction arguments, which describe how a requirement is satisfied by a system specification and associated domain knowledge. The approach builds on three prompt design patterns, namely augmented generation, prompt tuning, and chain-of-thought prompting, and is evaluated on a privacy problem to check whether a mobile app scenario and associated design description satisfy eight consent requirements from the EU General Data Protection Regulation (GDPR). The overall results show that GPT-4 can be used to verify requirements satisfaction with 96.7% accuracy and dissatisfaction with 93.2% accuracy. Inverting the requirement improves the accuracy of verifying dissatisfaction to 97.2%. Chain-of-thought prompting improves overall GPT-3.5 accuracy by 9.0%. We discuss the trade-offs among templates, models, and prompt strategies and provide a detailed analysis of the generated specifications to inform how the approach can be applied in practice.
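As a rough illustration of the prompt structure described above (an instruction, zero or more demonstrations, then the query to be completed), the following is a minimal sketch. It is not the template used in the paper: the `Demonstration` class, the `build_prompt` function, the field labels, the chain-of-thought cue, and the example requirement wording are all hypothetical and chosen only for illustration. The resulting string would be passed to a language model of choice; no model call is shown.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Demonstration:
    """One worked example shown to the model (hypothetical structure)."""
    scenario: str
    answer: str


def build_prompt(
    instruction: str,
    requirement: str,
    scenario: str,
    demonstrations: Sequence[Demonstration] = (),
    chain_of_thought: bool = False,
) -> str:
    """Assemble an ICL prompt: instruction, demonstrations, then the query."""
    parts = [instruction, f"Requirement: {requirement}"]
    # Zero demonstrations gives a zero-shot prompt; more gives few-shot.
    for demo in demonstrations:
        parts.append(f"Scenario: {demo.scenario}\nAnswer: {demo.answer}")
    # Assumed chain-of-thought cue; the paper's actual wording is not known here.
    cue = "Let's think step by step." if chain_of_thought else ""
    # The final block is left open so the model's completion supplies the answer.
    parts.append(f"Scenario: {scenario}\nAnswer: {cue}".rstrip())
    return "\n\n".join(parts)


# Illustrative usage with made-up scenario text.
prompt = build_prompt(
    instruction="Decide whether the scenario satisfies the requirement. Answer yes or no.",
    requirement="Consent must be as easy to withdraw as to give.",
    scenario="Users can withdraw consent on the same settings screen where they granted it.",
    demonstrations=[Demonstration("The app offers no way to withdraw consent.", "no")],
)
print(prompt)
```

Keeping the demonstrations as an optional sequence makes it straightforward to compare zero-shot and few-shot variants of the same instruction, and the `chain_of_thought` flag isolates that one change when comparing prompt strategies.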