Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped with an SMT solver, can audit such requirements: translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalization, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification. We present VERIMED, a neurosymbolic pipeline that operationalizes this idea for medical-device software requirements, and report two findings. First, stochastic variation across independent formalizations is a signal of ambiguity: requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations, and bidirectional SMT equivalence checking turns this disagreement into a solver-checkable test. Second, the usefulness of symbolic feedback depends on its granularity: in counterexample-guided repair on a hemodialysis question-answering benchmark, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5%. Over an extensive experimental evaluation on open-source hemodialysis safety requirements, we show that the LLM-based approach in VERIMED successfully reduces ambiguity-sensitive requirements and enables rigorous auditing of software requirements through SMT-based queries.
翻译:自然语言软件需求常常存在歧义、不一致和未充分说明的问题;在安全关键领域,这些缺陷会传播到验证错误规约的形式化模型中,并最终导致实现中出现不安全行为。我们证明,配备了SMT求解器的大型语言模型可以审计此类需求:将其转换为形式化逻辑,通过生成的形式化表达中的随机变化来检测歧义,并通过求解器查询暴露不一致性、空洞性和违反安全性的问题。我们提出了VERIMED——一种将该思想应用于医疗设备软件需求的神经符号流水线,并报告了两项发现。首先,独立形式化表达之间的随机变化是歧义的信号:允许多种合理解释的需求会产生SMT不等价的形式化表达,而双向SMT等价性检查将此不一致性转化为求解器可检验的测试。其次,符号反馈的有用性取决于其粒度:在血液透析问答基准上的反例引导修复中,具体的SMT反例将验证准确率从55.4%提升至98.5%。通过对开源血液透析安全需求的大量实验评估,我们证明VERIMED中基于LLM的方法成功减少了歧义敏感的需求,并实现了通过SMT查询对软件需求的严格审计。