Large language models (LLMs) excel at natural language reasoning but remain unreliable on tasks requiring strict rule adherence, determinism, and auditability. Logic Sketch Prompting (LSP) is a lightweight prompting framework that introduces typed variables, deterministic condition evaluators, and a rule-based validator to produce traceable, repeatable outputs. Using two pharmacologic logic compliance tasks, we benchmark LSP against zero-shot prompting, chain-of-thought prompting, and concise prompting across three open-weight models: Gemma 2, Mistral, and Llama 3. Across both tasks and all models, LSP consistently achieves the highest accuracy (0.83 to 0.89) and F1 score (0.83 to 0.89), substantially outperforming zero-shot prompting (0.24 to 0.60), concise prompting (0.16 to 0.30), and chain-of-thought prompting (0.56 to 0.75). McNemar tests show statistically significant gains for LSP in nearly all comparisons (p < 0.01). These results demonstrate that LSP improves determinism, interpretability, and consistency without sacrificing performance, supporting its use in clinical, regulated, and safety-critical decision support systems.
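To make the three LSP components concrete, the following is a minimal Python sketch, not the authors' implementation: the class names (`TypedVar`, `Rule`), the `validate` function, and the pharmacologic rules shown are hypothetical illustrations of typed variables, deterministic condition evaluation, and rule-based validation under the assumptions stated in the comments.

```python
# Minimal sketch (not the paper's implementation) of the three LSP
# components named above: typed variables, deterministic condition
# evaluators, and a rule-based validator. All names and the example
# drug-dosing rules are hypothetical illustrations.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class TypedVar:
    """A typed variable extracted from the model's structured output."""
    name: str
    value: object
    vtype: type

    def __post_init__(self):
        # Type checking happens outside the LLM, so it is deterministic.
        if not isinstance(self.value, self.vtype):
            raise TypeError(f"{self.name}: expected {self.vtype.__name__}")

@dataclass(frozen=True)
class Rule:
    """A deterministic condition paired with the verdict it implies."""
    name: str
    condition: Callable[[dict], bool]  # pure function of the variables
    verdict: str

def validate(variables: list[TypedVar], rules: list[Rule]) -> list[tuple[str, str]]:
    """Rule-based validator: evaluate every rule deterministically and
    return a trace of (rule name, verdict) pairs, so identical inputs
    always yield identical, auditable outputs."""
    env = {v.name: v.value for v in variables}
    return [(r.name, r.verdict) for r in rules if r.condition(env)]

# Hypothetical pharmacologic compliance check in the spirit of the tasks
# described above (the paper's actual rules are not reproduced here).
vars_ = [
    TypedVar("dose_mg", 600.0, float),
    TypedVar("renal_impairment", True, bool),
]
rules = [
    Rule("max_dose",
         lambda e: e["dose_mg"] > 400.0,
         "non-compliant: dose exceeds maximum"),
    Rule("renal_adjust",
         lambda e: e["renal_impairment"] and e["dose_mg"] > 200.0,
         "non-compliant: renal dose adjustment required"),
]
for name, verdict in validate(vars_, rules):
    print(f"[{name}] {verdict}")
```

The design point this sketch illustrates is that the LLM is only asked to fill in typed values; all conditional logic runs in ordinary code, which is what makes the outputs traceable and repeatable.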