Green Shielding: A User-Centric Approach Towards Trustworthy AI

Large language models (LLMs) are increasingly deployed, yet their outputs can be highly sensitive to routine, non-adversarial variation in how users phrase queries, a gap not well addressed by existing red-teaming efforts. We propose Green Shielding, a user-centric agenda for building evidence-backed deployment guidance by characterizing how benign input variation shifts model behavior. We operationalize this agenda through the CUE criteria: benchmarks with authentic Context, reference standards and metrics that capture true Utility, and perturbations that reflect realistic variations in the Elicitation of model behavior. Guided by the PCS framework and developed with practicing physicians, we instantiate Green Shielding in medical diagnosis through HealthCareMagic-Diagnosis (HCM-Dx), a benchmark of patient-authored queries, together with structured reference diagnosis sets and clinically grounded metrics for evaluating differential diagnosis lists. We also study perturbation regimes that capture routine input variation and show that prompt-level factors shift model behavior along clinically meaningful dimensions. Across multiple frontier LLMs, these shifts trace out Pareto-like tradeoffs. In particular, neutralization, which removes common user-level factors while preserving clinical content, increases plausibility and yields more concise, clinician-like differentials, but reduces coverage of highly likely and safety-critical conditions. Together, these results show that interaction choices can systematically shift task-relevant properties of model outputs and support user-facing guidance for safer deployment in high-stakes domains. Although instantiated here in medical diagnosis, the agenda extends naturally to other decision-support settings and agentic AI systems.
