This paper demonstrates RUBEN, an interactive tool for discovering minimal rules that explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applications. We leverage novel pruning strategies to efficiently identify a minimal set of rules that subsume all others. We further demonstrate novel applications of these rules to LLM safety, specifically to test the resilience of safety training and the effectiveness of adversarial prompt injections.