With the recent proliferation of large language models (LLMs), enterprises have been able to rapidly develop proofs of concept and prototypes. As a result, there is a growing need to implement robust guardrails that monitor, quantify, and control an LLM's behavior, ensuring that its use is reliable, safe, accurate, and aligned with users' expectations. Previous approaches to filtering out inappropriate user prompts or system outputs, such as LlamaGuard and OpenAI's MOD API, have achieved significant success by fine-tuning existing LLMs. However, using fine-tuned LLMs as guardrails introduces increased latency and higher maintenance costs, which may not be practical or scalable for cost-efficient deployments. We take a different approach, focusing on fine-tuning a lightweight architecture, Sentence-BERT. This method reduces the model size from LlamaGuard's 7 billion parameters to approximately 67 million, while maintaining comparable performance on the AEGIS safety benchmark.