PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
Abstract
Deploying language models (LMs) requires outputs that are both high-quality and compliant with safety guidelines. Although Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance, we find that current methods struggle to balance safety with helpfulness. ITG methods that safely address non-compliant queries exhibit lower helpfulness, while those that prioritize helpfulness compromise on safety. We refer to this trade-off as the guardrail tax, analogous to the alignment tax. To address this, we propose PrimeGuard, a novel ITG method that utilizes structured control flow. PrimeGuard routes requests to different self-instantiations of the LM with varying instructions, leveraging its inherent instruction-following capabilities and in-context learning. Our tuning-free approach dynamically compiles system-designer guidelines for each query. We construct and release safe-eval, a diverse red-team safety benchmark. Extensive evaluations demonstrate that PrimeGuard, without fine-tuning, overcomes the guardrail tax by (1) significantly increasing resistance to iterative jailbreak attacks, (2) achieving state-of-the-art results in safety guardrailing, and (3) matching helpfulness scores of alignment-tuned models. It outperforms all competing baselines, improving the fraction of safe responses from 61% to 97% and increasing average helpfulness scores from 4.17 to 4.29 on the largest models, while reducing attack success rate from 100% to 8%. The PrimeGuard implementation is available at https://github.com/dynamofl/PrimeGuard and the safe-eval dataset is available at https://huggingface.co/datasets/dynamoai/safe_eval.
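To make the routing idea in the abstract concrete, here is a minimal Python sketch of tuning-free routing between differently instructed calls to the same LM. It is an illustration only, not the released PrimeGuard code: the route labels, the prompt wording, the `LM` callable type, and the `fake_lm` stub are all assumptions introduced here, and the abstract does not specify any of them.

```python
from typing import Callable, Tuple

# Hypothetical interface: any function mapping (system_instruction, user_query) -> text.
LM = Callable[[str, str], str]

# Hypothetical route labels; the abstract does not name the routes.
ROUTES = ("no_violation", "potential_violation", "direct_violation")


def classify_risk(lm: LM, guidelines: str, query: str) -> str:
    """Ask the same LM, under a guard instruction, to label the query."""
    instruction = (
        "You are a guard. Using the guidelines below, answer with exactly one of: "
        + ", ".join(ROUTES)
        + ".\n\nGuidelines:\n"
        + guidelines
    )
    answer = lm(instruction, query).strip().lower()
    # Unparseable guard output falls back to the cautious middle route.
    return answer if answer in ROUTES else "potential_violation"


def route(lm: LM, guidelines: str, query: str) -> Tuple[str, str]:
    """Pick a system instruction per query, then call the LM again with it."""
    label = classify_risk(lm, guidelines, query)
    if label == "no_violation":
        system = "Answer as helpfully as possible."
    elif label == "direct_violation":
        system = "Politely decline and cite the relevant guideline:\n" + guidelines
    else:
        system = "Answer helpfully while strictly following these guidelines:\n" + guidelines
    return label, lm(system, query)


def fake_lm(system: str, query: str) -> str:
    """Stand-in for a real model so the sketch runs offline (assumption)."""
    if system.startswith("You are a guard"):
        return "direct_violation" if "explosive" in query else "no_violation"
    return f"[{system.splitlines()[0]}] reply to: {query}"
```

Calling `route(fake_lm, "No weapons instructions.", "What is the capital of France?")` exercises the helpful path; swapping `fake_lm` for a wrapper around a real chat model would be the natural next step. No weights are updated anywhere, which is the sense in which such a scheme is tuning-free.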
Community
PrimeGuard 🤺 is a novel Inference-Time Guardrailing (ITG) approach that outperforms all competing baselines for both safety and helpfulness. Throughout our extensive experiments, we found that PrimeGuard significantly reduces the trade-off between AI safety and performance, making it a powerful option for productionizing enterprise-grade AI solutions in compliance with emerging regulation.
Presented at ICML 2024 NextGenAISafety workshop.
Hi @eliolio thanks for publishing your artifacts on the hub!
Would be great to link the dataset to this paper, see here on how to do that: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper.
Cheers,
Niels
Open-source @ HF
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training (2024)
- SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models (2024)
- Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations (2024)
- CSRT: Evaluation and Analysis of LLMs using Code-Switching Red-Teaming Dataset (2024)
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (2024)