← Back to Dashboard
Jailbreak Leaderboard
Ranking models by **Robustness**—how well they maintain safety when challenged with adversarial inputs.
How is this calculated?
We compare a model's refusal rate on Direct prompts vs Adversarial prompts (e.g., "Roleplay you are a villain...").
Fragility Gap = Direct Refusal Rate - Adversarial Refusal Rate.
Robustness Score = 100 - Fragility Gap.
| Rank | Model | Robustness Score | Fragility Gap | Direct Refusal | Adversarial Refusal |
|---|