Open-Source AI Research
See How AI Models Differ in Censorship and Bias
We run identical prompts through every major LLM and measure exactly which models refuse — and which ones don't.
26+
Models Audited
GPT-4, Claude, Llama, Gemini, Grok, and more
1,989
Prompts Tested
Grounded in Wikipedia's controversial issues list
Biweekly
Auto-Updates
Scheduled GitHub Actions keep data fresh
How It Works
A transparent, reproducible pipeline from prompt to insight.
01
Curate Prompts
We select 200 sensitive-but-legitimate questions spanning politics, health, law, and culture — sourced from Wikipedia's list of controversial topics.
02
Run Every Model
The same system prompt hits every LLM via unified API calls. Responses are scored ALLOWED or REMOVED by an independent judge model.
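The run-and-judge step can be sketched as a provider-agnostic loop. Here `ask` and `judge` are hypothetical injected callables standing in for the real API clients, so the sketch stays independent of any one provider's SDK:

```python
from typing import Callable, Dict, List

Verdict = str  # "ALLOWED" or "REMOVED"


def audit(
    prompts: List[str],
    models: List[str],
    ask: Callable[[str, str], str],       # ask(model, prompt) -> raw response (assumed signature)
    judge: Callable[[str, str], Verdict],  # judge(prompt, response) -> verdict (assumed signature)
) -> Dict[str, Dict[str, Verdict]]:
    """Send every prompt to every model, then have an independent
    judge model score each response as ALLOWED or REMOVED."""
    results: Dict[str, Dict[str, Verdict]] = {m: {} for m in models}
    for model in models:
        for prompt in prompts:
            response = ask(model, prompt)
            results[model][prompt] = judge(prompt, response)
    return results
```

Because the two callables are injected, the same loop works for live API calls in production and for stubbed responses in tests.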
03
Visualise the Gap
A paired statistical test (McNemar's) confirms whether refusal differences between models are statistically significant rather than noise. Browse radar charts, heatmaps, and side-by-side disagreement logs.
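A minimal sketch of that significance check, assuming the exact binomial form of McNemar's test over the two discordant counts (prompts where exactly one of the two models refused):

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes.

    b: prompts model A allowed but model B refused
    c: prompts model A refused but model B allowed
    Under H0 (no censorship difference), each discordant prompt
    is equally likely to fall in either cell.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements at all
    k = min(b, c)
    # Exact binomial tail probability, doubled for a two-sided test
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)
```

Only the disagreements matter: prompts both models allow (or both refuse) carry no information about which model censors more, which is why McNemar's test fits this paired refuse/allow design.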
Ready to explore the data?
Pick any two LLMs and instantly compare their censorship profiles, refusal rates, and specific disagreements.
Compare Models
Explore Models
- GPT-4o (OpenAI)
- GPT-4o Mini (OpenAI)
- Claude 3.5 Sonnet (Anthropic)
- Claude 3 Haiku (Anthropic)
- Gemini 2.0 Flash (Google)
- DeepSeek V3 (DeepSeek)
- Qwen 2.5 72B (Alibaba)
- Qwen 2.5 7B (Alibaba)
- Yi 34B (01.AI)
- Mistral Large (Mistral AI)
- Mistral Medium (Mistral AI)
- Gemini 2.5 Pro (Google)
- Gemini 2.0 Flash Lite (Google)
- Claude 3.5 Haiku (Anthropic)
- Mistral Small 3 (Mistral AI)
- Ministral 8B (Mistral AI)
- Qwen Plus (Alibaba)
- Grok 3 (xAI)
- Grok 3 Mini (xAI)
- Llama 3.3 70B (Meta)
- Mistral Small 3.1 (Mistral AI)
- Gemma 3 27B (Google)
- Hermes 3 405B (NousResearch)
- Dolphin Mistral 24B (CognitiveComputations)