Open-Source AI Research
We run identical prompts through every major LLM and measure exactly which models refuse — and which ones don't.
A transparent, reproducible pipeline from prompt to insight.
We select 200 sensitive-but-legitimate questions spanning politics, health, law, and culture — sourced from Wikipedia's list of controversial topics.
The same system prompt hits every LLM via unified API calls. Responses are scored ALLOWED or REMOVED by an independent judge model.
Statistical tests (McNemar's) confirm whether differences are real. Browse radar charts, heatmaps, and side-by-side disagreement logs.
Pick any two LLMs and instantly compare their censorship profiles, refusal rates, and specific disagreements.
Compare Models