Open-Source AI Research

See How AI Models Differ in Censorship and Bias

We run identical prompts through every major LLM and measure exactly which models refuse — and which ones don't.

26+
Models Audited
GPT-4, Claude, Llama, Gemini, Grok, and more
1,989
Prompts Tested
Grounded in Wikipedia's controversial issues list
Biweekly
Auto-Updates
Scheduled GitHub Actions keep data fresh

How It Works

A transparent, reproducible pipeline from prompt to insight.

01

Curate Prompts

We select 200 sensitive-but-legitimate questions spanning politics, health, law, and culture — sourced from Wikipedia's list of controversial topics.

02

Run Every Model

The same system prompt hits every LLM via unified API calls. Responses are scored ALLOWED or REMOVED by an independent judge model.

03

Visualise the Gap

Statistical tests (McNemar's) confirm whether differences are real. Browse radar charts, heatmaps, and side-by-side disagreement logs.

Ready to explore the data?

Pick any two LLMs and instantly compare their censorship profiles, refusal rates, and specific disagreements.

Compare Models