Week 15 · Updated Monday, April 6, 2026

The honesty scoreboard
for frontier AI.

Six peer-reviewed benchmarks. Twenty models. One composite score. Updated weekly. Sycoindex is the evidence layer that the 42 state AGs, the NAIC, NIST, and every AI general counsel will ask for.

Current leader: Claude Opus 4.6 · 78.4 · Models scored: 20 · Prompts / week: 3,214
Built on Stanford ELEPHANT (Cheng et al., arXiv:2505.13995)
Aligned with the NIST AI RMF Critical Infrastructure Profile
Filed under ISO/IEC 42001 + 42006:2025
Tracks the Dec 9, 2025 42-AG letter

Live leaderboard · top 10

Composite = weighted mean across six published sycophancy benchmarks (weights listed in the per-benchmark breakdown). Higher is better.
See all 20 models + methodology →
#   Model              Org              Composite  ELEPHANT  SycBench  BrokenMath  Δ wk
1   Claude Opus 4.6    Anthropic        78.4       82.1      79.8      77.1        +1.2
2   GPT-5              OpenAI           74.9       78.6      76.4      73.4        +0.6
3   Claude Sonnet 4.6  Anthropic        73.1       77.3      74.2      71.6        +0.9
4   Gemini 2.5 Pro     Google DeepMind  68.7       72.4      70.1      67.4        0.0
5   GPT-5 Mini         OpenAI           66.2       69.8      67.6      65.4        +0.3
6   Claude Haiku 4.5   Anthropic        65.1       68.2      66.4      64.6        +1.1
7   Llama 4 Maverick   Meta             61.8       64.3      62.9      62.9        −0.4
8   Mistral Large 3    Mistral AI       59.4       62.1      60.8      59.7        +0.2
9   DeepSeek V3.2      DeepSeek         58.1       60.9      59.3      58.2        +1.6
10  Gemini 2.5 Flash   Google DeepMind  56.8       59.4      57.9      57.6        0.0
ELEPHANT: Stanford · 5 dimensions
SycBench: User-pressure robustness
Syco-bench: Anthropic · 4-task battery
SYCON-Bench: Multi-turn drift
SycEval: Medical + legal
BrokenMath: Mathematical sycophancy
Week 15 · Full per-benchmark breakdown · Reference model
ELEPHANT: 80.0 (±1.4 · weight 0.25)
SycBench: 78.0 (±1.8 · weight 0.20)
Syco-bench: 80.0 (±1.6 · weight 0.15)
SYCON-Bench: 80.0 (±2.1 · weight 0.15)
SycEval: 77.0 (±1.9 · weight 0.15)
BrokenMath: 72.5 (±2.3 · weight 0.10)
Weighted composite: 78.4 ± 0.6 (95% CI, 1,000-resample bootstrap) → grade band Good.
Equal-weights composite (published as a weighting-neutrality check): 77.9 ± 0.6 → grade band Good. Delta vs. weighted = 0.5, within the noise envelope: the 0.25 editorial weight on ELEPHANT does not materially move the headline number.
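As a sanity check, both composites can be recomputed directly from the six per-benchmark scores. This is an illustrative sketch, not the official Sycoindex pipeline; only the scores and weights are taken from the breakdown above.

```python
# Recompute the Week 15 composites from the published per-benchmark
# scores and weights (values from the breakdown above).
scores = {
    "ELEPHANT":    (80.0, 0.25),
    "SycBench":    (78.0, 0.20),
    "Syco-bench":  (80.0, 0.15),
    "SYCON-Bench": (80.0, 0.15),
    "SycEval":     (77.0, 0.15),
    "BrokenMath":  (72.5, 0.10),
}

# Editorial weights must sum to 1 for the composite to be a mean.
assert abs(sum(w for _, w in scores.values()) - 1.0) < 1e-9

weighted = sum(s * w for s, w in scores.values())
equal = sum(s for s, _ in scores.values()) / len(scores)

print(f"weighted composite: {weighted:.1f}")       # 78.4
print(f"equal-weights composite: {equal:.1f}")     # 77.9
print(f"delta: {weighted - equal:.1f}")            # 0.5
```

The 0.5-point delta reproduces exactly, which is the weighting-neutrality claim in numbers: moving ELEPHANT from 1/6 ≈ 0.167 to 0.25 shifts the headline by half a point.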
Chain head: 5bc9a20f…f0f54e5. Reproduces deterministically in ~10 minutes on free Colab — reproduce Week 15 →
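The chain head is the tip of a hash chain over the weekly result artifacts: a deterministic re-run must reproduce the same digest. A minimal sketch of how such a chain could be computed and checked, assuming canonical-JSON weekly records (the genesis label and record fields are illustrative assumptions, not the published format):

```python
import hashlib
import json

def chain_head(weekly_records: list) -> str:
    """Fold each week's canonical JSON into a SHA-256 hash chain.

    The final hex digest is the chain head; changing any prior week's
    record changes the head, so a re-run is verified by comparing heads.
    """
    head = hashlib.sha256(b"sycoindex-genesis").digest()  # assumed seed label
    for record in weekly_records:
        # sort_keys + fixed separators makes serialization deterministic.
        canon = json.dumps(record, sort_keys=True, separators=(",", ":"))
        head = hashlib.sha256(head + canon.encode()).digest()
    return head.hex()

# Hypothetical records for two weeks:
weeks = [{"week": 14, "composite": 77.2}, {"week": 15, "composite": 78.4}]
assert chain_head(weeks) == chain_head(list(weeks))  # deterministic
assert chain_head(weeks) != chain_head(weeks[::-1])  # order-sensitive
```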
Disclosure — v0.2.0 single judge. Week 15 was scored by Claude Haiku 4.5 as the sole judge. When Sycoindex scores an Anthropic model, the judge shares a lab with the target. This is the single biggest open criticism of v0.2.0 and it is addressed structurally on Friday April 24, 2026, when v0.2.1 ships a three-judge ensemble (Haiku 4.5 + GPT-5 Mini + Gemini 2.5 Flash) with Chain-of-Thought prompting and a published per-transcript judge-disagreement metric. Read the judge roadmap →
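One simple form the promised per-transcript judge-disagreement metric could take is the maximum pairwise gap between the three judges' scores on a transcript. This is a hypothetical sketch; the actual v0.2.1 metric, judge identifiers, and scoring scale are not yet published.

```python
from itertools import combinations

def judge_disagreement(scores_by_judge: dict) -> float:
    """Max pairwise absolute difference across judges for one transcript.

    0.0 means the ensemble fully agrees; large values flag transcripts
    where a single-judge score would have been unreliable.
    """
    return max(abs(a - b)
               for a, b in combinations(scores_by_judge.values(), 2))

# Hypothetical scores on a 0-100 honesty scale for one transcript:
print(judge_disagreement({
    "claude-haiku-4.5": 74.0,
    "gpt-5-mini": 71.0,
    "gemini-2.5-flash": 78.0,
}))  # → 7.0
```

Publishing this number per transcript is what makes the shared-lab concern auditable: transcripts where the Anthropic judge diverges sharply from the other two can be inspected directly.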

Why now

Four regulatory moves, one honesty gap. Sycoindex is the measurement artifact that closes it.
Dec 9, 2025

42 state AGs to 13 AI companies

Coalition letter demanding documented safeguards against harm to minors and vulnerable users — with sycophancy named as a failure mode.

Jan 1, 2026

Verisk ISO AI exclusions

AI exclusion endorsements effective across E&O and cyber wordings. Deployers need a measurable control their carriers will accept.

Jan 8, 2026

Kentucky v. Character Technologies

First state-level enforcement action against a companion-chatbot provider. Sycophancy is the alleged mechanism of harm.

Apr 7, 2026

NIST Critical Infrastructure Profile

A new concept note opens the AI RMF Profile for Trustworthy AI in Critical Infrastructure. Sycoindex will file a public comment on April 8.

Want to see it for yourself?

Try the public consumer demo: post a dilemma, get 12 personas with different views, and see honest scoring in real time. No signup, no email gate.

Try the demo →