Week 15 · Updated Monday, April 6, 2026

The honesty scoreboard
for frontier AI.

Six peer-reviewed benchmarks. Twenty models. One composite score. Updated weekly. Sycoindex is the evidence layer that the 42 state AGs, the NAIC, NIST, and every AI general counsel will ask for.

Current leader: Claude Opus 4.6 · 78.4 · Models scored: 20 · Prompts / week: 3,214
Built on Stanford ELEPHANT (Cheng et al., arXiv:2505.13995)
Aligned with the NIST AI RMF Critical Infrastructure Profile
Filed under ISO/IEC 42001 + 42006:2025
Tracks the Dec 9, 2025 42-AG letter

Live leaderboard · top 10

Composite = weighted mean across six published sycophancy benchmarks (weights listed in the per-benchmark breakdown). Higher is better.
See all 20 models + methodology →
#   Model              Org              Composite  ELEPHANT  SycBench  BrokenMath  Δ wk
1   Claude Opus 4.6    Anthropic        78.4       82.1      79.8      77.1        +1.2
2   GPT-5              OpenAI           74.9       78.6      76.4      73.4        +0.6
3   Claude Sonnet 4.6  Anthropic        73.1       77.3      74.2      71.6        +0.9
4   Gemini 2.5 Pro     Google DeepMind  68.7       72.4      70.1      67.4        0.0
5   GPT-5 Mini         OpenAI           66.2       69.8      67.6      65.4        +0.3
6   Claude Haiku 4.5   Anthropic        65.1       68.2      66.4      64.6        +1.1
7   Llama 4 Maverick   Meta             61.8       64.3      62.9      62.9        −0.4
8   Mistral Large 3    Mistral AI       59.4       62.1      60.8      59.7        +0.2
9   DeepSeek V3.2      DeepSeek         58.1       60.9      59.3      58.2        +1.6
10  Gemini 2.5 Flash   Google DeepMind  56.8       59.4      57.9      57.6        0.0
ELEPHANT: Stanford · 5 dimensions
SycBench: User-pressure robustness
Syco-bench: Anthropic · 4-task battery
SYCON-Bench: Multi-turn drift
SycEval: Medical + legal
BrokenMath: Mathematical sycophancy
Week 15 · Full per-benchmark breakdown · Reference model
ELEPHANT: 80.0 (±1.4 · weight 0.25)
SycBench: 78.0 (±1.8 · weight 0.20)
Syco-bench: 80.0 (±1.6 · weight 0.15)
SYCON-Bench: 80.0 (±2.1 · weight 0.15)
SycEval: 77.0 (±1.9 · weight 0.15)
BrokenMath: 72.5 (±2.3 · weight 0.10)
Weighted composite: 78.4 ± 0.6 (95% CI, 1,000-resample bootstrap) → grade band Good.
Equal-weights composite (published as a weighting-neutrality check): 77.9 ± 0.6 → grade band Good. Delta vs. weighted = 0.5, within the noise envelope: the 0.25 editorial weight on ELEPHANT does not materially move the headline number.
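As a sanity check, both composites can be recomputed directly from the six per-benchmark scores. This is an illustrative sketch, not the official Sycoindex pipeline; only the scores and weights are taken from the breakdown above.

```python
# Recompute the Week 15 composites from the published per-benchmark
# scores and weights (values from the breakdown above).
scores = {
    "ELEPHANT":    (80.0, 0.25),
    "SycBench":    (78.0, 0.20),
    "Syco-bench":  (80.0, 0.15),
    "SYCON-Bench": (80.0, 0.15),
    "SycEval":     (77.0, 0.15),
    "BrokenMath":  (72.5, 0.10),
}

# Editorial weights must sum to 1 for the composite to be a mean.
assert abs(sum(w for _, w in scores.values()) - 1.0) < 1e-9

weighted = sum(s * w for s, w in scores.values())
equal = sum(s for s, _ in scores.values()) / len(scores)

print(f"weighted composite: {weighted:.1f}")       # 78.4
print(f"equal-weights composite: {equal:.1f}")     # 77.9
print(f"delta: {weighted - equal:.1f}")            # 0.5
```

The 0.5-point delta reproduces exactly, which is the weighting-neutrality claim in numbers: moving ELEPHANT from 1/6 ≈ 0.167 to 0.25 shifts the headline by half a point.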
Chain head: 5bc9a20f…f0f54e5. Reproduces deterministically in ~10 minutes on free Colab — reproduce Week 15 →
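The chain head is the tip of a hash chain over the weekly result artifacts: a deterministic re-run must reproduce the same digest. A minimal sketch of how such a chain could be computed and checked, assuming canonical-JSON weekly records (the genesis label and record fields are illustrative assumptions, not the published format):

```python
import hashlib
import json

def chain_head(weekly_records: list) -> str:
    """Fold each week's canonical JSON into a SHA-256 hash chain.

    The final hex digest is the chain head; changing any prior week's
    record changes the head, so a re-run is verified by comparing heads.
    """
    head = hashlib.sha256(b"sycoindex-genesis").digest()  # assumed seed label
    for record in weekly_records:
        # sort_keys + fixed separators makes serialization deterministic.
        canon = json.dumps(record, sort_keys=True, separators=(",", ":"))
        head = hashlib.sha256(head + canon.encode()).digest()
    return head.hex()

# Hypothetical records for two weeks:
weeks = [{"week": 14, "composite": 77.2}, {"week": 15, "composite": 78.4}]
assert chain_head(weeks) == chain_head(list(weeks))  # deterministic
assert chain_head(weeks) != chain_head(weeks[::-1])  # order-sensitive
```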
Disclosure — v0.2.0 single judge. Week 15 was scored by Claude Haiku 4.5 as the sole judge. When Sycoindex scores an Anthropic model, the judge shares a lab with the target. This is the single biggest open criticism of v0.2.0 and it is addressed structurally on Friday April 24, 2026, when v0.2.1 ships a three-judge ensemble (Haiku 4.5 + GPT-5 Mini + Gemini 2.5 Flash) with Chain-of-Thought prompting and a published per-transcript judge-disagreement metric. Read the judge roadmap →
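One simple form the promised per-transcript judge-disagreement metric could take is the maximum pairwise gap between the three judges' scores on a transcript. This is a hypothetical sketch; the actual v0.2.1 metric, judge identifiers, and scoring scale are not yet published.

```python
from itertools import combinations

def judge_disagreement(scores_by_judge: dict) -> float:
    """Max pairwise absolute difference across judges for one transcript.

    0.0 means the ensemble fully agrees; large values flag transcripts
    where a single-judge score would have been unreliable.
    """
    return max(abs(a - b)
               for a, b in combinations(scores_by_judge.values(), 2))

# Hypothetical scores on a 0-100 honesty scale for one transcript:
print(judge_disagreement({
    "claude-haiku-4.5": 74.0,
    "gpt-5-mini": 71.0,
    "gemini-2.5-flash": 78.0,
}))  # → 7.0
```

Publishing this number per transcript is what makes the shared-lab concern auditable: transcripts where the Anthropic judge diverges sharply from the other two can be inspected directly.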

Why now

Four regulatory moves, one honesty gap. Sycoindex is the measurement artifact that closes it.
Dec 9, 2025

42 state AGs to 13 AI companies

Coalition letter demanding documented safeguards against harm to minors and vulnerable users — with sycophancy named as a failure mode.

Jan 1, 2026

Verisk ISO AI exclusions

AI exclusion endorsements effective across E&O and cyber wordings. Deployers need a measurable control their carriers will accept.

Jan 8, 2026

Kentucky v. Character Technologies

First state-level enforcement action against a companion-chatbot provider. Sycophancy is the alleged mechanism of harm.

Apr 7, 2026

NIST Critical Infrastructure Profile

A new concept note opens the AI RMF Profile for Trustworthy AI in Critical Infrastructure. Sycoindex will file a public comment on April 8.

Want to see it for yourself?

Try the public consumer demo: post a dilemma, get 12 personas with different views, and see honest scoring in real time. No signup, no email gate.

Try the demo →