Curtail

The Research

Catch the bugs your tests can't see.

AI writes a third of new production code. The bugs it ships are the silent ones.

ReGrade pays for itself with the first bug it catches. Across 16 of 17 LLMs tested, ReGrade catches 4 to 13 extra bugs per trial, at a marginal cost ranging from negative (it pays for itself) to at most 16¢ per extra bug caught, versus an industry-typical $1,500–$50,000 per bug that ships to production.

  • +4 to +13 extra bugs caught per trial, across 16 of 17 LLMs
  • +825% biggest single-model improvement (Haiku 4.5)
  • 18/18 bugs fixed by every top-tier model with ReGrade
  • σ→0: the same answer every run on Sonnet, Opus, GPT-5.5, Qwen3.6-Plus
  • 200 trials in our pre-registered head-to-head test
  • 17 LLMs from 6 AI providers tested

📊 The test, by the numbers.

ReGrade is a behavioral-diff context block your AI coding agent reads alongside its existing prompt — it shows the agent how the new code's runtime behavior differs from the old, the way a human checks for regressions at code review. No model retraining. No changes to your build or CI pipeline.
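
Roughly, think of it like the sketch below (illustrative only: the function names, module layout, and probe strategy are simplified stand-ins, not ReGrade's actual format or implementation):

```python
# Illustrative sketch of the behavioral-diff idea; not ReGrade's real format.
import importlib

def build_behavioral_diff(old_module: str, new_module: str, probe_inputs):
    """Run the same inputs through the old and new version of a function
    and report every case where the observable behavior diverges."""
    old_fn = importlib.import_module(old_module).process  # 'process' is a placeholder name
    new_fn = importlib.import_module(new_module).process

    lines = ["Behavioral diff (old vs. new):"]
    for args in probe_inputs:
        old_out = _observe(old_fn, args)
        new_out = _observe(new_fn, args)
        if old_out != new_out:
            lines.append(f"  input {args!r}: old -> {old_out}, new -> {new_out}")
    # The returned text block is what gets dropped into the coding agent's prompt.
    return "\n".join(lines)

def _observe(fn, args):
    """Treat the return value, or the exception type, as the observable behavior."""
    try:
        return repr(fn(*args))
    except Exception as exc:
        return f"raised {type(exc).__name__}"
```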

What we tested. Each trial = one AI coding agent (running autonomously, the way Claude Code or Codex CLI operate — shell tools, file access, code edits) attempts to fix 18 known bugs planted in a 300,000-line Python codebase. 1,400+ trials across 17 LLMs from 6 AI providers.

  • 17 LLMs
  • 6 AI providers
  • 1,400+ trials
  • 18 known bugs
  • 300,000-line Python codebase
  • +825% biggest gain

🛠️ Works with every major coding agent and model API.

3 agent CLIs tested · 17 LLMs across 6 providers.

Coding-agent CLIs

Anthropic · Claude Code
  • Haiku 4.5
  • Sonnet 4.6
  • Opus 4.6
  • Opus 4.7

OpenAI · Codex CLI
  • GPT-5.4
  • GPT-5.5

Qwen · qwen-code
  • Qwen3-Coder
  • Qwen3-Coder-Plus
  • Qwen3.6-Plus
  • Qwen3.6-Max-Preview

Models tested via API

Google · Gemini
  • Gemini 3 Flash Preview
  • Gemini 3.1 Pro Preview

xAI · Grok
  • Grok 3 Mini
  • Grok 3 Fast
  • Grok 4.20 Reasoning

DeepSeek
  • V4 Pro
  • V4 Flash

One ReGrade context block in the agent's prompt. No model retraining. No changes to your build or CI pipeline. Tested working across all 6 AI providers.

🎯 Every model improves. No exceptions.

+33% to +825% bug-fix improvement across the lineup. Without ReGrade, the best model fixes only 13.3 / 18. With ReGrade, four models hit the ceiling and every other one moves up.

Provider   | Model                   | Without ReGrade | With ReGrade | Δ
Anthropic  | Opus 4.7                | 9.1 / 18        | 18.0 / 18    | +98%
Anthropic  | Sonnet 4.6              | 5.5 / 18        | 17.7 / 18    | +222%
OpenAI     | GPT-5.5                 | 12.3 / 18       | 18.0 / 18    | +46%
OpenAI     | GPT-5.4                 | 13.3 / 18       | 18.0 / 18    | +35%
DeepSeek   | DeepSeek V4 Pro         | 5.3 / 18        | 16.2 / 18    | +206%
xAI        | Grok 4.20 Reasoning     | 3.7 / 18        | 15.7 / 18    | +324%
Google     | Gemini 3.1 Pro Preview  | 8.7 / 18        | 14.3 / 18    | +64%
Alibaba    | Qwen3.6-Max-Preview     | 5.7 / 18        | 12.8 / 18    | +125%
Moonshot ★ | Kimi K2.6 (native CLI)  | 6.2 / 18        | 16.1 / 18    | +160%

Anthropic and OpenAI numbers are from a length-controlled head-to-head extended to n=30 trials per model in v5.5; other providers are from the broader lineup test (n=3–4 trials per model). ★ Kimi K2.6 is a preliminary v6 cell at n=8, reported as engaged-subset means.

✅ The gains are real — proven, not assumed.

In a 600-trial head-to-head test whose design we registered in advance (so the results couldn't be cherry-picked after the fact), ReGrade beat all three comparison conditions — across every model tested.

Control condition                             | Bugs fixed (control) | Bugs fixed (ReGrade)
No extra context (just the test suite)        | 8.6 / 18             | 16.9 / 18
Random gibberish, same length as ReGrade      | 8.7 / 18             | 16.9 / 18
Unrelated source code, same length as ReGrade | 8.4 / 18             | 16.9 / 18

What moves the needle is the behavioral signal in ReGrade, not the volume of extra context. The odds that the difference is a fluke are less than one in a billion trillion.

⚡ Speed: 16–84% less real-world time per trial.

The ReGrade context cuts down on the back-and-forth where the agent runs shell commands and re-reads files trying to figure out what changed — the agent gets to the answer faster.

Provider | Model                   | Without ReGrade | With ReGrade | Speed-up
Alibaba  | Qwen3.6-Max-Preview     | 50m 7s          | 8m 4s        | 84% faster
Google   | Gemini 3.1 Pro Preview  | 10m 22s         | 6m 46s       | 35% faster
OpenAI   | GPT-5.4                 | 4m 1s           | 2m 48s       | 30% faster
OpenAI   | GPT-5.5                 | 5m 17s          | 4m 13s       | 20% faster
Alibaba  | Qwen3.6-Plus            | 25m 26s         | 21m 25s      | 16% faster

Bonus: on four models (Sonnet 4.6, Opus 4.7, GPT-5.5, Qwen3.6-Plus), every trial produces the same outcome; trial-to-trial randomness drops to zero. No more flaky CI retries.

💰 Cost per bug drops 23–89% on top-tier models.

Adding ReGrade reduces LLM input/output cost per trial AND finds more bugs, so the total cost per bug fixed is meaningfully lower with ReGrade than without it on most top-tier models. Even on the four models where it doesn't pay for itself outright, the marginal cost stays at or under 16¢ per extra bug caught, versus an industry-typical $1,500–$50,000 per bug that ships to production (the arithmetic behind both metrics is sketched below the table).

Provider  | Model                   | Without ReGrade ($/bug) | With ReGrade ($/bug) | Change
Alibaba   | Qwen3.6-Max-Preview     | $0.333                  | $0.036               | -89%
xAI       | Grok 4.20 Reasoning     | $0.235                  | $0.077               | -67%
OpenAI    | GPT-5.5                 | $0.156                  | $0.091               | -42%
Anthropic | Sonnet 4.6              | $0.126                  | $0.092               | -27%
OpenAI    | GPT-5.4                 | $0.020                  | $0.015               | -25%
Anthropic | Opus 4.7                | $0.257                  | $0.199               | -23%

Six of eight top-tier models get cheaper per bug with ReGrade, with reductions from −23% to −89%.
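
The arithmetic behind both metrics is simple; here is a minimal sketch (the function names are ours, not part of ReGrade, and the figures you pass in are your own per-trial numbers):

```python
# Minimal arithmetic sketch for the two cost metrics; not a billing tool.
def cost_per_bug(total_llm_cost: float, bugs_fixed: float) -> float:
    """Total LLM input/output spend for a trial, divided by bugs fixed."""
    return total_llm_cost / bugs_fixed

def marginal_cost_per_extra_bug(cost_with: float, cost_without: float,
                                bugs_with: float, bugs_without: float) -> float:
    """Extra spend per extra bug caught when ReGrade is added.
    A negative value means ReGrade pays for itself outright."""
    return (cost_with - cost_without) / (bugs_with - bugs_without)
```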

Want ReGrade for your AI coding agent?

Drop the context block into your agent's prompt. No retraining. No CI changes.
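
As a rough sketch of what that integration looks like (the file name and prompt-assembly step below are placeholders; check your agent's docs for where it reads extra context):

```python
# Placeholder integration sketch: prepend the ReGrade block to the task prompt.
from pathlib import Path

regrade_block = Path("regrade_context.txt").read_text()  # hypothetical file name
task = "Fix the regression the failing tests point at."

prompt = f"{regrade_block}\n\n{task}"  # hand this to your coding agent as usual
```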