Curtail

The Research

Catch the bugs your tests can't see.

AI writes a third of new production code. The bugs it ships are the silent ones.

ReGrade pays for itself with the first bug it catches. Across 16 of 17 LLMs tested, ReGrade catches 4 to 13 extra bugs per trial, at a marginal cost ranging from negative (it pays for itself) to at most 16¢ per extra bug caught, versus an industry-typical $1,500–$50,000 per bug that ships to production.

  • +4 to +13 extra bugs caught per trial, across 16 of 17 LLMs
  • +825% biggest single-model improvement (Haiku 4.5)
  • 18/18 bugs fixed by every top-tier model with ReGrade
  • σ→0: the same answer every run on Sonnet, Opus, GPT-5.5, Qwen3.6-Plus
  • 200 trials in our pre-registered head-to-head test
  • 17 LLMs from 6 AI providers tested

📊 The test, by the numbers.

ReGrade is a behavioral-diff context block your AI coding agent reads alongside its existing prompt — it shows the agent how the new code's runtime behavior differs from the old, the way a human checks for regressions at code review. No model retraining. No changes to your build or CI pipeline.
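
Roughly, think of it like the sketch below (illustrative only: the function names, module layout, and probe strategy are simplified stand-ins, not ReGrade's actual format or implementation):

```python
# Illustrative sketch of the behavioral-diff idea; not ReGrade's real format.
import importlib

def build_behavioral_diff(old_module: str, new_module: str, probe_inputs):
    """Run the same inputs through the old and new version of a function
    and report every case where the observable behavior diverges."""
    old_fn = importlib.import_module(old_module).process  # 'process' is a placeholder name
    new_fn = importlib.import_module(new_module).process

    lines = ["Behavioral diff (old vs. new):"]
    for args in probe_inputs:
        old_out = _observe(old_fn, args)
        new_out = _observe(new_fn, args)
        if old_out != new_out:
            lines.append(f"  input {args!r}: old -> {old_out}, new -> {new_out}")
    # The returned text block is what gets dropped into the coding agent's prompt.
    return "\n".join(lines)

def _observe(fn, args):
    """Treat the return value, or the exception type, as the observable behavior."""
    try:
        return repr(fn(*args))
    except Exception as exc:
        return f"raised {type(exc).__name__}"
```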

What we tested. Each trial = one AI coding agent (running autonomously, the way Claude Code or Codex CLI operate — shell tools, file access, code edits) attempts to fix 18 known bugs planted in a 300,000-line Python codebase. 1,400+ trials across 17 LLMs from 6 AI providers.

  • 17 LLMs
  • 6 AI providers
  • 1,400+ trials
  • 18 known bugs
  • 300,000-line Python codebase
  • +825% biggest gain

🛠️ Works with every major coding agent and model API.

3 agent CLIs tested · 17 LLMs across 6 providers.

Coding-agent CLIs

Anthropic · Claude Code
  • Haiku 4.5
  • Sonnet 4.6
  • Opus 4.6
  • Opus 4.7

OpenAI · Codex CLI
  • GPT-5.4
  • GPT-5.5

Qwen · qwen-code
  • Qwen3-Coder
  • Qwen3-Coder-Plus
  • Qwen3.6-Plus
  • Qwen3.6-Max-Preview

Models tested via API

Google · Gemini
  • Gemini 3 Flash Preview
  • Gemini 3.1 Pro Preview

xAI · Grok
  • Grok 3 Mini
  • Grok 3 Fast
  • Grok 4.20 Reasoning

DeepSeek
  • V4 Pro
  • V4 Flash

One ReGrade context block in the agent's prompt. No model retraining. No changes to your build or CI pipeline. Tested working across all 6 AI providers.

🎯 Every model improves. No exceptions.

+33% to +825% bug-fix improvement across the lineup. Without ReGrade, the best model fixes only 13.3 / 18. With ReGrade, four models hit the ceiling and every other one moves up.

Provider   | Model                   | Without ReGrade | With ReGrade | Δ
Anthropic  | Opus 4.7                | 9.1 / 18        | 18.0 / 18    | +98%
Anthropic  | Sonnet 4.6              | 5.5 / 18        | 17.7 / 18    | +222%
OpenAI     | GPT-5.5                 | 12.3 / 18       | 18.0 / 18    | +46%
OpenAI     | GPT-5.4                 | 13.3 / 18       | 18.0 / 18    | +35%
DeepSeek   | DeepSeek V4 Pro         | 5.3 / 18        | 16.2 / 18    | +206%
xAI        | Grok 4.20 Reasoning     | 3.7 / 18        | 15.7 / 18    | +324%
Google     | Gemini 3.1 Pro Preview  | 8.7 / 18        | 14.3 / 18    | +64%
Alibaba    | Qwen3.6-Max-Preview     | 5.7 / 18        | 12.8 / 18    | +125%
Moonshot ★ | Kimi K2.6 (native CLI)  | 6.2 / 18        | 16.1 / 18    | +160%

Anthropic and OpenAI numbers are from a length-controlled head-to-head extended to n=30 trials per model in v5.5; other providers are from the broader lineup test (n=3–4 trials per model). ★ Kimi K2.6 is a preliminary v6 cell at n=8, reported as engaged-subset means.

✅ The gains are real — proven, not assumed.

In a 600-trial head-to-head test whose design we registered in advance (so the results couldn't be cherry-picked after the fact), ReGrade beat all three comparison conditions — across every model tested.

Control condition                             | Bugs fixed (control) | Bugs fixed (ReGrade)
No extra context (just the test suite)        | 8.6 / 18             | 16.9 / 18
Random gibberish, same length as ReGrade      | 8.7 / 18             | 16.9 / 18
Unrelated source code, same length as ReGrade | 8.4 / 18             | 16.9 / 18

What moves the needle is the behavioral signal in ReGrade, not the volume of extra context. The odds that the difference is a fluke are less than one in a billion trillion.

⚡ Speed: 16–84% less real-world time per trial.

The ReGrade context cuts down on the back-and-forth where the agent runs shell commands and re-reads files trying to figure out what changed — the agent gets to the answer faster.

Provider | Model                   | Without ReGrade | With ReGrade | Speed-up
Alibaba  | Qwen3.6-Max-Preview     | 50m 7s          | 8m 4s        | 84% faster
Google   | Gemini 3.1 Pro Preview  | 10m 22s         | 6m 46s       | 35% faster
OpenAI   | GPT-5.4                 | 4m 1s           | 2m 48s       | 30% faster
OpenAI   | GPT-5.5                 | 5m 17s          | 4m 13s       | 20% faster
Alibaba  | Qwen3.6-Plus            | 25m 26s         | 21m 25s      | 16% faster

Bonus: on four models (Sonnet 4.6, Opus 4.7, GPT-5.5, Qwen3.6-Plus), every trial produces the same outcome; trial-to-trial randomness drops to zero. No more flaky CI retries.

💰 Cost per bug drops 23–89% on top-tier models.

Adding ReGrade reduces LLM input/output cost per trial AND finds more bugs, so the total cost per bug fixed is meaningfully lower with ReGrade than without it on most top-tier models. Even on the four models where it doesn't pay for itself outright, the marginal cost stays at or under 16¢ per extra bug caught, versus an industry-typical $1,500–$50,000 per bug that ships to production (the arithmetic behind both metrics is sketched below the table).

Provider  | Model                   | Without ReGrade ($/bug) | With ReGrade ($/bug) | Change
Alibaba   | Qwen3.6-Max-Preview     | $0.333                  | $0.036               | -89%
xAI       | Grok 4.20 Reasoning     | $0.235                  | $0.077               | -67%
OpenAI    | GPT-5.5                 | $0.156                  | $0.091               | -42%
Anthropic | Sonnet 4.6              | $0.126                  | $0.092               | -27%
OpenAI    | GPT-5.4                 | $0.020                  | $0.015               | -25%
Anthropic | Opus 4.7                | $0.257                  | $0.199               | -23%

Six of eight top-tier models get cheaper per bug with ReGrade, with reductions from −23% to −89%.
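
The arithmetic behind both metrics is simple; here is a minimal sketch (the function names are ours, not part of ReGrade, and the figures you pass in are your own per-trial numbers):

```python
# Minimal arithmetic sketch for the two cost metrics; not a billing tool.
def cost_per_bug(total_llm_cost: float, bugs_fixed: float) -> float:
    """Total LLM input/output spend for a trial, divided by bugs fixed."""
    return total_llm_cost / bugs_fixed

def marginal_cost_per_extra_bug(cost_with: float, cost_without: float,
                                bugs_with: float, bugs_without: float) -> float:
    """Extra spend per extra bug caught when ReGrade is added.
    A negative value means ReGrade pays for itself outright."""
    return (cost_with - cost_without) / (bugs_with - bugs_without)
```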

Want ReGrade for your AI coding agent?

Drop the context block into your agent's prompt. No retraining. No CI changes.
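
As a rough sketch of what that integration looks like (the file name and prompt-assembly step below are placeholders; check your agent's docs for where it reads extra context):

```python
# Placeholder integration sketch: prepend the ReGrade block to the task prompt.
from pathlib import Path

regrade_block = Path("regrade_context.txt").read_text()  # hypothetical file name
task = "Fix the regression the failing tests point at."

prompt = f"{regrade_block}\n\n{task}"  # hand this to your coding agent as usual
```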