Stop choosing LLM providers by vibes.
Agent Duelist is a TypeScript library and CLI that pits OpenAI, Anthropic, Gemini, Azure, and any OpenAI-compatible provider against each other — on the exact prompts and schemas your agents run.
npx duelist init
→
npx duelist run
Readable tables and a clear winner — not 6 browser tabs of docs.
Everyone says "just switch providers." Nobody can show you data.
Provider selection is lobby-chat anecdotes and Twitter hype, not evidence.
Leaderboards test trivia. You need metrics on your prompts and schemas.
Token costs surface on the bill, not during development. Latency? Who knows.
Think of Agent Duelist as Vitest for LLM providers, not another lab leaderboard.
Define an arena with providers, tasks, and scorers. Run all combinations. Collect structured results. Ship with confidence.
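The arena flow above might look like the following sketch. The config shape, field names, and scorer ids here are assumptions for illustration, not the library's published API; run `npx duelist init` to see the real generated config.

```typescript
// Hypothetical shapes -- the actual types live in the library.
interface ProviderSpec { name: string; model: string }
interface TaskSpec { id: string; prompt: string; expected?: unknown }
interface Arena { providers: ProviderSpec[]; tasks: TaskSpec[]; scorers: string[] }

// One arena: every provider runs every task, every scorer grades every run.
const arena: Arena = {
  providers: [
    { name: "openai", model: "gpt-5-mini" },
    { name: "anthropic", model: "claude-sonnet-4.6" },
  ],
  tasks: [
    {
      id: "extract-total",
      prompt: "Return only the invoice total as a number: Subtotal 40, Tax 2.",
      expected: 42,
    },
  ],
  scorers: ["correctness", "latency", "cost"],
};

// 2 providers x 1 task = 2 runs in this toy arena.
console.log(arena.providers.length * arena.tasks.length);
```

The cross product is the point: one task definition, structured results for every provider.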
You keep full code ownership: runs in your project, your CI, your API keys.
From zero to results in seconds. Watch a duel unfold.
Seven scoring dimensions out of the box. Mix and match, or bring your own.
Exact match or deep-equal comparison against expected output.
Normalized response time — fastest provider scores 1.0.
Estimated token cost from a pricing catalog. Cheapest provider scores 1.0.
Let a judge model score freeform outputs on custom criteria.
Validates structured output against your Zod schema.
Track prompt + completion tokens per provider per task.
Pure function: (ctx) => ScoreResult. Full access to input, output, metadata.
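A custom scorer is just that pure function. The exact fields on `ctx` are an assumption here, inferred from the description above (input, output, metadata); only the `(ctx) => ScoreResult` contract comes from the docs.

```typescript
// Hypothetical context shape -- field names are assumptions.
interface ScorerCtx {
  input: string;                 // the prompt sent to the provider
  output: string;                // the provider's raw response
  metadata: { latencyMs: number };
}
interface ScoreResult { score: number; reason?: string }

// Example custom scorer: reward concise answers, taper off past 200 characters.
const conciseness = (ctx: ScorerCtx): ScoreResult => {
  const over = Math.max(0, ctx.output.length - 200);
  return {
    score: Math.max(0, 1 - over / 200),
    reason: `${ctx.output.length} chars`,
  };
};

const result = conciseness({
  input: "Summarize in one sentence.",
  output: "Short and to the point.",
  metadata: { latencyMs: 120 },
});
console.log(result.score); // → 1
```

Because scorers are pure functions, they are trivial to unit-test on their own before wiring them into an arena.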
Curated benchmark suites — run with --pack, zero task writing needed.
Each pack ships with recommended scorers tuned for its domain.
6 tasks — Zod schema stress test: flat objects, nesting, arrays, enums, empty arrays, adversarial input.
4 tasks — Function invocation accuracy: single calls, complex params, tool selection, parallel calls.
5 tasks — Logic, math, multi-step thinking: arithmetic, deduction, data interpretation, critical path.
Your config only needs providers — packs supply tasks and scorers. 15 benchmark tasks ready to go.
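A providers-only config for a pack run might look like this; the shape is an assumption for illustration. Tasks and scorers come from the pack, so none are declared here.

```typescript
// Hypothetical providers-only config; run with: npx duelist run --pack <name>
const config = {
  providers: [
    { name: "openai", model: "gpt-5-mini" },
    { name: "google", model: "gemini-3-flash" },
  ],
  // no tasks, no scorers -- the pack supplies both
};

// 2 providers x 15 pack tasks = 30 runs across all three packs.
console.log(config.providers.length * 15);
```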
duelist ci runs your arena, compares results against baselines, and fails the build on regressions. Ship prompt changes with the same confidence as code changes.
Agent Duelist CI Results
| Provider | Correct | Latency | Cost | Status |
|---|---|---|---|---|
| gpt-5-mini | 96% | 820ms | $0.0014 | 🟢 |
| claude-sonnet-4.6 | 98% | 650ms | $0.0018 | 🟢 |
| gpt-5-nano | 82% | 310ms | $0.0004 | 🔴 |
| gemini-3-flash | 94% | 440ms | $0.0006 | ⚪ |
🔴 gpt-5-nano: correctness dropped 12% vs baseline
Built-in factories for the big names. Any OpenAI-compatible endpoint just works.
Anything that speaks the OpenAI chat completions API is a valid provider. No adapters, no plugins, no SDK lock-in.
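Concretely, "speaks the OpenAI chat completions API" means accepting a POST to `<baseURL>/chat/completions` with a model and a messages array. The helper below just builds that request shape; the base URL and model are placeholders (a local Ollama server's OpenAI-compatible endpoint is used as the example), not anything specific to Agent Duelist.

```typescript
// Build a standard OpenAI-style chat completions request.
function buildChatRequest(baseURL: string, model: string, prompt: string) {
  return {
    url: `${baseURL}/chat/completions`,
    body: {
      model,
      messages: [{ role: "user" as const, content: prompt }],
    },
  };
}

// Placeholder endpoint: Ollama exposes an OpenAI-compatible API under /v1.
const req = buildChatRequest("http://localhost:11434/v1", "llama3", "ping");
console.log(req.url); // → http://localhost:11434/v1/chat/completions
```

Any server that accepts this request and returns the standard `choices[0].message.content` response shape can be dropped in as a provider.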
Three commands. Five minutes. Real data on which provider wins for your use case.
Quality gates for prompts, just like tests for code. Never ship a regression again.