Open Source

Stop choosing LLM providers by vibes.

Benchmark providers on your real agent tasks.

Agent Duelist is a TypeScript library and CLI that pits OpenAI, Anthropic, Gemini, Azure, and any OpenAI-compatible provider against each other — on the exact prompts and schemas your agents run.

  • One config file. Every provider. Real data.
  • Measure correctness, latency, tokens, and cost per task.
  • Built-in scorers + LLM-as-judge + custom metrics.
  • Drop into CI — quality gates for your prompts.
duelist run
$ npx duelist run
gpt-5-mini × extract-company: 820ms $0.00014 ✓
claude-sonnet-4.6 × extract-company: 650ms $0.00018 ✓
gemini-3-flash-preview × extract-company: 440ms $0.00006 ✓
 
▶ Agent Duelist Results
Most correct: claude-sonnet-4.6 (avg 98%)
Fastest: gemini-3-flash-preview (avg 440ms)
Cheapest: gemini-3-flash-preview (avg $0.00006)

Readable tables and a clear winner — not 6 browser tabs of docs.

The problem

Everyone says "just switch providers." Nobody can show you data.

  • Gut-feel model choice — provider selection runs on hallway anecdotes and Twitter hype, not evidence.
  • Generic benchmarks — leaderboards test trivia. You need metrics on your own prompts and schemas.
  • Invisible costs — token costs surface on the bill, not during development. Latency? Who knows.

What teams actually need

  • Run the same tasks across multiple providers, side by side.
  • Metrics that match reality: correctness, latency, tokens, cost.
  • Support for tools and structured outputs — agents, not just chat.
  • Something you can drop into CI and re-run as models change.
  • Data you can paste into a PR, not a meeting.

Agent Duelist is closer to Vitest for LLM providers than another lab leaderboard.

One config. Every provider. Real data.

Define an arena with providers, tasks, and scorers. Run all combinations. Collect structured results. Ship with confidence.

  • Providers — any OpenAI-compatible endpoint, or built-in factories for OpenAI, Azure, Anthropic, Gemini.
  • Tasks — your actual prompts, schemas, tools, and expected outputs.
  • Scorers — built-in (latency, cost, correctness) or custom functions.
  • Reporters — console tables, JSON, or markdown for PR comments.

You keep full code ownership: runs in your project, your CI, your API keys.

In code

arena.config.ts
import { defineArena, openai, anthropic } from "agent-duelist"
import { z } from "zod"

export default defineArena({
  providers: [
    openai("gpt-5-mini"),
    anthropic("claude-sonnet-4.6"),
  ],
  tasks: [{
    name: "extract-company",
    prompt: "Extract the company...",
    schema: z.object({
      company: z.string(),
    }),
    expected: { company: "Acme" },
  }],
  scorers: ["latency", "cost", "correctness"],
})

Live demo

From zero to results in seconds. Watch a duel unfold.


Built-in scorers

Seven scoring dimensions out of the box. Mix and match, or bring your own.

  • Correctness — exact match or deep-equal comparison against expected output.
  • Latency — normalized response time; the fastest provider scores 1.0.
  • Cost — estimated token cost from a pricing catalog; the cheapest provider scores 1.0.
  • LLM-as-Judge — a judge model scores freeform outputs on your custom criteria.
  • JSON Schema — validates structured output against your Zod schema.
  • Token Count — tracks prompt + completion tokens per provider per task.
  • Custom — a pure function, (ctx) => ScoreResult, with full access to input, output, and metadata.
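That last, custom scorer is just a function. A minimal sketch, assuming a context shape with `output` and `expected`-style fields — the interface names below are illustrative, not Agent Duelist's published types:

```typescript
// Illustrative types; the real ScorerContext/ScoreResult shapes may differ.
interface ScorerContext {
  input: string;        // prompt sent to the provider
  output: string;       // raw model response
  expected?: unknown;   // the task's expected value, if any
  latencyMs?: number;   // wall-clock response time
}

interface ScoreResult {
  score: number;        // 0..1, higher is better
  reason?: string;      // human-readable explanation
}

// Example custom scorer: fraction of required keywords found in the output.
function keywordCoverage(keywords: string[]) {
  return (ctx: ScorerContext): ScoreResult => {
    const text = ctx.output.toLowerCase();
    const hits = keywords.filter((k) => text.includes(k.toLowerCase()));
    return {
      score: keywords.length === 0 ? 1 : hits.length / keywords.length,
      reason: `matched ${hits.length}/${keywords.length} keywords`,
    };
  };
}
```

Registered alongside the built-ins, a scorer like this gets the same per-task, per-provider treatment as latency or cost.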

Built-in task packs

Curated benchmark suites — run with --pack, zero task writing needed. Each pack ships with recommended scorers tuned for its domain.

  • structured-output — 6 tasks. A Zod schema stress test: flat objects, nesting, arrays, enums, empty arrays, adversarial input.
  • tool-calling — 4 tasks. Function invocation accuracy: single calls, complex params, tool selection, parallel calls.
  • reasoning — 5 tasks. Logic, math, and multi-step thinking: arithmetic, deduction, data interpretation, critical path.

Zero-config benchmarks

terminal
# List available packs
$ npx duelist run --pack list
 
structured-output   6 tasks   Zod schema stress test
tool-calling        4 tasks   Function invocation accuracy
reasoning           5 tasks   Logic, math, multi-step thinking
 
# Run a single pack
$ npx duelist run --pack tool-calling
 
# Combine packs
$ npx duelist run --pack structured-output,reasoning

Your config only needs providers — packs supply tasks and scorers. 15 benchmark tasks ready to go.
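For pack-only runs, the arena file can shrink accordingly. A sketch, reusing the provider factories shown elsewhere on this page:

```typescript
// arena.config.ts for pack runs: no tasks or scorers needed,
// --pack supplies both at run time.
import { defineArena, openai, anthropic } from "agent-duelist"

export default defineArena({
  providers: [
    openai("gpt-5-mini"),
    anthropic("claude-sonnet-4.6"),
  ],
})
```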

Quality gates for your prompts

duelist ci runs your arena, compares against baselines, and fails the build on regressions. Ship prompt changes with the same confidence as code changes.

.github/workflows/duelist.yml
- name: Run duels
  run: npx duelist ci
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  • Regression detection — alerts when scores drop vs. baseline.
  • Cost budgets — fail if total cost exceeds threshold.
  • Auto-comment PRs — posts a comparison table on every push.
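Regression detection is conceptually a diff against the last stored baseline. A hedged sketch of the idea — this is not Agent Duelist's actual implementation, and the 5-point tolerance is an assumed default:

```typescript
// Scores per provider for one metric, e.g. correctness in 0..1.
type RunScores = Record<string, number>;

// Flag providers whose score dropped by more than `tolerance`
// compared to the stored baseline. Illustrative only.
function findRegressions(
  baseline: RunScores,
  current: RunScores,
  tolerance = 0.05,
): string[] {
  return Object.keys(baseline).filter(
    (provider) =>
      provider in current && baseline[provider] - current[provider] > tolerance,
  );
}
```

With a 0.94 baseline and a 0.82 run, a provider trips the gate; a CI step can then exit non-zero and block the merge.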

PR comment preview

duelist bot just now

Agent Duelist CI Results

Provider           Correct  Latency  Cost     Status
gpt-5-mini         96%      820ms    $0.0014  🟢
claude-sonnet-4.6  98%      650ms    $0.0018  🟢
gpt-5-nano         82%      310ms    $0.0004  🔴
gemini-3-flash     94%      440ms    $0.0006

🔴 gpt-5-nano: correctness dropped 12% vs baseline

Works with every provider

Built-in factories for the big names. Any OpenAI-compatible endpoint just works.

  • OpenAI — gpt-5-mini, gpt-5-nano
  • Anthropic — claude-sonnet-4.6
  • Gemini — gemini-3-flash-preview
  • Azure OpenAI — gpt-5.2-chat
  • OpenRouter — any model
  • Custom — self-hosted / local

Adding a provider: 1 line

arena.config.ts
providers: [
  // Built-in factories
  openai("gpt-5-mini"),
  openai("gpt-5-nano"),

  // Any OpenAI-compatible API
  openai("claude-sonnet-4.6", {
    baseURL: "https://api.anthropic.com/v1",
  }),

  // Local / self-hosted
  openai("llama-4-scout", {
    baseURL: "http://localhost:11434/v1",
  }),
]

Anything that speaks the OpenAI chat completions API is a valid provider. No adapters, no plugins, no SDK lock-in.
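Concretely, "speaks the OpenAI chat completions API" means accepting one well-known request shape. A self-contained sketch of that contract — the helper below is illustrative, not part of Agent Duelist:

```typescript
// Build a request for any OpenAI-compatible /chat/completions endpoint.
// Servers like Ollama, vLLM, and OpenRouter all accept this shape.
function buildChatRequest(
  baseURL: string,
  apiKey: string,
  model: string,
  prompt: string,
): { url: string; init: RequestInit } {
  return {
    url: `${baseURL.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Sending it; the standard response nests text at choices[0].message.content.
async function chat(
  baseURL: string,
  apiKey: string,
  model: string,
  prompt: string,
): Promise<string> {
  const { url, init } = buildChatRequest(baseURL, apiKey, model, prompt);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Anything that answers this request correctly can sit in the `providers` array.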

Run your first duel tonight.

Three commands. Five minutes. Real data on which provider wins for your use case.

get started
$ npm install agent-duelist
$ npx duelist init
$ npx duelist run
 
✓ 3 tasks × 2 providers completed
▶ Winner: claude-sonnet-4.6 (98% correct, 650ms)

Add to your CI in 30 seconds

.github/workflows/duelist.yml
name: LLM Quality Gate
on: [push, pull_request]
jobs:
  duel:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx duelist ci
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Quality gates for prompts, just like tests for code. Never ship a regression again.