Updated weekly

ArgusBench

The LLM leaderboard for real tasks, not synthetic benchmarks. We pick the right model so your agents don't have to.

25 task categories | 59 models tracked | 0 platform runs
How we rank: Each entry combines (1) public benchmarks, (2) our internal eval suite of representative prompts per task, and (3) production telemetry from agents on ArgusFlow. Where the three disagree, our editorial team picks based on real-world reliability. We have no provider partnerships — only what wins on outcome wins on the leaderboard.
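As a rough illustration of combining three signals and flagging disagreement for editorial review, here is a minimal sketch. The 0 to 1 score scale, the equal-weight average, the `disagreement_threshold` value, and every name in the code are assumptions made for illustration; the page does not publish its actual weighting or tie-break rules.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Signals:
    """Three hypothetical scores in [0, 1] for one model on one task."""
    public_benchmark: float
    internal_eval: float
    production_telemetry: float


def rank_entry(signals: Signals, disagreement_threshold: float = 0.15) -> dict:
    """Average the three signals and flag a large spread for manual review."""
    values = [
        signals.public_benchmark,
        signals.internal_eval,
        signals.production_telemetry,
    ]
    spread = max(values) - min(values)
    return {
        "score": round(mean(values), 3),
        # Assumption: a wide spread means the sources disagree.
        "needs_editorial_review": spread > disagreement_threshold,
    }


agree = rank_entry(Signals(0.90, 0.88, 0.86))
disagree = rank_entry(Signals(0.95, 0.70, 0.60))
print(agree)
print(disagree)
```

In this sketch the second entry is flagged because its benchmark score sits far above its telemetry score, which is the kind of case the text says is settled by editorial judgment.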

Reasoning

4 tasks

Agent Planning / Task Decomposition

Break a goal into sub-tasks, decide which agent/tool handles each

example: Decompose: build a Spanish-language product launch plan
1
Claude Opus 4.7 | anthropic/claude-opus-4-7 | frontier

top decomposition + tool-call reasoning

in: $15.00/M | out: $75.00/M | p50: 4.5s
2
o3 (reasoning) | openai/o3 | reasoning

strong at multi-step plans

in: $15.00/M | out: $60.00/M | p50: 8.5s
3
GPT-5 | openai/gpt-5 | frontier

reliable parallel tool use

in: $2.50/M | out: $10.00/M | p50: 3.5s

Financial Analysis

P&L modeling, ratio computation, earnings call summarization

example: Compute YoY revenue growth from this 10-K excerpt
1
Claude Opus 4.7 | anthropic/claude-opus-4-7 | frontier

strongest at numerical reasoning + footnotes

in: $15.00/M | out: $75.00/M | p50: 4.5s
2
o3 (reasoning) | openai/o3 | reasoning

specialized math reasoning

in: $15.00/M | out: $60.00/M | p50: 8.5s
3
GPT-5 | openai/gpt-5 | frontier

reliable on standard finance queries

in: $2.50/M | out: $10.00/M | p50: 3.5s
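For reference, the year-over-year growth named in the example prompt is plain arithmetic: (current - prior) / prior. A minimal sketch follows; the revenue figures are made up for illustration and do not come from any filing.

```python
def yoy_growth(current: float, prior: float) -> float:
    """Year-over-year growth as a fraction: (current - prior) / prior."""
    if prior == 0:
        raise ValueError("prior-period revenue must be non-zero")
    return (current - prior) / prior


# Hypothetical revenue figures, in millions of USD.
growth = yoy_growth(current=1150.0, prior=1000.0)
print(f"{growth:.1%}")  # prints 15.0%
```

A check like this is a cheap way to verify a model's answer on the numeric part of a finance task.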

Multi-step Reasoning

Math, planning, logical deduction, multi-hop questions — anything requiring step-by-step thought

example: Plan a 3-week marketing rollout given these constraints: ...
1
Claude Opus 4.7 | anthropic/claude-opus-4-7 | frontier

top reasoning benchmarks, 1M context

in: $15.00/M | out: $75.00/M | p50: 4.5s
2
o3 (reasoning) | openai/o3 | reasoning

specialized chain-of-thought training

in: $15.00/M | out: $60.00/M | p50: 8.5s
3
DeepSeek R1 | deepseek/deepseek-reasoner | reasoning

reasoning quality at 1/30th the cost

in: $0.55/M | out: $2.19/M | p50: 6.0s
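To make the per-million-token prices concrete, the sketch below estimates the cost of one request and compares two of the listed models. The prices are copied from the entries above; the token counts are hypothetical, and reasoning models may bill additional hidden reasoning tokens as output, so real costs can differ.

```python
# Listed prices in USD per million tokens: (input, output).
PRICES = {
    "anthropic/claude-opus-4-7": (15.00, 75.00),
    "openai/o3": (15.00, 60.00),
    "deepseek/deepseek-reasoner": (0.55, 2.19),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-million-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical request: 2,000 input tokens and 1,000 output tokens.
opus = request_cost("anthropic/claude-opus-4-7", 2000, 1000)
r1 = request_cost("deepseek/deepseek-reasoner", 2000, 1000)
print(f"Opus: ${opus:.5f}  R1: ${r1:.5f}  ratio: {opus / r1:.1f}x")
```

For this token mix the ratio comes out near 32x, in line with the "1/30th the cost" note; a different input/output mix shifts it somewhat.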

Data & Extraction

14 tasks

Audio Transcription

Speech-to-text, meeting transcripts, voicemail capture

example: Transcribe this 30-minute call
1
whisper-large-v3 | groq/whisper-large-v3

Groq-accelerated Whisper, fastest

Classification

Sentiment, intent, ICP scoring, content moderation — single-label or multi-label outputs

example: Score this lead 1-10 against ICP rubric: ...
1
DeepSeek R1 | deepseek/deepseek-reasoner | reasoning

roughly 4.5x cheaper than GPT-5 at listed prices, comparable accuracy on structured scoring

in: $0.55/M | out: $2.19/M | p50: 6.0s
2
Claude Haiku 4.5 | anthropic/claude-haiku-4-5-20251001 | fast

low latency + reliable structured output

in: $0.80/M | out: $4.00/M | p50: 650ms
3
GPT-5 Mini | openai/gpt-5-mini | fast

strong baseline, excellent JSON adherence

in: $0.15/M | out: $0.60/M | p50: 900ms

Content Safety / Moderation

Detect PII, profanity, harmful content, policy violations

example: Does this user message contain PII?
1
Claude Haiku 4.5 | anthropic/claude-haiku-4-5-20251001 | fast

fast + balanced false-positive rate

in: $0.80/M | out: $4.00/M | p50: 650ms
2
GPT-5 Mini | openai/gpt-5-mini | fast

cheap, well-aligned with OpenAI policy

in: $0.15/M | out: $0.60/M | p50: 900ms
3
DeepSeek V3 | deepseek/deepseek-chat | balanced

flexible, low cost

in: $0.14/M | out: $0.28/M | p50: 900ms

Data Cleaning / Normalization

Standardize messy fields — names, addresses, phone numbers, currencies

example: Normalize these company names: "Apple Inc.", "APPLE", "apple computer"
1
GPT-5 Mini | openai/gpt-5-mini | fast

cheap, reliable on bulk normalization

in: $0.15/M | out: $0.60/M | p50: 900ms
2
Gemini 2.5 Flash | google/gemini-2.5-flash | fast

fast + cheap

in: $0.07/M | out: $0.30/M | p50: 650ms
3
DeepSeek V3 | deepseek/deepseek-chat | balanced

high accuracy on entity matching

in: $0.14/M | out: $0.28/M | p50: 900ms

Embedding / Vector Encoding

Convert text to vectors for semantic search, RAG, similarity

example: Embed: "AI agents that take real-world actions"
1
text-embedding-3-small | openai/text-embedding-3-small

cheap, 1536-dim, broad use

2
text-embedding-3-large | openai/text-embedding-3-large

highest quality 3072-dim

3
embed-english-v3.0 | cohere/embed-english-v3.0

specialized for retrieval
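Once text is embedded, semantic search usually reduces to comparing vectors, most commonly by cosine similarity. The sketch below shows that comparison in plain Python. The 4-dimensional vectors and document labels are toy stand-ins invented for illustration; real embeddings from the models above have 1536 or 3072 dimensions and would come from the provider's API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
query = [0.1, 0.3, 0.5, 0.1]
docs = {
    "agents": [0.1, 0.29, 0.52, 0.08],
    "billing": [0.9, 0.05, 0.0, 0.4],
}

# Pick the document whose vector points most nearly the same way as the query.
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # prints agents
```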

Fast Classification (latency-critical)

Real-time intent routing, content moderation, chatbot dispatching

example: Is this user message a billing question? yes/no
1
Llama 3.3 70B (Cerebras) | cerebras/llama3.3-70b | ultra-fast

sub-100ms latency for typical prompts

in: $0.85/M | out: $1.20/M | p50: 90ms
2
Llama 3.3 70B (Groq) | groq/llama-3.3-70b-versatile | frontier

<500ms, excellent throughput

in: $0.59/M | out: $0.79/M | p50: 450ms
3
Claude Haiku 4.5 | anthropic/claude-haiku-4-5-20251001 | fast

650ms p50, very cheap

in: $0.80/M | out: $4.00/M | p50: 650ms

Lead / Prospect Scoring

Score profiles against ICP, compute fit, prioritize outreach lists

example: Score this LinkedIn profile 1-10 against our ICP
1
DeepSeek R1 | deepseek/deepseek-reasoner | reasoning

structured reasoning at lowest cost

in: $0.55/M | out: $2.19/M | p50: 6.0s
2
Claude Haiku 4.5 | anthropic/claude-haiku-4-5-20251001 | fast

fast + reliable JSON output

in: $0.80/M | out: $4.00/M | p50: 650ms
3
GPT-5 Mini | openai/gpt-5-mini | fast

strong baseline

in: $0.15/M | out: $0.60/M | p50: 900ms

Long-Context Synthesis

50+ page docs, full repos, video transcripts — reading and summarizing huge inputs

example: Summarize this 80-page Q3 earnings report into 5 bullet insights
1
Gemini 2.5 Pro | google/gemini-2.5-pro | frontier

1M context with strong needle-in-haystack accuracy

in: $1.25/M | out: $5.00/M | p50: 2.5s
2
Claude Opus 4.7 | anthropic/claude-opus-4-7 | frontier

1M context, best at coherent synthesis

in: $15.00/M | out: $75.00/M | p50: 4.5s
3
GPT-4.1 | openai/gpt-4.1 | frontier

1M context, lower cost than Opus

in: $2.00/M | out: $8.00/M | p50: 2.8s

Question Answering

Single-turn Q&A from a known context — knowledge-base lookups, FAQs

example: Given this docs page, what does the rate_limit param do?
1
Claude Haiku 4.5 | anthropic/claude-haiku-4-5-20251001 | fast

fast + accurate on retrieval-style

in: $0.80/M | out: $4.00/M | p50: 650ms
2
GPT-5 Mini | openai/gpt-5-mini | fast

reliable + cheap

in: $0.15/M | out: $0.60/M | p50: 900ms
3
Gemini 2.5 Flash | google/gemini-2.5-flash | fast

long context for whole-doc grounding

in: $0.07/M | out: $0.30/M | p50: 650ms

Structured Extraction

Convert unstructured text/HTML into JSON schemas — invoices, contracts, web pages, forms

example: Extract {name, role, company, email} from this text...
1
Claude Sonnet 4.6 | anthropic/claude-sonnet-4-6 | balanced

best-in-class JSON output adherence

in: $3.00/M | out: $15.00/M | p50: 1.4s
2
GPT-5 | openai/gpt-5 | frontier

strong tool-use + structured output

in: $2.50/M | out: $10.00/M | p50: 3.5s
3
Gemini 2.5 Pro | google/gemini-2.5-pro | frontier

long context for full-doc extraction

in: $1.25/M | out: $5.00/M | p50: 2.5s
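Whichever model produces the JSON, it is usually worth validating the reply before passing it downstream. The sketch below checks a reply against the {name, role, company, email} shape from the example prompt. The `parse_extraction` helper, the sample reply, and the rule that every field must be a string are assumptions for illustration.

```python
import json

REQUIRED_FIELDS = ("name", "role", "company", "email")


def parse_extraction(raw: str) -> dict:
    """Parse a model reply and return only the required fields, all strings."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for field in REQUIRED_FIELDS:
        if not isinstance(data[field], str):
            raise ValueError(f"field {field!r} must be a string")
    # Drop any extra keys the model may have added.
    return {field: data[field] for field in REQUIRED_FIELDS}


# Hypothetical model reply with one extra key.
reply = (
    '{"name": "Ada Lovelace", "role": "CTO", "company": "Example Co", '
    '"email": "ada@example.com", "confidence": 0.9}'
)
record = parse_extraction(reply)
print(record)
```

A failed check is a natural trigger for a retry or for falling back to the next model in the list.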

Summarization

Compress long text into executive summaries, abstracts, TLDRs

example: Summarize this 5-page article in 100 words
1
Claude Haiku 4.5 | anthropic/claude-haiku-4-5-20251001 | fast

fast + factually consistent

in: $0.80/M | out: $4.00/M | p50: 650ms
2
GPT-5 Mini | openai/gpt-5-mini | fast

cheap baseline

in: $0.15/M | out: $0.60/M | p50: 900ms
3
Gemini 2.5 Flash | google/gemini-2.5-flash | fast

long-context summary quality

in: $0.07/M | out: $0.30/M | p50: 650ms

Translation

Document or sentence translation across 100+ languages

example: Translate this English email to formal Japanese
1
Gemini 2.5 Pro | google/gemini-2.5-pro | frontier

strongest on low-resource languages

in: $1.25/M | out: $5.00/M | p50: 2.5s
2
GPT-5 Mini | openai/gpt-5-mini | fast

reliable on top-50 languages, fast

in: $0.15/M | out: $0.60/M | p50: 900ms
3
DeepSeek V3 | deepseek/deepseek-chat | balanced

surprisingly strong on Asian languages

in: $0.14/M | out: $0.28/M | p50: 900ms

Vision / Image Understanding

OCR, chart reading, screenshot analysis, image description

example: What does this dashboard screenshot show?
1
Claude Opus 4.7 | anthropic/claude-opus-4-7 | frontier

top OCR + chart understanding

in: $15.00/M | out: $75.00/M | p50: 4.5s
2
GPT-5 | openai/gpt-5 | frontier

excellent at structured extraction from images

in: $2.50/M | out: $10.00/M | p50: 3.5s
3
Gemini 2.5 Pro | google/gemini-2.5-pro | frontier

strong on multi-image reasoning

in: $1.25/M | out: $5.00/M | p50: 2.5s

Web Research

Live web search + synthesis — current events, citations, breaking news

example: What did Anthropic announce this week?
1
Sonar Pro | perplexity/sonar-pro | web-grounded

best citations, fastest grounded chat

in: $3.00/M | out: $15.00/M | p50: 4.5s
2
Gemini 2.5 Pro | google/gemini-2.5-pro | frontier

native search grounding, transparent sources

in: $1.25/M | out: $5.00/M | p50: 2.5s
3
Sonar Deep Research | perplexity/sonar-deep-research | web-grounded

long-form research with full citations

in: $8.00/M | out: $30.00/M | p50: 180.0s

Code

3 tasks

Code Generation

Write functions, refactor, debug, generate boilerplate — programming tasks

example: Write a TypeScript function that validates an email...
1
Claude Sonnet 4.6 | anthropic/claude-sonnet-4-6 | balanced

top human-eval scores, excellent code quality

in: $3.00/M | out: $15.00/M | p50: 1.4s
2
DeepSeek V3 | deepseek/deepseek-chat | balanced

strong on code, cheapest in tier

in: $0.14/M | out: $0.28/M | p50: 900ms
3
GPT-5 | openai/gpt-5 | frontier

reliable across languages, good debugging

in: $2.50/M | out: $10.00/M | p50: 3.5s

Code Review / Security Audit

Review pull requests, find bugs, identify security issues

example: Review this Python function for security issues
1
Claude Opus 4.7 | anthropic/claude-opus-4-7 | frontier

thorough security analysis

in: $15.00/M | out: $75.00/M | p50: 4.5s
2
Claude Sonnet 4.6 | anthropic/claude-sonnet-4-6 | balanced

strong PR-style review at lower cost

in: $3.00/M | out: $15.00/M | p50: 1.4s
3
o3 (reasoning) | openai/o3 | reasoning

deep multi-file reasoning

in: $15.00/M | out: $60.00/M | p50: 8.5s

Function Calling / Tool Use

Pick the right tool from a list, fill its arguments, chain calls

example: User wants to book a meeting — call the right calendar tool
1
GPT-5 | openai/gpt-5 | frontier

most reliable tool-use adherence

in: $2.50/M | out: $10.00/M | p50: 3.5s
2
Claude Sonnet 4.6 | anthropic/claude-sonnet-4-6 | balanced

strong + fast tool-use

in: $3.00/M | out: $15.00/M | p50: 1.4s
3
Gemini 2.5 Pro | google/gemini-2.5-pro | frontier

native function calling, parallel tools

in: $1.25/M | out: $5.00/M | p50: 2.5s
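On the application side, tool use comes down to taking the tool name and arguments the model chose and running the matching function. The sketch below shows a provider-neutral dispatcher for the calendar example. The `{"name": ..., "arguments": ...}` payload shape, the two calendar functions, and the `dispatch` helper are all hypothetical; each provider's real tool-call format differs in detail.

```python
import json


def create_event(title: str, start: str, duration_minutes: int = 30) -> str:
    """Hypothetical calendar tool: pretend to book a meeting."""
    return f"Booked '{title}' at {start} for {duration_minutes} min"


def list_events(date: str) -> str:
    """Hypothetical calendar tool: pretend to list a day's meetings."""
    return f"No events on {date}"


# Registry mapping the tool names offered to the model onto local functions.
TOOLS = {"create_event": create_event, "list_events": list_events}


def dispatch(tool_call_json: str) -> str:
    """Run the tool named in a model's tool call with the arguments it supplied."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# Hypothetical tool call, as a model might emit it for "book a meeting".
result = dispatch(
    '{"name": "create_event", '
    '"arguments": {"title": "Sync", "start": "2025-01-06T10:00"}}'
)
print(result)
```

Rejecting unknown tool names, as `dispatch` does, guards against a model inventing a tool that was never offered.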

Creative & Communication

4 tasks

Creative Writing

Cold emails, marketing copy, product descriptions, blog posts — anything requiring voice

example: Write a 3-line cold email pitching our SaaS to a CTO
1
Claude Opus 4.7 | anthropic/claude-opus-4-7 | frontier

most natural voice, low slop

in: $15.00/M | out: $75.00/M | p50: 4.5s
2
GPT-5 | openai/gpt-5 | frontier

versatile across tones

in: $2.50/M | out: $10.00/M | p50: 3.5s
3
Claude Sonnet 4.6 | anthropic/claude-sonnet-4-6 | balanced

close to Opus quality, 5x cheaper

in: $3.00/M | out: $15.00/M | p50: 1.4s

Email / Outreach Drafting

Cold emails, follow-ups, polite replies — short business writing

example: Write a follow-up email after 3 days no response
1
Claude Sonnet 4.6 | anthropic/claude-sonnet-4-6 | balanced

natural voice + on-brand tone

in: $3.00/M | out: $15.00/M | p50: 1.4s
2
GPT-5 Mini | openai/gpt-5-mini | fast

cheap and fast for high-volume drafting

in: $0.15/M | out: $0.60/M | p50: 900ms
3
DeepSeek V3 | deepseek/deepseek-chat | balanced

surprisingly natural at a fraction of Sonnet 4.6's listed price (about 1/21 on input, 1/54 on output)

in: $0.14/M | out: $0.28/M | p50: 900ms

Image Generation

Marketing visuals, illustrations, product mockups

example: A flat-illustration hero image for an AI agent platform
1
gpt-image-1 | openai/gpt-image-1

best text-rendering + adherence

2
dall-e-3 | openai/dall-e-3

more affordable, good general quality

Persona / Character Chat

Long-running conversational agent with personality, memory, brand voice

example: Hey, what should I know about you?
1
Claude Opus 4.7 | anthropic/claude-opus-4-7 | frontier

most natural voice, persistent personality

in: $15.00/M | out: $75.00/M | p50: 4.5s
2
GPT-5 | openai/gpt-5 | frontier

reliable persona adherence

in: $2.50/M | out: $10.00/M | p50: 3.5s
3
Grok 4 | xai/grok-4 | frontier

distinctive voice, real-time X.com signals

in: $3.00/M | out: $15.00/M | p50: 2.2s

Tired of picking the right LLM by hand?

ArgusFlow's pipeline builder uses this leaderboard automatically. Hit "Suggested ✨" and we wire the right model into your agent.
