Claude Sonnet 5 API: quick-start in minutes
Everything you need to call claude-sonnet-5 through the Messages API - the model id, working curl and Python examples, the effort parameter, adaptive thinking, and how to re-tune max_tokens for the new tokenizer.
Quick start
Sonnet 5 is a drop-in on the standard Messages API. If you already call Opus 4.8 or Sonnet 4.6, you are three small steps away.
Set the model id
Use claude-sonnet-5 - a pinned, dateless snapshot with no -v1 suffix. No other change is required for a basic call.
Drop sampling params
Remove temperature, top_p and top_k. Sonnet 5 rejects them with HTTP 400, exactly like Opus 4.7 and later.
Keep thinking adaptive
Adaptive thinking is on by default. Do not send a manual extended-thinking configuration - an explicit thinking config returns HTTP 400. Tune depth with effort instead.
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-5",
    "max_tokens": 4096,
    "thinking": { "type": "adaptive" },
    "effort": "high",
    "messages": [
      { "role": "user", "content": "Refactor this module and add tests." }
    ]
  }'
# NOTE: do NOT send temperature / top_p / top_k -> HTTP 400
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY

resp = client.messages.create(
    model="claude-sonnet-5",
    max_tokens=4096,
    thinking={"type": "adaptive"},  # on by default
    effort="high",                  # low | medium | high | xhigh | max
    messages=[
        {"role": "user", "content": "Explain this stack trace and propose a fix."}
    ],
    # temperature / top_p / top_k omitted -> sending them returns HTTP 400
)

# With thinking enabled, the first content block is not guaranteed to be text,
# so print only the text blocks instead of indexing content[0].
print("".join(block.text for block in resp.content if block.type == "text"))
Both examples omit temperature, top_p and top_k on purpose: sending any of them to claude-sonnet-5 returns HTTP 400. Adaptive thinking (thinking={"type":"adaptive"}) is the default, and the effort field controls how much reasoning the model spends per request.
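If you are migrating existing request payloads, the three quick-start steps can be sketched as a small dictionary transform. This is a minimal illustration rather than an official migration tool: the helper name `migrate_to_sonnet_5` and the placeholder model id in the sample payload are made up for this example, and the field names follow the request shape shown in the examples above.

```python
import copy

# Sampling parameters that this guide says are rejected with HTTP 400.
SAMPLING_PARAMS = ("temperature", "top_p", "top_k")


def migrate_to_sonnet_5(request: dict) -> dict:
    """Return an adjusted copy of an older Messages API payload (hypothetical helper)."""
    migrated = copy.deepcopy(request)  # leave the caller's payload untouched

    # Step 1: set the model id.
    migrated["model"] = "claude-sonnet-5"

    # Step 2: drop sampling parameters if present.
    for name in SAMPLING_PARAMS:
        migrated.pop(name, None)

    # Step 3: replace any manual thinking config with the adaptive setting.
    if "thinking" in migrated:
        migrated["thinking"] = {"type": "adaptive"}

    return migrated


old_request = {
    "model": "your-previous-model-id",  # placeholder, not a real model id
    "max_tokens": 4096,
    "temperature": 0.2,
    "top_p": 0.9,
    "thinking": {"type": "enabled", "budget_tokens": 2048},  # manual config
    "messages": [{"role": "user", "content": "Summarize this changelog."}],
}

new_request = migrate_to_sonnet_5(old_request)
print(sorted(new_request))
```

The transform leaves max_tokens alone on purpose; that value should be re-measured against the new tokenizer rather than copied over.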
Re-tune your max_tokens
Sonnet 5 ships a new tokenizer. The same text consumes about 30% more tokens than Sonnet 4.6, so limits and budgets copied from older code will bite.
Raise your output cap
A max_tokens value that comfortably fit a Sonnet 4.6 response may now truncate. Give roughly 30% more headroom, up to the 128K output ceiling (300K via the output-300k-2026-03-24 batches beta header).
Re-check context budgets
The context window is 1M tokens - both the default and the max, with no smaller variant - but the same prompt text now takes up about 30% more of it, so re-measure with token counting rather than reusing counts from Sonnet 4.6.
Re-estimate cost per request
Because the same text is ~30% more tokens, the intro price of $2/MTok in + $10/MTok out works out to about $2.60 in + $13 out per million Sonnet 4.6-equivalent tokens. That is roughly 13% below Sonnet 4.6's $3/$15 on identical text - not a flat 33% discount.
Introductory pricing ($2/MTok input + $10/MTok output) runs through Aug 31, 2026; standard pricing is $3/MTok input + $15/MTok output from Sep 1, 2026. Cache reads are $0.20/MTok intro and $0.30/MTok standard; a 5-minute cache write costs 1.25x base input and a 1-hour cache write costs 2x base input. Always size max_tokens against the new tokenizer, not your old counts.
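The headroom and cost arithmetic in this section can be sketched in a few lines. The sketch assumes the guide's figures: a flat 30% token increase (your own text may differ, so measure it), a 128K output ceiling read as 128,000 tokens, and the prices listed above. The function names and the sample token counts are illustrative only.

```python
import math

TOKEN_INFLATION = 1.30    # "about 30% more tokens", per this guide
OUTPUT_CEILING = 128_000  # 128K output ceiling, assumed to mean 128,000 tokens


def retune_max_tokens(old_max_tokens: int) -> int:
    """Add ~30% headroom to an older max_tokens value, capped at the ceiling."""
    return min(math.ceil(old_max_tokens * TOKEN_INFLATION), OUTPUT_CEILING)


def scale_tokens(old_tokens: int) -> int:
    """Estimate the new token count for text measured with the old tokenizer."""
    return round(old_tokens * TOKEN_INFLATION)


def cost_usd(input_tokens: int, output_tokens: int,
             in_price: float, out_price: float) -> float:
    """Request cost in USD, with prices given per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Illustrative request: 10,000 input and 2,000 output tokens on Sonnet 4.6.
old_in, old_out = 10_000, 2_000
new_in, new_out = scale_tokens(old_in), scale_tokens(old_out)

sonnet_46 = cost_usd(old_in, old_out, 3.0, 15.0)
sonnet_5_intro = cost_usd(new_in, new_out, 2.0, 10.0)
sonnet_5_standard = cost_usd(new_in, new_out, 3.0, 15.0)

print(f"max_tokens 4096 -> {retune_max_tokens(4096)}")
print(f"Sonnet 4.6:          ${sonnet_46:.3f}")
print(f"Sonnet 5 (intro):    ${sonnet_5_intro:.3f}")
print(f"Sonnet 5 (standard): ${sonnet_5_standard:.3f}")
```

Under these assumptions the same text costs about 13% less than Sonnet 4.6 at intro pricing and about 30% more at standard pricing. A response whose stop_reason is "max_tokens" is a sign that the cap still needs more headroom.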
Effort levels
The effort parameter tunes how much reasoning Sonnet 5 spends per request. It accepts five levels; the default is high.
| effort | Best for | Latency & output tokens |
|---|---|---|
| low | Classification, extraction, short chat and other latency-sensitive, well-scoped calls. | Lowest latency, fewest reasoning tokens. |
| medium | Everyday assistance and lighter coding where you want a little more deliberation without full depth. | Moderate latency and token spend. |
| high (default) | Most agentic and coding work - the balanced default that pairs speed with strong reasoning. | Balanced; the recommended starting point. |
| xhigh | Hard multi-step debugging, architecture and long-horizon agent runs that reward deeper thinking. | Higher latency, more reasoning tokens. |
| max | The most demanding reasoning and judgment tasks where you want maximum depth regardless of cost. | Highest latency and token spend. |
Start at high and only move down for cheap, latency-sensitive traffic or up for the hardest problems. Sonnet 5 approaches Opus 4.8 quality at a lower price, but Opus 4.8 still leads the hardest coding, judgment and cyber tasks - reach for it when max effort on Sonnet 5 is not enough.
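One way to apply the table in code is a lookup from workload type to effort level that falls back to high. This is only a sketch: the workload labels and the `pick_effort` helper are hypothetical names chosen for illustration, and the mapping is a loose reading of the "Best for" column, not an official recommendation.

```python
# The five effort levels listed in this guide.
EFFORT_LEVELS = ("low", "medium", "high", "xhigh", "max")

# Hypothetical workload labels, loosely following the table above.
EFFORT_BY_WORKLOAD = {
    "classification": "low",
    "extraction": "low",
    "short_chat": "low",
    "everyday_assistance": "medium",
    "light_coding": "medium",
    "agentic_coding": "high",
    "hard_debugging": "xhigh",
    "architecture": "xhigh",
    "max_depth_reasoning": "max",
}


def pick_effort(workload: str, default: str = "high") -> str:
    """Return the effort level for a workload, falling back to the default."""
    effort = EFFORT_BY_WORKLOAD.get(workload, default)
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return effort


for workload in ("classification", "hard_debugging", "something_else"):
    print(workload, "->", pick_effort(workload))
```

Keeping the mapping in one place makes it easy to adjust a single workload after measuring latency and token spend, instead of hunting for hard-coded effort values across call sites.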
Using Sonnet 5 through QCode
QCode relays the same Messages API, so your Sonnet 5 code is unchanged - only the base URL and key differ.
Identical API surface
Same claude-sonnet-5 model id, same effort parameter, same adaptive thinking. Point the SDK at the relay base URL and your existing code just works.
One key, many models
A single QCode key reaches Claude, Codex and Gemini models - no juggling separate provider accounts or billing.
Stable China access
The relay provides low-latency, reliable access from mainland China, so Sonnet 5 works without wrestling with cross-border connectivity.
Drop-in for tools
Works with the anthropic SDK, Claude Code and any client that speaks the Messages API - set the base URL and go.
from anthropic import Anthropic

client = Anthropic(
    base_url="https://relay.qcode.cc",  # QCode relay
    api_key="qk-...",                   # one key for Claude / Codex / Gemini
)

resp = client.messages.create(
    model="claude-sonnet-5",
    max_tokens=4096,
    thinking={"type": "adaptive"},
    effort="high",
    messages=[{"role": "user", "content": "Ship it."}],
)
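Since only the base URL and key differ, it can be convenient to read both from the environment so the same code runs against either endpoint. The sketch below assumes two environment variable names, QCODE_API_KEY and QCODE_BASE_URL, which are invented for this example and are not documented QCode settings; the default URL is the relay address shown above.

```python
import os

DEFAULT_RELAY_URL = "https://relay.qcode.cc"  # relay base URL from the example above


def relay_client_kwargs(env=None) -> dict:
    """Build client keyword arguments from environment variables.

    QCODE_API_KEY and QCODE_BASE_URL are hypothetical variable names.
    """
    env = os.environ if env is None else env
    api_key = env.get("QCODE_API_KEY")
    if not api_key:
        raise RuntimeError("Set QCODE_API_KEY before creating the client")
    return {
        "base_url": env.get("QCODE_BASE_URL", DEFAULT_RELAY_URL),
        "api_key": api_key,
    }


# Example with an explicit mapping instead of the real environment.
kwargs = relay_client_kwargs({"QCODE_API_KEY": "example-key"})
print(kwargs["base_url"])
```

The resulting dictionary can be passed straight to the SDK constructor, for example `Anthropic(**relay_client_kwargs())`, which keeps the key out of source control.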
Frequently asked questions
What is the Claude Sonnet 5 model id?
The model id is claude-sonnet-5 - a pinned, dateless snapshot with no -v1 suffix. On Amazon Bedrock it is anthropic.claude-sonnet-5, and on OpenRouter the slug is anthropic/claude-sonnet-5-20260630. Pass claude-sonnet-5 in the model field of any Messages API request.
Do I need to change my code to call Sonnet 5?
Mostly no - swap the model id to claude-sonnet-5 and your existing Messages API code keeps working. Two things to check: remove any temperature, top_p or top_k fields (they return HTTP 400 on Sonnet 5, same as Opus 4.7+), and do not send manual extended-thinking config - adaptive thinking is on by default and explicit thinking config returns 400. Also re-tune max_tokens, because the new tokenizer emits about 30% more tokens for the same text.
Which effort level should I use with Sonnet 5?
The default is high, which suits most agentic and coding work. Use low or medium for cheap, latency-sensitive calls like classification, extraction and short chat, and xhigh or max for the hardest multi-step reasoning where you want maximum depth. Effort trades latency and output tokens for reasoning depth, so start at high and adjust per workload.
Can I use Claude Sonnet 5 through QCode?
Yes. QCode exposes the same Messages API surface via its relay, so you keep the claude-sonnet-5 model id, the effort parameter and adaptive thinking unchanged - only the base URL and key differ. One QCode key works across Claude, Codex and Gemini models, and the relay provides stable low-latency access from mainland China.