Amazon Bedrock Cost Optimization: Token FinOps on AWS
AI spend has gone mainstream — and on AWS, Amazon Bedrock is where it lands. The trouble is Bedrock breaks the old FinOps playbook: there's no "instance" to right-size, no resource to delete. The unit of cost is the token, and the bill is driven by how you call the model, not what you provisioned. The good news: once you can see tokens per model, five levers control almost all of the spend.
First, get per-model visibility
Bedrock emits CloudWatch metrics in the AWS/Bedrock namespace, dimensioned by ModelId. Discover what actually ran, then total it:
# Which models were invoked recently?
aws cloudwatch list-metrics --namespace AWS/Bedrock \
--metric-name Invocations --query "Metrics[].Dimensions[?Name=='ModelId'].Value | []"
# Per model: input/output tokens, invocations, cache reads
# AWS/Bedrock metrics: InputTokenCount, OutputTokenCount,
# Invocations, CacheReadInputTokenCount (dimension ModelId)
From those four numbers you can compute the ratios that matter: output-to-input ratio, tokens per invocation, and cache hit rate. They tell you which lever to pull.
Lever 1 — Prompt caching (the biggest input-side win)
If your traffic is input-dominant — a large system prompt, tool schemas, or RAG context re-sent on every call — and your cache-read rate is near zero, you're paying full input price for tokens you could read from cache at a steep discount. Enable Bedrock prompt caching: mark the stable prefix with a cachePoint block, put the volatile user turn after it. Supported on Claude, Nova and others via the Converse/InvokeModel API.
Lever 2 — Cap output tokens (output bills well above input)
Output tokens cost several times input on every model, so unbounded replies are the easiest leak. Set a sensible maxTokens on the call (most chat workloads finish under ~400), and add a "be concise" system instruction. If you see a high average output-per-invocation in your CloudWatch numbers, this is your fix.
Lever 3 — Right-size the model (Lever 1 of token efficiency)
A frontier model on every request is the GenAI equivalent of running everything on the biggest instance. Many steps — classification, extraction, routing, simple Q&A — run just as well on a smaller, far cheaper model. A frontier model can cost up to ~190× a small one for the same task. Two moves:
- A/B a smaller sibling (e.g. Claude Haiku, Amazon Nova Lite/Micro, Llama 8B) against real prompts and compare quality.
- Route / cascade: answer easy requests on the small model, escalate only the hard ones to the frontier model — typically 60–85% cheaper overall.
Lever 4 — Trim the input context (Lever 2 of token efficiency)
A high average input-per-invocation means a big, repeated context is billed on every call. Tighten retrieval (send the chunks that matter, not the top-20 "just in case"), prune tool schemas to what the step needs, and summarize or window long conversation history instead of replaying the whole transcript.
Lever 5 — Provisioned Throughput, only when it pays
Bedrock Provisioned Throughput reserves model capacity at a flat hourly rate per model unit — cheaper than on-demand only at sustained high utilization. The classic waste pattern is a provisioned throughput with few or zero invocations: a flat bill for capacity nobody uses. For low or spiky volume, stay on on-demand (per-token). If you do provision, size the model units to the p95 of real demand, and if it carries a term commitment you can't reclaim mid-term — route eligible traffic to it or don't renew.
Allocate it: token cost is a showback problem too
Tag Bedrock usage by application/team (via the calling role or application inference profiles) so token spend rolls up to an owner. AI cost you can't attribute is AI cost you can't govern — and "cost per AI feature" is fast becoming a board-level metric.
See your Bedrock token economics automatically. The CloudFinOpsKit AWS Tool includes a dedicated Amazon Bedrock module: it discovers every model you actually invoked from CloudWatch, reports input/output tokens, invocations and cache rate per model, and flags the actionable patterns — low prompt-cache hit rate, output-token bloat, premium-model right-sizing, and under-used Provisioned Throughput — read-only, alongside 70+ other Well-Architected checks.
Want the prompts too? The toolkit ships a 90-prompt FinOps AI Prompt Library for running this analysis with ChatGPT, Amazon Q or Claude.
FAQ
Why is my Bedrock bill mostly output tokens?
Output bills well above input on every model. Unbounded replies are the usual cause — set maxTokens and instruct the model to be concise.
Is prompt caching free?
No, but cache reads bill at a steep discount versus standard input, so for repetitive, input-heavy traffic the net saving is large. It only helps when a stable prefix is re-sent across calls.
On-demand or Provisioned Throughput?
On-demand (per-token) for low or variable volume; Provisioned Throughput only when you'll keep it at high, sustained utilization. Measure invocations on the provisioned ARN before committing.
Related reading: the AWS cost optimization checklist for 2026 · Savings Plans vs Reserved Instances · find unattached EBS volumes