AWS · FinOps for AI · Updated June 2026

Amazon Bedrock Cost Optimization: Token FinOps on AWS


By the CloudFinOpsKit team — building the AWS tool's Amazon Bedrock FinOps module. 10 min read.

AI spend has gone mainstream — and on AWS, Amazon Bedrock is where it lands. The trouble is Bedrock breaks the old FinOps playbook: there's no "instance" to right-size, no resource to delete. The unit of cost is the token, and the bill is driven by how you call the model, not what you provisioned. The good news: once you can see tokens per model, five levers control almost all of the spend.

First, get per-model visibility

Bedrock emits CloudWatch metrics in the AWS/Bedrock namespace, dimensioned by ModelId. Discover what actually ran, then total it:

# Which models were invoked recently?
aws cloudwatch list-metrics --namespace AWS/Bedrock \
  --metric-name Invocations --query "Metrics[].Dimensions[?Name=='ModelId'].Value | []"

# Per model: input/output tokens, invocations, cache reads
#   AWS/Bedrock metrics: InputTokenCount, OutputTokenCount,
#   Invocations, CacheReadInputTokenCount  (dimension ModelId)

From those four numbers you can compute the ratios that matter: output-to-input ratio, tokens per invocation, and cache hit rate. They tell you which lever to pull.

Lever 1 — Prompt caching (the biggest input-side win)

If your traffic is input-dominant — a large system prompt, tool schemas, or RAG context re-sent on every call — and your cache-read rate is near zero, you're paying full input price for tokens you could read from cache at a steep discount. Enable Bedrock prompt caching: mark the stable prefix with a cachePoint block, put the volatile user turn after it. Supported on Claude, Nova and others via the Converse/InvokeModel API.

Lever 2 — Cap output tokens (output bills well above input)

Output tokens cost several times input on every model, so unbounded replies are the easiest leak. Set a sensible maxTokens on the call (most chat workloads finish under ~400), and add a "be concise" system instruction. If you see a high average output-per-invocation in your CloudWatch numbers, this is your fix.

Lever 3 — Right-size the model (Lever 1 of token efficiency)

A frontier model on every request is the GenAI equivalent of running everything on the biggest instance. Many steps — classification, extraction, routing, simple Q&A — run just as well on a smaller, far cheaper model. A frontier model can cost up to ~190× a small one for the same task. Two moves:

Lever 4 — Trim the input context (Lever 2 of token efficiency)

A high average input-per-invocation means a big, repeated context is billed on every call. Tighten retrieval (send the chunks that matter, not the top-20 "just in case"), prune tool schemas to what the step needs, and summarize or window long conversation history instead of replaying the whole transcript.

Lever 5 — Provisioned Throughput, only when it pays

Bedrock Provisioned Throughput reserves model capacity at a flat hourly rate per model unit — cheaper than on-demand only at sustained high utilization. The classic waste pattern is a provisioned throughput with few or zero invocations: a flat bill for capacity nobody uses. For low or spiky volume, stay on on-demand (per-token). If you do provision, size the model units to the p95 of real demand, and if it carries a term commitment you can't reclaim mid-term — route eligible traffic to it or don't renew.

Allocate it: token cost is a showback problem too

Tag Bedrock usage by application/team (via the calling role or application inference profiles) so token spend rolls up to an owner. AI cost you can't attribute is AI cost you can't govern — and "cost per AI feature" is fast becoming a board-level metric.

See your Bedrock token economics automatically. The CloudFinOpsKit AWS Tool includes a dedicated Amazon Bedrock module: it discovers every model you actually invoked from CloudWatch, reports input/output tokens, invocations and cache rate per model, and flags the actionable patterns — low prompt-cache hit rate, output-token bloat, premium-model right-sizing, and under-used Provisioned Throughput — read-only, alongside 70+ other Well-Architected checks.

Want the prompts too? The toolkit ships a 90-prompt FinOps AI Prompt Library for running this analysis with ChatGPT, Amazon Q or Claude.

FAQ

Why is my Bedrock bill mostly output tokens?

Output bills well above input on every model. Unbounded replies are the usual cause — set maxTokens and instruct the model to be concise.

Is prompt caching free?

No, but cache reads bill at a steep discount versus standard input, so for repetitive, input-heavy traffic the net saving is large. It only helps when a stable prefix is re-sent across calls.

On-demand or Provisioned Throughput?

On-demand (per-token) for low or variable volume; Provisioned Throughput only when you'll keep it at high, sustained utilization. Measure invocations on the provisioned ARN before committing.

Related reading: the AWS cost optimization checklist for 2026 · Savings Plans vs Reserved Instances · find unattached EBS volumes