How do I see per-model token usage in Amazon Bedrock?

Amazon Bedrock publishes CloudWatch metrics in the AWS/Bedrock namespace, dimensioned by ModelId: InputTokenCount, OutputTokenCount, Invocations, and CacheReadInputTokenCount. Use 'aws cloudwatch list-metrics --namespace AWS/Bedrock' to see which models actually ran, then get-metric-statistics per model to total tokens and invocations.

Does Amazon Bedrock support prompt caching?

Yes. Bedrock prompt caching lets you mark a stable prompt prefix (system prompt, tool definitions, RAG context) with a cache checkpoint so its tokens bill at a steep discount on cache reads. It is supported on models including Claude and Nova via the cachePoint content block in the Converse/InvokeModel API — ideal for input-dominant, repetitive workloads.

Should I buy Provisioned Throughput for Bedrock?

Only at sustained, high utilization. Provisioned Throughput bills a flat hourly rate per model unit regardless of use, so it is cheaper than on-demand only when you keep it busy. For low or spiky volume, on-demand (per-token) invocation is usually cheaper. A provisioned throughput with few invocations is a classic zombie cost.

AWS · FinOps for AI · Updated June 2026

Amazon Bedrock Cost Optimization: Token FinOps on AWS

By the CloudFinOpsKit team — building the AWS tool's Amazon Bedrock FinOps module. 10 min read.

AI spend has gone mainstream — and on AWS, Amazon Bedrock is where it lands. The trouble is Bedrock breaks the old FinOps playbook: there's no "instance" to right-size, no resource to delete. The unit of cost is the token, and the bill is driven by how you call the model, not what you provisioned. The good news: once you can see tokens per model, five levers control almost all of the spend.

First, get per-model visibility

Bedrock emits CloudWatch metrics in the AWS/Bedrock namespace, dimensioned by ModelId. Discover what actually ran, then total it:

# Which models were invoked recently?
aws cloudwatch list-metrics --namespace AWS/Bedrock \
  --metric-name Invocations --query "Metrics[].Dimensions[?Name=='ModelId'].Value | []"

# Per model: input/output tokens, invocations, cache reads
#   AWS/Bedrock metrics: InputTokenCount, OutputTokenCount,
#   Invocations, CacheReadInputTokenCount  (dimension ModelId)

From those four numbers you can compute the ratios that matter: output-to-input ratio, tokens per invocation, and cache hit rate. They tell you which lever to pull.

Lever 1 — Prompt caching (the biggest input-side win)

If your traffic is input-dominant — a large system prompt, tool schemas, or RAG context re-sent on every call — and your cache-read rate is near zero, you're paying full input price for tokens you could read from cache at a steep discount. Enable Bedrock prompt caching: mark the stable prefix with a cachePoint block, put the volatile user turn after it. Supported on Claude, Nova and others via the Converse/InvokeModel API.

Lever 2 — Cap output tokens (output bills well above input)

Output tokens cost several times input on every model, so unbounded replies are the easiest leak. Set a sensible maxTokens on the call (most chat workloads finish under ~400), and add a "be concise" system instruction. If you see a high average output-per-invocation in your CloudWatch numbers, this is your fix.

Lever 3 — Right-size the model (Lever 1 of token efficiency)

A frontier model on every request is the GenAI equivalent of running everything on the biggest instance. Many steps — classification, extraction, routing, simple Q&A — run just as well on a smaller, far cheaper model. A frontier model can cost up to ~190× a small one for the same task. Two moves:

A/B a smaller sibling (e.g. Claude Haiku, Amazon Nova Lite/Micro, Llama 8B) against real prompts and compare quality.
Route / cascade: answer easy requests on the small model, escalate only the hard ones to the frontier model — typically 60–85% cheaper overall.

Lever 4 — Trim the input context (Lever 2 of token efficiency)

A high average input-per-invocation means a big, repeated context is billed on every call. Tighten retrieval (send the chunks that matter, not the top-20 "just in case"), prune tool schemas to what the step needs, and summarize or window long conversation history instead of replaying the whole transcript.

Lever 5 — Provisioned Throughput, only when it pays

Bedrock Provisioned Throughput reserves model capacity at a flat hourly rate per model unit — cheaper than on-demand only at sustained high utilization. The classic waste pattern is a provisioned throughput with few or zero invocations: a flat bill for capacity nobody uses. For low or spiky volume, stay on on-demand (per-token). If you do provision, size the model units to the p95 of real demand, and if it carries a term commitment you can't reclaim mid-term — route eligible traffic to it or don't renew.

Allocate it: token cost is a showback problem too

Tag Bedrock usage by application/team (via the calling role or application inference profiles) so token spend rolls up to an owner. AI cost you can't attribute is AI cost you can't govern — and "cost per AI feature" is fast becoming a board-level metric.

See your Bedrock token economics automatically. The CloudFinOpsKit AWS Tool includes a dedicated Amazon Bedrock module: it discovers every model you actually invoked from CloudWatch, reports input/output tokens, invocations and cache rate per model, and flags the actionable patterns — low prompt-cache hit rate, output-token bloat, premium-model right-sizing, and under-used Provisioned Throughput — read-only, alongside 70+ other Well-Architected checks.

Want the prompts too? The toolkit ships a 90-prompt FinOps AI Prompt Library for running this analysis with ChatGPT, Amazon Q or Claude.

FAQ

Why is my Bedrock bill mostly output tokens?

Output bills well above input on every model. Unbounded replies are the usual cause — set maxTokens and instruct the model to be concise.

Is prompt caching free?

No, but cache reads bill at a steep discount versus standard input, so for repetitive, input-heavy traffic the net saving is large. It only helps when a stable prefix is re-sent across calls.

On-demand or Provisioned Throughput?

On-demand (per-token) for low or variable volume; Provisioned Throughput only when you'll keep it at high, sustained utilization. Measure invocations on the provisioned ARN before committing.