Why is AI cost management harder than normal cloud cost?

AI spend is token-based, not resource-hour-based, so it scales with usage in ways traditional rightsizing doesn't capture. It's volatile (a prompt change can multiply cost overnight), it's driven at the model and feature level rather than the VM level, and attribution is hard because many features share one model deployment. That's why the State of FinOps 2026 survey reports AI cost management as both the number-one focus area and the top skill teams want to add.

How do you reduce Azure OpenAI costs?

The biggest levers: enable prompt caching for input-heavy, repetitive prompts (cached input tokens are billed at a large discount); use the smallest model that meets the quality bar; cap output tokens so replies don't run unbounded; choose provisioned throughput (PTU) only when utilization justifies it, otherwise pay-as-you-go; and delete zombie deployments that are billed but unused. Together these often cut Azure OpenAI spend substantially without hurting quality.

How do you attribute AI/token costs to a team or feature?

Use a separate deployment (or separate Azure OpenAI resource) per team or feature where practical, so Cost Management attributes spend cleanly; tag those resources with your allocation tags; and where one deployment is shared, track per-feature token usage in your application telemetry and allocate proportionally. Clean attribution is what lets you put AI cost into showback and unit economics.

Cost governance · Updated June 2026

AI Cost Governance: Controlling Azure OpenAI & Token Spend

By the CloudFinOpsKit team. 9 min read.

AI cost management is the number-one focus area for FinOps teams in 2026 — and the one most are least equipped for. The State of FinOps survey reports that nearly all teams now manage AI spend (up from under a third two years earlier), and that AI cost is the top skill they want to add. The reason it's hard is that AI breaks the assumptions traditional cloud cost governance was built on. This guide covers what's different, the levers that actually control AI spend, and how to attribute it so AI joins your showback and unit economics like any other cost.

Why AI spend breaks the old playbook

It's token-based, not hour-based. A VM costs the same whether it's busy or idle; an Azure OpenAI call is billed per input token and per output token. Cost scales with usage and prompt design, which classic rightsizing never modelled.
It's volatile. One prompt change, one new feature, one retry loop can multiply spend overnight. AI is where your anomaly detection earns its keep.
It's governed at the model and feature level, not the resource level — the meaningful questions are "which model?" and "which feature is driving tokens?", not "which VM?".
Attribution is murky. Many features often share one deployment, so the bill doesn't tell you who spent what without extra work.
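
To make the token-based point concrete, here is a minimal sketch of per-call cost arithmetic. The prices are placeholder values chosen for easy arithmetic, not real Azure OpenAI list prices, and `TokenPrice` and `call_cost` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """USD per 1M tokens. Placeholder values, not real list prices."""
    input_per_million: float
    output_per_million: float


def call_cost(input_tokens: int, output_tokens: int, price: TokenPrice) -> float:
    """Cost of one call: input and output tokens are priced separately."""
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


# Assumed prices for illustration only.
price = TokenPrice(input_per_million=2.0, output_per_million=8.0)

# Same feature, same reply length; only the prompt grew from 1,000 to 6,000 tokens.
before = call_cost(input_tokens=1_000, output_tokens=300, price=price)
after = call_cost(input_tokens=6_000, output_tokens=300, price=price)

print(f"before: ${before:.4f} per call, after: ${after:.4f} per call")
print(f"cost multiplier from the prompt change: {after / before:.2f}x")
```

Nothing about the infrastructure changed between the two calls, which is why a resource-level view would not explain the increase.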

The levers that actually control AI cost

Lever	What it does	When to use
Prompt caching	Cached (repeated) input tokens are billed at a steep discount — often the single biggest saving for input-heavy apps.	Long, stable system prompts; RAG with repeated context; high request volume.
Right-size the model	Use the smallest model that meets the quality bar; reserve frontier models for the calls that need them.	Always — most workloads over-spec the model.
Cap output tokens	Output tokens typically cost several times as much as input tokens, so an unset or overly generous `max_tokens` lets replies (and cost) run away.	Any production call: set a sensible ceiling.
PTU vs pay-as-you-go	Provisioned throughput, bought in provisioned throughput units (PTUs), is reserved capacity at a fixed price: cheaper per token at high, steady volume, wasteful when under-used.	PTU only when utilization justifies it; PAYG otherwise.
Kill zombie deployments	Deployments billed (especially PTU) but serving ~zero requests are pure waste.	Review regularly; delete the unused.

The highest-leverage and most-missed lever is prompt caching. Apps with a large, stable system prompt and repetitive context often send far more input tokens than they get back as output tokens. If those input tokens aren't being cached, you're paying full rate for the same context on every call.
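
A rough way to size the caching opportunity is to price cached and uncached input tokens separately. The sketch below assumes a flat discount on cached input tokens. The 50% discount, the price, and the function name `input_cost` are illustrative assumptions, so check the current rate for your model before relying on the numbers.

```python
def input_cost(
    input_tokens: int,
    cache_hit_rate: float,
    price_per_million: float,
    cached_discount: float,
) -> float:
    """Input-token cost when a share of tokens is billed at a cached rate.

    cache_hit_rate:  fraction of input tokens served from cache (0 to 1).
    cached_discount: fractional discount on cached tokens (assumed, 0 to 1).
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0 and 1")
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    cached_price = price_per_million * (1.0 - cached_discount)
    return (uncached * price_per_million + cached * cached_price) / 1_000_000


# Assumed monthly volume and prices, for illustration only.
monthly_input_tokens = 100_000_000
no_cache = input_cost(monthly_input_tokens, 0.0, price_per_million=2.0, cached_discount=0.5)
with_cache = input_cost(monthly_input_tokens, 0.8, price_per_million=2.0, cached_discount=0.5)

print(f"no caching:   ${no_cache:.2f}")
print(f"80% hit rate: ${with_cache:.2f}")
```

Under these assumptions the saving grows linearly with the hit rate, which is why a low hit rate on input-heavy traffic is worth chasing first.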

AI waste, surfaced automatically. The CloudFinOpsKit Tool includes an AI Workloads module that scans your Azure OpenAI / AI Foundry deployments and reports token usage and cost per model, then flags the exact leaks above: zombie deployments (billed, zero requests), under-utilized PTU, low prompt-cache hit rate (input-dominant traffic that isn't being cached), and oversized outputs. It turns "AI is expensive" into specific, costed actions.

Attribute AI cost like any other cost

You can't govern what you can't attribute. Bring AI into your existing model:

Separate deployments per team/feature where practical — Cost Management then attributes spend cleanly, and you can tag each for showback.
Per-feature token telemetry when a deployment is shared — log tokens by feature in your app and allocate the shared deployment's cost proportionally.
Roll AI into the Cost of Cloud and unit economics. Once attributed, AI spend belongs in your Cost of Cloud and in unit metrics — cost per inference, cost per AI-assisted transaction.
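
The proportional allocation for a shared deployment can be sketched as follows. This is a minimal version that assumes your application telemetry already yields a token total per feature; the feature names, the cost figure, and `allocate_shared_cost` are hypothetical.

```python
def allocate_shared_cost(
    total_cost: float, tokens_by_feature: dict[str, int]
) -> dict[str, float]:
    """Split a shared deployment's cost across features by token share."""
    total_tokens = sum(tokens_by_feature.values())
    if total_tokens == 0:
        # No usage logged: nothing to apportion, so leave the cost unallocated.
        return {feature: 0.0 for feature in tokens_by_feature}
    return {
        feature: total_cost * tokens / total_tokens
        for feature, tokens in tokens_by_feature.items()
    }


# Hypothetical telemetry totals for one billing period.
usage = {"search": 600_000, "summarize": 300_000, "chat": 100_000}
allocation = allocate_shared_cost(total_cost=1_000.0, tokens_by_feature=usage)

for feature, cost in allocation.items():
    print(f"{feature}: ${cost:.2f}")
```

Splitting by raw token count treats input and output tokens as equally expensive. If features differ a lot in their input/output mix, weighting each token type by its price before computing shares would be a fairer variant.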

Guardrails for AI spend

The same governance framework applies, tuned for AI:

Budgets & anomaly alerts on AI resources specifically — AI's volatility makes early warning essential.
Default output caps and model choice baked into shared client libraries so teams inherit good defaults.
Caching on by default for input-heavy patterns.
A regular AI deployment review — utilization, cache hit rate, and zombies — as part of the monthly cost review.
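
The monthly deployment review can be partly mechanised. The sketch below flags zombies, under-utilized PTU, and low cache hit rate on input-heavy traffic. The `DeploymentStats` fields and every threshold are assumptions to tune for your own environment, and the sample figures are invented; this is not how any particular tool implements the checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeploymentStats:
    """Per-deployment figures for one review period (hypothetical shape)."""
    name: str
    monthly_cost: float
    requests: int
    input_tokens: int
    output_tokens: int
    cached_input_tokens: int
    ptu_utilization: Optional[float] = None  # None for pay-as-you-go


def review(
    d: DeploymentStats,
    *,
    min_ptu_utilization: float = 0.6,   # assumed threshold
    min_cache_hit_rate: float = 0.3,    # assumed threshold
    input_heavy_ratio: float = 10.0,    # input:output ratio treated as input-heavy
) -> list[str]:
    """Return the waste findings that apply to one deployment."""
    if d.monthly_cost > 0 and d.requests == 0:
        return ["zombie"]  # billed but unused; other checks are moot

    findings = []
    if d.ptu_utilization is not None and d.ptu_utilization < min_ptu_utilization:
        findings.append("under-utilized PTU")

    input_heavy = d.input_tokens >= input_heavy_ratio * max(d.output_tokens, 1)
    hit_rate = d.cached_input_tokens / d.input_tokens if d.input_tokens else 0.0
    if input_heavy and hit_rate < min_cache_hit_rate:
        findings.append("low cache hit rate")
    return findings


# Invented sample data.
fleet = [
    DeploymentStats("legacy-poc", 500.0, 0, 0, 0, 0),
    DeploymentStats("rag-search", 1_200.0, 50_000, 90_000_000, 3_000_000, 4_500_000),
    DeploymentStats("batch-ptu", 6_000.0, 8_000, 10_000_000, 5_000_000, 0, ptu_utilization=0.25),
]
report = {d.name: review(d) for d in fleet}
for name, findings in report.items():
    print(name, findings or ["ok"])
```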

FAQ

Is prompt caching really that significant?

For input-dominant workloads, yes — when the same large context is sent on every call, caching those input tokens can remove a large fraction of input cost. The tell-tale sign is input tokens vastly exceeding output tokens with a low cache-hit rate.

Should we always use the cheapest model?

Use the cheapest model that meets your quality bar — which varies by task. The waste is using a frontier model for work a smaller one handles fine. Route hard calls up, keep the rest down.
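
One simple way to "route hard calls up" is an explicit allow-list of task types that have failed the quality bar on the smaller model. The deployment names and task labels below are hypothetical placeholders, and a real router might use evaluation results rather than a static set.

```python
# Hypothetical deployment names; substitute your own.
SMALL_DEPLOYMENT = "small-model-deployment"
FRONTIER_DEPLOYMENT = "frontier-model-deployment"

# Assumed task labels that did not meet the quality bar on the small model.
NEEDS_FRONTIER = frozenset({"multi_step_reasoning", "long_form_drafting"})


def choose_deployment(task_type: str) -> str:
    """Default to the small model; escalate only listed task types."""
    if task_type in NEEDS_FRONTIER:
        return FRONTIER_DEPLOYMENT
    return SMALL_DEPLOYMENT


for task in ("classification", "extraction", "multi_step_reasoning"):
    print(f"{task} -> {choose_deployment(task)}")
```

Defaulting to the small model means a new task type has to earn its place on the frontier list, rather than starting there and never being reviewed.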

When does provisioned throughput (PTU) make sense?

At high, steady volume where committed capacity is cheaper per token than pay-as-you-go and you'll actually use it. Under-used PTU is one of the most common AI waste findings — match the commitment to real utilization.
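
A back-of-the-envelope break-even check makes "utilization justifies it" measurable. This sketch is deliberately simplified: it assumes a fixed monthly PTU cost, a known token capacity at 100% utilization, and a single blended pay-as-you-go price. All figures are invented, and real PTU capacity depends on the model and on the input/output mix.

```python
def break_even_utilization(
    ptu_monthly_cost: float,
    capacity_tokens_per_month: float,
    payg_price_per_million: float,
) -> float:
    """Utilization above which the PTU commitment beats pay-as-you-go.

    At utilization u, pay-as-you-go would cost
    u * capacity * price, so break-even is where that equals the PTU cost.
    """
    if capacity_tokens_per_month <= 0 or payg_price_per_million <= 0:
        raise ValueError("capacity and price must be positive")
    payg_cost_at_full_capacity = (
        capacity_tokens_per_month * payg_price_per_million / 1_000_000
    )
    return ptu_monthly_cost / payg_cost_at_full_capacity


# Invented figures for illustration only.
threshold = break_even_utilization(
    ptu_monthly_cost=6_000.0,
    capacity_tokens_per_month=5_000_000_000,
    payg_price_per_million=2.0,
)
observed_utilization = 0.45

print(f"break-even utilization: {threshold:.0%}")
print("keep PTU" if observed_utilization >= threshold else "pay-as-you-go is cheaper")
```

A result above 1.0 would mean the commitment cannot pay for itself at any utilization under the assumed prices.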