AI Cost Governance: Controlling Azure OpenAI & Token Spend
AI cost management is the number-one focus area for FinOps teams in 2026, and the one most teams are least equipped for. The State of FinOps survey reports that nearly all teams now manage AI spend (up from under a third two years earlier), and that AI cost is the top skill they want to add. It is hard because AI breaks the assumptions traditional cloud cost governance was built on. This guide covers what's different, the levers that actually control AI spend, and how to attribute that spend so AI joins your showback and unit economics like any other cost.
Why AI spend breaks the old playbook
- It's token-based, not hour-based. A VM costs the same whether it's busy or idle; an Azure OpenAI call is billed per input and output token. Cost scales with usage and prompt design, which classic rightsizing never modelled.
- It's volatile. One prompt change, one new feature, one retry loop can multiply spend overnight. AI is where your anomaly detection earns its keep.
- It's governed at the model and feature level, not the resource level — the meaningful questions are "which model?" and "which feature is driving tokens?", not "which VM?".
- Attribution is murky. Several features often share one deployment, so the bill alone doesn't tell you who spent what; you need extra telemetry to split it.
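To make the token-based point concrete, here is a minimal sketch of per-call cost. The rates are hypothetical placeholders, not real Azure OpenAI prices, and `TokenRates` and `call_cost` are names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRates:
    """Hypothetical prices in USD per 1 million tokens (not real Azure rates)."""
    input_per_million: float
    output_per_million: float


def call_cost(input_tokens: int, output_tokens: int, rates: TokenRates) -> float:
    """Cost of one call: each token type is billed at its own rate."""
    return (
        input_tokens * rates.input_per_million
        + output_tokens * rates.output_per_million
    ) / 1_000_000


# Assumed rates, chosen only so that output costs several times input.
rates = TokenRates(input_per_million=2.0, output_per_million=8.0)

# Same feature, same reply length; only the prompt design differs.
short_prompt = call_cost(input_tokens=500, output_tokens=300, rates=rates)
long_prompt = call_cost(input_tokens=6_000, output_tokens=300, rates=rates)
```

Under these assumed rates, lengthening the prompt alone makes the call more than four times as expensive, with no change to the infrastructure behind it.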
The levers that actually control AI cost
| Lever | What it does | When to use |
|---|---|---|
| Prompt caching | Cached (repeated) input tokens are billed at a steep discount — often the single biggest saving for input-heavy apps. | Long, stable system prompts; RAG with repeated context; high request volume. |
| Right-size the model | Use the smallest model that meets the quality bar; reserve frontier models for the calls that need them. | Always — most workloads over-spec the model. |
| Cap output tokens | Output tokens are priced several times higher than input tokens; an unbounded max_tokens lets replies (and cost) run away. | Any production call; set a sensible ceiling. |
| PTU vs pay-as-you-go | Provisioned throughput (PTU) is a committed-capacity rate; cheaper per token at high, steady volume, wasteful when under-used. | PTU only when utilization justifies it; PAYG otherwise. |
| Kill zombie deployments | Deployments billed (especially PTU) but serving ~zero requests are pure waste. | Review regularly; delete the unused. |
The highest-leverage and most-missed lever is prompt caching. In apps with a large, stable system prompt and repetitive context, input tokens can dwarf output tokens. If those input tokens aren't being cached, you're paying full rate for the same context on every call.
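A rough way to size the caching opportunity is to compare input cost with and without cache hits. The sketch below assumes a hypothetical full input rate and a hypothetical 50% discount on cached tokens; check your own price sheet for the real figures. The function name `input_cost` is illustrative.

```python
def input_cost(
    input_tokens: int,
    cached_fraction: float,
    full_rate_per_million: float,
    cached_discount: float,
) -> float:
    """Input-token cost when a fraction of tokens is served from the cache.

    cached_discount is the fraction taken off the full rate for cached
    tokens (0.5 means cached tokens cost half). Both rates are assumptions.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    cached_rate = full_rate_per_million * (1.0 - cached_discount)
    return (uncached * full_rate_per_million + cached * cached_rate) / 1_000_000


monthly_input_tokens = 2_000_000_000  # hypothetical input-heavy app

without_cache = input_cost(monthly_input_tokens, 0.0, 2.0, 0.5)
with_cache = input_cost(monthly_input_tokens, 0.8, 2.0, 0.5)
saving_share = 1.0 - with_cache / without_cache
```

With these made-up numbers, an 80% cache-hit rate removes about 40% of input cost. The saving scales with both the hit rate and the discount, so a low hit rate on input-dominant traffic is the number to chase first.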
AI waste, surfaced automatically. The CloudFinOpsKit Tool includes an AI Workloads module that scans your Azure OpenAI / AI Foundry deployments and reports token usage and cost per model, then flags the exact leaks above: zombie deployments (billed, zero requests), under-utilized PTU, low prompt-cache hit rate (input-dominant traffic that isn't being cached), and oversized outputs. It turns "AI is expensive" into specific, costed actions.
Attribute AI cost like any other cost
You can't govern what you can't attribute. Bring AI into your existing model:
- Separate deployments per team/feature where practical — Cost Management then attributes spend cleanly, and you can tag each for showback.
- Per-feature token telemetry when a deployment is shared — log tokens by feature in your app and allocate the shared deployment's cost proportionally.
- Roll AI into the Bill of Cloud and unit economics. Once attributed, AI spend belongs in your Bill of Cloud and in unit metrics — cost per inference, cost per AI-assisted transaction.
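The proportional allocation described above can be sketched as follows. The feature names, token counts, and request counts are hypothetical, and splitting by raw token count is a simplification; weighting input and output tokens by their respective prices would be a reasonable refinement.

```python
def allocate_shared_cost(
    deployment_cost: float, tokens_by_feature: dict[str, int]
) -> dict[str, float]:
    """Split one deployment's cost across features in proportion to tokens."""
    total_tokens = sum(tokens_by_feature.values())
    if total_tokens == 0:
        # No logged usage: nothing to apportion. Any cost stays unallocated.
        return {feature: 0.0 for feature in tokens_by_feature}
    return {
        feature: deployment_cost * tokens / total_tokens
        for feature, tokens in tokens_by_feature.items()
    }


# Hypothetical telemetry logged by the app for one shared deployment.
tokens = {"search": 6_000_000, "summarize": 3_000_000, "chat": 1_000_000}
requests = {"search": 40_000, "summarize": 6_000, "chat": 2_000}

allocated = allocate_shared_cost(1200.0, tokens)

# Unit metric: cost per inference for each feature.
cost_per_inference = {f: allocated[f] / requests[f] for f in allocated}
```

Here the 1,200 USD bill splits into 720, 360, and 120 USD, which can then feed showback and unit metrics like any other attributed cost.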
Guardrails for AI spend
The same governance framework applies, tuned for AI:
- Budgets & anomaly alerts on AI resources specifically — AI's volatility makes early warning essential.
- Default output caps and model choice baked into shared client libraries so teams inherit good defaults.
- Caching on by default for input-heavy patterns.
- A regular AI deployment review — utilization, cache hit rate, and zombies — as part of the monthly cost review.
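The regular deployment review can start as a simple rule check over exported usage data. This is a minimal sketch: the `DeploymentStats` fields, the thresholds, and the sample figures are all assumptions for illustration, not values taken from any Azure API or from the CloudFinOpsKit Tool.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them to your own workloads.
MIN_PTU_UTILIZATION = 0.5
MIN_CACHE_HIT_RATE = 0.3
INPUT_DOMINANT_RATIO = 5.0  # input tokens per output token


@dataclass(frozen=True)
class DeploymentStats:
    name: str
    is_ptu: bool
    monthly_cost: float
    requests: int
    ptu_utilization: float  # 0..1, ignored for pay-as-you-go
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int


def review(d: DeploymentStats) -> list[str]:
    """Return the waste findings that apply to one deployment."""
    if d.monthly_cost > 0 and d.requests == 0:
        return ["zombie"]  # billed with no traffic; other checks do not apply
    findings = []
    if d.is_ptu and d.ptu_utilization < MIN_PTU_UTILIZATION:
        findings.append("under-utilized PTU")
    input_dominant = d.input_tokens > INPUT_DOMINANT_RATIO * max(d.output_tokens, 1)
    hit_rate = d.cached_input_tokens / d.input_tokens if d.input_tokens else 0.0
    if input_dominant and hit_rate < MIN_CACHE_HIT_RATE:
        findings.append("low cache hit rate")
    return findings


fleet = [
    DeploymentStats("idle-ptu", True, 9000.0, 0, 0.0, 0, 0, 0),
    DeploymentStats("rag-chat", False, 2500.0, 120_000, 0.0,
                    900_000_000, 45_000_000, 60_000_000),
]
report = {d.name: review(d) for d in fleet}
```

Running a check like this before the monthly cost review turns the meeting into a list of named deployments and findings rather than a general discussion of AI spend.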
FAQ
Is prompt caching really that significant?
For input-dominant workloads, yes — when the same large context is sent on every call, caching those input tokens can remove a large fraction of input cost. The tell-tale sign is input tokens vastly exceeding output tokens with a low cache-hit rate.
Should we always use the cheapest model?
Use the cheapest model that meets your quality bar — which varies by task. The waste is using a frontier model for work a smaller one handles fine. Route hard calls up, keep the rest down.
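One simple way to route hard calls up and keep the rest down is a default-to-small router baked into a shared client library. In the sketch below the deployment names and the set of "hard" task types are hypothetical; which tasks actually need a frontier model is something to establish from your own quality evaluations.

```python
# Hypothetical deployment names; substitute your own.
SMALL_DEPLOYMENT = "small-deployment"
FRONTIER_DEPLOYMENT = "frontier-deployment"

# Assumed task types that failed the quality bar on the small model.
HARD_TASKS = frozenset({"multi_step_reasoning", "code_generation"})


def pick_deployment(task_type: str, escalate: bool = False) -> str:
    """Default to the small model; go up only for known-hard or escalated calls."""
    if escalate or task_type in HARD_TASKS:
        return FRONTIER_DEPLOYMENT
    return SMALL_DEPLOYMENT


routine = pick_deployment("classification")
hard = pick_deployment("code_generation")
retried = pick_deployment("classification", escalate=True)
```

The `escalate` flag lets a caller retry on the larger model after a failed quality check, so the expensive path is an exception that has to be requested rather than the default.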
When does provisioned throughput (PTU) make sense?
At high, steady volume where committed capacity is cheaper per token than pay-as-you-go and you'll actually use it. Under-used PTU is one of the most common AI waste findings — match the commitment to real utilization.
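The break-even can be estimated with simple arithmetic. The sketch below assumes a hypothetical fixed monthly PTU commitment and a hypothetical blended pay-as-you-go rate (input and output averaged into one figure). It ignores the throughput ceiling of the provisioned capacity and any non-cost benefits, so treat it as a first-pass screen only.

```python
def payg_monthly_cost(monthly_tokens: int, blended_rate_per_million: float) -> float:
    """Pay-as-you-go cost at a single blended USD rate per 1M tokens."""
    return monthly_tokens * blended_rate_per_million / 1_000_000


def break_even_tokens(ptu_monthly_cost: float, blended_rate_per_million: float) -> float:
    """Monthly token volume at which PAYG spend equals the PTU commitment."""
    return ptu_monthly_cost / blended_rate_per_million * 1_000_000


def cheaper_option(
    monthly_tokens: int, ptu_monthly_cost: float, blended_rate_per_million: float
) -> str:
    """Compare the fixed PTU commitment with PAYG at the given volume."""
    payg = payg_monthly_cost(monthly_tokens, blended_rate_per_million)
    return "ptu" if ptu_monthly_cost < payg else "payg"


# Hypothetical figures: 10,000 USD/month commitment, 4 USD per 1M tokens on PAYG.
threshold = break_even_tokens(10_000.0, 4.0)
low_volume = cheaper_option(1_000_000_000, 10_000.0, 4.0)
high_volume = cheaper_option(4_000_000_000, 10_000.0, 4.0)
```

With these assumed numbers the commitment pays off only above 2.5 billion tokens a month. Below that volume the same PTU shows up in a review as an under-utilized deployment.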