Cost governance · Updated June 2026

AI Cost Governance: Controlling Azure OpenAI & Token Spend

By the CloudFinOpsKit team.

AI cost management is the number-one focus area for FinOps teams in 2026, and the one most teams are least equipped for. The State of FinOps survey reports that nearly all teams now manage AI spend (up from under a third two years earlier), and that AI cost is the top skill they want to add. It is hard because AI breaks the assumptions traditional cloud cost governance was built on. This guide covers what's different, the levers that actually control AI spend, and how to attribute that spend so AI joins your showback and unit economics like any other cost.

Why AI spend breaks the old playbook

The levers that actually control AI cost

| Lever | What it does | When to use |
| --- | --- | --- |
| Prompt caching | Cached (repeated) input tokens are billed at a steep discount, often the single biggest saving for input-heavy apps. | Long, stable system prompts; RAG with repeated context; high request volume. |
| Right-size the model | Use the smallest model that meets the quality bar; reserve frontier models for the calls that need them. | Always; most workloads over-spec the model. |
| Cap output tokens | Output tokens cost several times input tokens; an unbounded max_tokens lets replies (and cost) run away. | Any production call; set a sensible ceiling. |
| PTU vs pay-as-you-go | Provisioned throughput (PTU) is a committed-capacity rate; cheaper per token at high, steady volume, wasteful when under-used. | PTU only when utilization justifies it; PAYG otherwise. |
| Kill zombie deployments | Deployments billed (especially PTU) but serving ~zero requests are pure waste. | Review regularly; delete the unused. |
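To make the output-cap lever concrete, here is a minimal sketch of how a ceiling on output tokens bounds the worst-case cost of a single call. The prices are illustrative placeholders, not real Azure OpenAI rates, and the `Price` and `worst_case_cost` names are hypothetical helpers introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Per-million-token prices in USD. Values used below are assumed, not real rates."""
    input_per_m: float
    output_per_m: float


def request_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call given its input and output token counts."""
    return (input_tokens * price.input_per_m
            + output_tokens * price.output_per_m) / 1_000_000


def worst_case_cost(price: Price, input_tokens: int, max_output_tokens: int) -> float:
    """Upper bound on a call's cost: the reply can never exceed the output cap."""
    return request_cost(price, input_tokens, max_output_tokens)


# Assumed prices: output is 4x input, in line with "several times" above.
PRICE = Price(input_per_m=2.0, output_per_m=8.0)

capped = worst_case_cost(PRICE, input_tokens=1_500, max_output_tokens=500)
loose = worst_case_cost(PRICE, input_tokens=1_500, max_output_tokens=4_000)
print(f"cap=500:  ${capped:.4f} per call at most")
print(f"cap=4000: ${loose:.4f} per call at most")
```

Under these assumed prices, the same prompt has a worst case five times higher with the looser cap, which is why the ceiling is worth setting on every production call.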

The highest-leverage and most-missed lever is prompt caching. In apps with a large, stable system prompt and repetitive context, input tokens can dwarf output tokens. If those input tokens aren't being cached, you're paying full rate for the same context on every call.
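A rough way to size the caching opportunity is to compare input cost at the current cache-hit rate with input cost at a target rate. The sketch below assumes a flat per-million input price and a fractional discount on cached tokens; both numbers are placeholders, since actual cached-token pricing varies by model and should be checked against your own rate card.

```python
def input_cost(input_tokens: int, cache_hit_rate: float,
               price_per_m: float, cached_discount: float) -> float:
    """Input-token cost in USD when a share of tokens is billed at a cached rate.

    cache_hit_rate:  fraction of input tokens served from cache (0..1)
    cached_discount: fractional discount on cached tokens (0..1), an assumed value
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0 and 1")
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    cached_price = price_per_m * (1.0 - cached_discount)
    return (uncached * price_per_m + cached * cached_price) / 1_000_000


# Assumed workload: 1 billion input tokens per month at $2.00 per million,
# with a hypothetical 50% discount on cached tokens.
MONTHLY_INPUT_TOKENS = 1_000_000_000
no_cache = input_cost(MONTHLY_INPUT_TOKENS, 0.0, 2.0, 0.5)
with_cache = input_cost(MONTHLY_INPUT_TOKENS, 0.8, 2.0, 0.5)
saving = no_cache - with_cache
print(f"no caching: ${no_cache:,.0f}  80% hit rate: ${with_cache:,.0f}  saving: ${saving:,.0f}")
```

The saving scales with both the hit rate and the discount, so the same calculation with your real prices shows quickly whether restructuring prompts for cacheability is worth the effort.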

AI waste, surfaced automatically. The CloudFinOpsKit Tool includes an AI Workloads module that scans your Azure OpenAI / AI Foundry deployments and reports token usage and cost per model, then flags the exact leaks above: zombie deployments (billed, zero requests), under-utilized PTU, low prompt-cache hit rate (input-dominant traffic that isn't being cached), and oversized outputs. It turns "AI is expensive" into specific, costed actions.

Attribute AI cost like any other cost

You can't govern what you can't attribute. Bring AI into your existing model:

Guardrails for AI spend

The same governance framework applies, tuned for AI:

FAQ

Is prompt caching really that significant?

For input-dominant workloads, yes — when the same large context is sent on every call, caching those input tokens can remove a large fraction of input cost. The tell-tale sign is input tokens vastly exceeding output tokens with a low cache-hit rate.
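The tell-tale sign described above can be checked mechanically against per-deployment usage totals. This is a minimal sketch: the `UsageSample` shape and both thresholds are assumptions chosen for illustration, not values from any tool or from Azure, and should be tuned to your own traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSample:
    """Token totals for one deployment over a period (hypothetical shape)."""
    deployment: str
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int


def is_caching_candidate(u: UsageSample,
                         min_input_output_ratio: float = 10.0,
                         max_cache_hit_rate: float = 0.2) -> bool:
    """True when input dwarfs output and little of that input is cached.

    Both thresholds are illustrative assumptions.
    """
    if u.input_tokens == 0:
        return False
    ratio = u.input_tokens / max(u.output_tokens, 1)  # avoid division by zero
    hit_rate = u.cached_input_tokens / u.input_tokens
    return ratio >= min_input_output_ratio and hit_rate <= max_cache_hit_rate


samples = [
    UsageSample("rag-chat", 5_000_000, 100_000, 200_000),    # input-heavy, barely cached
    UsageSample("summarizer", 1_000_000, 0, 800_000),        # output-heavy
    UsageSample("cached-bot", 5_000_000, 4_000_000, 100_000),  # already well cached
]
flagged = [s.deployment for s in samples if is_caching_candidate(s)]
print(flagged)
```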

Should we always use the cheapest model?

Use the cheapest model that meets your quality bar — which varies by task. The waste is using a frontier model for work a smaller one handles fine. Route hard calls up, keep the rest down.
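One way to "route hard calls up, keep the rest down" is a tiered fallback: try the cheapest deployment first and escalate only when the answer fails a quality check. The sketch below is an assumption about how that could be wired, not a prescribed design; the deployment names, `call_model`, and `passes_quality_bar` are hypothetical stand-ins for your own client and evaluation logic.

```python
from typing import Callable, Sequence, Tuple


def route(prompt: str,
          call_model: Callable[[str, str], str],
          passes_quality_bar: Callable[[str], bool],
          tiers: Sequence[str] = ("small-deployment", "frontier-deployment"),
          ) -> Tuple[str, str]:
    """Try tiers from cheapest to most capable; return (tier_used, answer).

    The last tier is the fallback and its answer is returned unchecked.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for tier in tiers[:-1]:
        answer = call_model(tier, prompt)
        if passes_quality_bar(answer):
            return tier, answer
    return tiers[-1], call_model(tiers[-1], prompt)


# Stand-in implementations so the sketch runs without any network calls.
def fake_call_model(deployment: str, prompt: str) -> str:
    # Pretend the small model returns nothing usable on "hard" prompts.
    if deployment == "small-deployment" and "hard" in prompt:
        return ""
    return f"{deployment}:{prompt}"


def fake_quality_check(answer: str) -> bool:
    return bool(answer)


easy = route("easy question", fake_call_model, fake_quality_check)
hard = route("hard question", fake_call_model, fake_quality_check)
print(easy[0], hard[0])
```

Escalation means a hard call pays for both attempts, so this pattern only saves money when most traffic clears the bar on the cheaper tier. If hard calls can be recognized up front, routing on task type avoids the double charge.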

When does provisioned throughput (PTU) make sense?

At high, steady volume where committed capacity is cheaper per token than pay-as-you-go and you'll actually use it. Under-used PTU is one of the most common AI waste findings — match the commitment to real utilization.
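A simple break-even estimate helps decide whether a commitment is justified: find the utilization at which the PTU bill equals what the same tokens would cost on pay-as-you-go. The sketch below is deliberately simplified. It assumes a single blended per-token PAYG price and a fixed monthly token capacity for the PTU commitment, whereas real PTU capacity depends on the model and the input/output mix. All figures are placeholders.

```python
def breakeven_utilization(ptu_monthly_cost: float,
                          ptu_capacity_tokens_per_month: float,
                          payg_price_per_m: float) -> float:
    """Fraction of PTU capacity you must use before PTU beats pay-as-you-go."""
    payg_cost_at_full_capacity = ptu_capacity_tokens_per_month * payg_price_per_m / 1_000_000
    if payg_cost_at_full_capacity <= 0:
        raise ValueError("capacity and PAYG price must be positive")
    return ptu_monthly_cost / payg_cost_at_full_capacity


def cheaper_option(utilization: float, breakeven: float) -> str:
    """Return 'PTU' when observed utilization exceeds the break-even point."""
    return "PTU" if utilization > breakeven else "PAYG"


# Assumed figures: $10,000/month commitment, 4 billion tokens/month of capacity,
# and a blended PAYG price of $5.00 per million tokens.
breakeven = breakeven_utilization(10_000.0, 4_000_000_000, 5.0)
print(f"break-even utilization: {breakeven:.0%}")
print(cheaper_option(0.3, breakeven), cheaper_option(0.8, breakeven))
```

A break-even above 1.0 means the commitment can never pay off at that PAYG price, even at full utilization.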
