Why cost management fails without budgets

I’ve worked in Azure environments where nobody knew the monthly bill until finance sent an angry email. No budgets, no alerts, no cost allocation. Just a shared credit card and hope.

Cloud costs are like water damage. By the time you notice, it’s already expensive.

Azure Cost Management gives you the tools to understand and control spend. But those tools only work if you set them up before the bill arrives. In my experience, most organizations don’t have budgets on their subscriptions. The ones that do often only alert at 100%, which is too late to do anything about it.

This post covers what I set up in every Azure environment: budgets with multiple thresholds, tag-based cost allocation enforced through policy, anomaly detection, and a reusable Terraform module that deploys all of it automatically with every landing zone.

  • Set budgets on every subscription and resource group
  • Use tags for cost allocation (mandatory via policy)
  • Enable anomaly alerts for early warning
  • Automate budget creation in landing zone vending
  • Review costs weekly, not monthly

Azure Cost Management overview

Here’s what Cost Management gives you out of the box:

Feature       | What it does                                        | Scope
--------------|-----------------------------------------------------|---------------------------------------------
Cost Analysis | Visualize and break down spend                      | Management Group, Subscription, RG, Resource
Budgets       | Set spend limits with alerts                        | Management Group, Subscription, RG
Alerts        | Budget, anomaly, and credit threshold notifications | Budget-based, Subscription
Exports       | Automated data export                               | To Storage Account
Advisor       | Cost optimization recommendations                   | All resources

The data flow is straightforward:

Azure Resources (Usage)
        │
        ▼
Cost Management API
├── Cost Analysis (dashboards)
├── Budgets (alerts)
├── Exports (storage/Power BI)
└── Advisor (recommendations)

One thing to keep in mind: cost data has an 8-24 hour ingestion delay. You won’t see today’s spend in real time. Budget alerts and anomaly detection work on the data as it becomes available, so there’s always a lag between resource usage and when an alert fires.

None of this costs extra. Cost Management is free for Azure resources (you only pay if you’re analyzing AWS costs through the same tool).

Budget implementation

One important thing to understand: budget alerts are notifications only. They don’t stop anyone from deploying resources or spending more money. A budget at 100% doesn’t block deployments. It tells you the money is gone. If you need hard spending limits, you’ll need to combine budgets with Azure Policy (deny expensive SKUs) or custom automation that reacts to alerts.
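To make that concrete, here is a hedged sketch of what such a deny policy could look like. The policy name and the SKU list are illustrative, not a recommendation; `Microsoft.Compute/virtualMachines/sku.name` is the built-in alias for VM size:

```hcl
# Sketch: deny specific expensive VM SKUs outright.
# Adjust the SKU list to whatever counts as "expensive" in your environment.
resource "azurerm_policy_definition" "deny_expensive_skus" {
  name         = "deny-expensive-vm-skus"
  policy_type  = "Custom"
  mode         = "Indexed"
  display_name = "Deny Expensive VM SKUs"

  policy_rule = jsonencode({
    if = {
      allOf = [
        {
          field  = "type"
          equals = "Microsoft.Compute/virtualMachines"
        },
        {
          field = "Microsoft.Compute/virtualMachines/sku.name"
          in    = ["Standard_M128ms", "Standard_ND96asr_v4"]
        }
      ]
    }
    then = {
      effect = "deny"
    }
  })
}
```

Unlike a budget, this actually blocks the deployment at request time, which is the only true hard limit Azure gives you.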

One threshold isn’t enough. I use four:

  • 50% forecasted: early warning that you’re trending high. You still have time to act.
  • 80% actual: something needs attention. Investigate what’s driving the spend.
  • 100% actual: budget hit. Action group fires, ticket gets created.
  • 120% actual: budget exceeded. This should wake someone up.

The 50% forecasted alert is the one people skip, and it’s the most useful. Azure projects your spend based on current consumption patterns and warns you before you actually hit the threshold.

Subscription budget

resource "azurerm_consumption_budget_subscription" "main" {
  name            = "budget-${var.subscription_name}"
  subscription_id = data.azurerm_subscription.current.id
  amount          = var.monthly_budget
  time_grain      = "Monthly"

  time_period {
    start_date = formatdate("YYYY-MM-01'T'00:00:00Z", timestamp())
    end_date   = "2030-12-31T00:00:00Z"
  }

  # Alert at 50% (forecast)
  notification {
    enabled        = true
    threshold      = 50
    operator       = "GreaterThan"
    threshold_type = "Forecasted"
    contact_emails = var.cost_alert_emails
  }

  # Alert at 80% (actual)
  notification {
    enabled        = true
    threshold      = 80
    operator       = "GreaterThan"
    threshold_type = "Actual"
    contact_emails = var.cost_alert_emails
  }

  # Alert at 100% (actual)
  notification {
    enabled        = true
    threshold      = 100
    operator       = "GreaterThan"
    threshold_type = "Actual"
    contact_emails = var.cost_alert_emails
    contact_groups = [
      azurerm_monitor_action_group.cost_critical.id
    ]
  }

  # Alert at 120% (actual) - budget exceeded
  notification {
    enabled        = true
    threshold      = 120
    operator       = "GreaterThan"
    threshold_type = "Actual"
    contact_emails = var.cost_alert_emails
    contact_groups = [
      azurerm_monitor_action_group.cost_critical.id
    ]
  }

  lifecycle {
    ignore_changes = [time_period]
  }
}

The lifecycle block is important. Without ignore_changes on time_period, Terraform would try to update the start date on every apply, since timestamp() changes each run.

Tip: If you’re on Terraform 1.5+, consider using plantimestamp() instead of timestamp(). It returns the same value throughout the entire plan, which makes plan output more predictable and avoids unnecessary diffs in other resources that reference the same timestamp.
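Assuming Terraform 1.5+, the only change in the budget above would be the function call:

```hcl
# plantimestamp() is stable for the whole plan/apply cycle (Terraform 1.5+)
start_date = formatdate("YYYY-MM-01'T'00:00:00Z", plantimestamp())
```

You still want the ignore_changes lifecycle block either way, since the value changes between runs.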

Notice that the 100% and 120% thresholds include contact_groups in addition to email. This triggers the action group (defined in the Action group for cost alerts section below), which can create tickets, fire webhooks, or run automation. For the lower thresholds, email is enough.

Resource group budget

For workload-specific budgets, you can scope to a resource group and optionally filter by resource type:

resource "azurerm_consumption_budget_resource_group" "workload" {
  name              = "budget-${azurerm_resource_group.main.name}"
  resource_group_id = azurerm_resource_group.main.id
  amount            = var.workload_budget
  time_grain        = "Monthly"

  time_period {
    start_date = formatdate("YYYY-MM-01'T'00:00:00Z", timestamp())
    end_date   = "2030-12-31T00:00:00Z"
  }

  notification {
    enabled        = true
    threshold      = 80
    operator       = "GreaterThan"
    threshold_type = "Actual"
    contact_emails = var.workload_owner_emails
  }

  notification {
    enabled        = true
    threshold      = 100
    operator       = "GreaterThan"
    threshold_type = "Actual"
    contact_emails = var.workload_owner_emails
    contact_groups = [azurerm_monitor_action_group.cost_workload.id]
  }

  # Optional: filter by specific resource types
  filter {
    dimension {
      name = "ResourceType"
      values = [
        "Microsoft.Compute/virtualMachines",
        "Microsoft.Storage/storageAccounts",
        "Microsoft.ContainerService/managedClusters"
      ]
    }
  }

  # Same reasoning as the subscription budget: timestamp() changes every run
  lifecycle {
    ignore_changes = [time_period]
  }
}

The filter is optional. Without it, the budget covers everything in the resource group. I use filters when I want a separate budget tracking just the compute or just the storage costs within a resource group, so teams can see which category is driving their spend.
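The filter block also supports tag dimensions, which pairs nicely with the tagging standard later in this post. A hedged sketch, with an illustrative tag name and values:

```hcl
# Sketch: budget that only tracks resources tagged as non-production.
# Assumes the "environment" tag from the tag schema is actually applied.
filter {
  tag {
    name   = "environment"
    values = ["dev", "sandbox"]
  }
}
```

This lets you put a tighter budget on dev and sandbox spend inside a shared resource group without splitting the infrastructure.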

Action group for cost alerts

This is what turns a budget alert from “email nobody reads” into “ticket in ServiceNow and automation that reacts”:

resource "azurerm_monitor_action_group" "cost_critical" {
  name                = "ag-cost-critical"
  resource_group_name = azurerm_resource_group.management.name
  short_name          = "costcrit"

  email_receiver {
    name          = "finance-team"
    email_address = "finance@company.com"
  }

  email_receiver {
    name          = "platform-team"
    email_address = "platform@company.com"
  }

  # Webhook to ticketing system
  webhook_receiver {
    name        = "servicenow"
    service_uri = var.servicenow_webhook_url
  }

  # Logic App for automated actions
  logic_app_receiver {
    name                    = "cost-automation"
    resource_id             = azurerm_logic_app_workflow.cost_automation.id
    callback_url            = azurerm_logic_app_trigger_http_request.cost.callback_url
    use_common_alert_schema = true
  }
}

The Logic App receiver is where it gets interesting. You can build automation that reacts to cost alerts: shut down dev VMs, scale down non-prod AKS clusters, or at minimum create an incident ticket with all the context attached.
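The Logic App itself isn't defined above, so for completeness here is a minimal sketch of the two resources the receiver references. The workflow body (what the automation actually does) is left empty, and the trigger schema is a placeholder; in practice you'd model it on the common alert schema payload the action group posts:

```hcl
# Sketch: minimal Logic App wiring for the action group receiver above.
resource "azurerm_logic_app_workflow" "cost_automation" {
  name                = "logic-cost-automation"
  location            = azurerm_resource_group.management.location
  resource_group_name = azurerm_resource_group.management.name
}

resource "azurerm_logic_app_trigger_http_request" "cost" {
  name         = "cost-alert"
  logic_app_id = azurerm_logic_app_workflow.cost_automation.id
  # Placeholder schema; replace with the common alert schema definition.
  schema = jsonencode({ type = "object" })
}
```

The actions inside the workflow (deallocate VMs, call the ticketing API) are usually easier to author in the Logic App designer and then export, since Terraform's workflow action resources get verbose quickly.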

Cost allocation with tags

“Who spent $50,000 last month?” Without tags, you’ll never answer that question.

I’ve seen organizations with hundreds of subscriptions and no tagging standard. Cost reviews turn into detective work where nobody can figure out which team or project is responsible for the spike.

Tag schema

Here’s the tagging standard I typically implement:

Required Tags:
├── cost-center: "CC-12345" (finance code)
├── owner: "team-name" or "user@company.com"
├── project: "project-name"
├── environment: "prod|staging|dev|sandbox"
└── application: "app-name"
Optional Tags:
├── created-by: "terraform|manual|pipeline"
├── created-date: "2026-01-15"
├── expiry-date: "2026-12-31" (for temp resources)
└── data-classification: "public|internal|confidential"

Don’t overlook expiry-date. I use it for project-specific resources, dev environments, and anything temporary. A scheduled query can find resources past their expiry date and flag them for cleanup. This single tag has saved more money in my environments than most optimization recommendations.

Enforcing tags with policy

Tags are only useful if they’re consistent. That means policy enforcement. If you’re new to Azure Policy, my governance framework post covers the fundamentals. Here, we need two policies: one that denies resources without required tags, and another that inherits tags from the resource group down to child resources:

# Require cost allocation tags
resource "azurerm_policy_definition" "require_cost_tags" {
  name         = "require-cost-allocation-tags"
  policy_type  = "Custom"
  mode         = "Indexed"
  display_name = "Require Cost Allocation Tags"

  metadata = jsonencode({
    category = "Tags"
  })

  policy_rule = jsonencode({
    if = {
      anyOf = [
        {
          field  = "tags['cost-center']"
          exists = "false"
        },
        {
          field  = "tags['owner']"
          exists = "false"
        },
        {
          field  = "tags['project']"
          exists = "false"
        }
      ]
    }
    then = {
      effect = "deny"
    }
  })
}

# Inherit tags from resource group
resource "azurerm_policy_definition" "inherit_tags" {
  name         = "inherit-tag-from-rg"
  policy_type  = "Custom"
  mode         = "Indexed"
  display_name = "Inherit Tag from Resource Group"

  parameters = jsonencode({
    tagName = {
      type = "String"
      metadata = {
        displayName = "Tag Name"
        description = "Name of the tag to inherit"
      }
    }
  })

  policy_rule = jsonencode({
    if = {
      allOf = [
        {
          field  = "[concat('tags[', parameters('tagName'), ']')]"
          exists = "false"
        },
        {
          value     = "[resourceGroup().tags[parameters('tagName')]]"
          notEquals = ""
        }
      ]
    }
    then = {
      effect = "modify"
      details = {
        # Tag Contributor - matches the role assignment further down
        roleDefinitionIds = [
          "/providers/Microsoft.Authorization/roleDefinitions/4a9ae827-6dc8-4573-8ac7-8239d42aa03f"
        ]
        operations = [
          {
            operation = "addOrReplace"
            field     = "[concat('tags[', parameters('tagName'), ']')]"
            value     = "[resourceGroup().tags[parameters('tagName')]]"
          }
        ]
      }
    }
  })
}

The tag inheritance policy is the one that saves people the most frustration. Teams set tags on the resource group, and the modify effect automatically copies them down to individual resources. No more “we tagged the RG but Cost Analysis shows untagged resources.”

Policy initiative

Bundle the require and inherit policies into a single initiative. If you’re managing policies at scale, consider using EPAC (Enterprise Policy as Code) to version-control and deploy your policy definitions:

resource "azurerm_policy_set_definition" "cost_tags" {
  name         = "cost-tags-initiative"
  policy_type  = "Custom"
  display_name = "Cost Allocation Tags Initiative"

  policy_definition_reference {
    policy_definition_id = azurerm_policy_definition.require_cost_tags.id
  }

  dynamic "policy_definition_reference" {
    for_each = ["cost-center", "owner", "project", "environment"]
    content {
      policy_definition_id = azurerm_policy_definition.inherit_tags.id
      parameter_values = jsonencode({
        tagName = { value = policy_definition_reference.value }
      })
    }
  }
}

resource "azurerm_management_group_policy_assignment" "cost_tags" {
  name                 = "cost-tags-assignment"
  management_group_id  = azurerm_management_group.landing_zones.id
  policy_definition_id = azurerm_policy_set_definition.cost_tags.id
  location             = "westeurope"

  identity {
    type = "SystemAssigned"
  }
}

The modify effect in the tag inheritance policy requires a managed identity with write permissions on tags. Add a role assignment so the policy can actually apply changes:

resource "azurerm_role_assignment" "cost_tags_tag_contributor" {
  scope                = azurerm_management_group.landing_zones.id
  role_definition_name = "Tag Contributor"
  principal_id         = azurerm_management_group_policy_assignment.cost_tags.identity[0].principal_id
}

Assign this at the Landing Zones management group and every subscription underneath gets consistent tagging. If you’re just getting started with tagging, consider using audit instead of deny first to understand your current state before you start blocking deployments.
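One way to make that audit-to-deny transition painless is to parameterize the effect in the policy definition, so flipping to enforcement is an assignment parameter change rather than a policy redeploy. A hedged sketch of the relevant pieces:

```hcl
# In the policy definition: expose the effect as a parameter
# instead of hardcoding "deny" in the policy rule.
parameters = jsonencode({
  effect = {
    type          = "String"
    allowedValues = ["Audit", "Deny"]
    defaultValue  = "Audit"
  }
})

# In the policy rule, reference the parameter:
#   then = {
#     effect = "[parameters('effect')]"
#   }
```

Start with the default of Audit, review the compliance results, and switch the assignment to Deny once the noise is gone. If you bundle the policy into the initiative, remember to surface the parameter at the initiative level too.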

Anomaly detection

Budgets catch predictable overspend. Anomaly detection catches the unexpected stuff: a developer who forgot to shut down a GPU VM over the weekend, an autoscaler that went haywire, or a storage account with runaway egress.

Built-in anomaly alerts

Azure Cost Management has built-in anomaly detection. You can configure it via the portal under Cost Management > Cost alerts, or automate it with azapi:

resource "azapi_resource" "cost_anomaly_alert" {
  type      = "Microsoft.CostManagement/scheduledActions@2023-11-01"
  name      = "cost-anomaly-alert"
  parent_id = data.azurerm_subscription.current.id

  body = jsonencode({
    kind = "InsightAlert"
    properties = {
      displayName = "Daily Cost Anomaly Alert"
      status      = "Enabled"
      viewId      = "/subscriptions/${data.azurerm_subscription.current.subscription_id}/providers/Microsoft.CostManagement/views/ms:DailyAnomalyByResourceGroup"
      schedule = {
        frequency  = "Daily"
        hourOfDay  = 8
        daysOfWeek = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
        startDate  = formatdate("YYYY-MM-DD'T'00:00:00Z", timestamp())
        endDate    = "2030-12-31T00:00:00Z"
      }
      notification = {
        to      = var.cost_alert_emails
        subject = "Azure Cost Anomaly Detected"
      }
    }
  })
}

This uses the azapi provider for scheduled actions. However, if you only need anomaly alerts (not scheduled reports), the azurerm provider now has a native resource that’s simpler to manage:

resource "azurerm_cost_anomaly_alert" "main" {
  name            = "cost-anomaly-alert"
  display_name    = "Cost Anomaly Alert"
  email_subject   = "Azure Cost Anomaly Detected"
  email_addresses = var.cost_alert_emails
}

Use azurerm_cost_anomaly_alert when you can. Fall back to azapi only if you need scheduled cost reports or custom view IDs.

Custom anomaly detection with Log Analytics

The built-in alerts are good for catching obvious spikes. For more control, export cost data to a storage account and build custom KQL queries:

resource "azurerm_subscription_cost_management_export" "daily" {
  name                         = "daily-cost-export"
  subscription_id              = data.azurerm_subscription.current.id
  recurrence_type              = "Daily"
  recurrence_period_start_date = formatdate("YYYY-MM-DD'T'00:00:00Z", timestamp())
  recurrence_period_end_date   = "2030-12-31T00:00:00Z"

  export_data_storage_location {
    container_id     = azurerm_storage_container.cost_exports.resource_manager_id
    root_folder_path = "daily"
  }

  export_data_options {
    type       = "ActualCost"
    time_frame = "MonthToDate"
  }
}

The export above writes CSV files to a storage account. To query this data with KQL, you need an ingestion pipeline (e.g., Azure Data Factory or a Logic App) that loads the exported CSVs into a Log Analytics workspace custom table. Once that pipeline is running, you can alert on cost spikes per resource group:

resource "azurerm_monitor_scheduled_query_rules_alert_v2" "cost_spike" {
  name                 = "alert-cost-spike"
  resource_group_name  = azurerm_resource_group.management.name
  location             = azurerm_resource_group.management.location
  evaluation_frequency = "P1D"
  window_duration      = "P1D"
  scopes               = [azurerm_log_analytics_workspace.platform.id]
  severity             = 2

  criteria {
    query                   = <<-QUERY
      // Custom cost data ingested from exports.
      // Field names depend on your ingestion pipeline. Azure cost exports use
      // CostInBillingCurrency and ResourceGroupName as column headers.
      CostData_CL
      | where TimeGenerated > ago(1d)
      | summarize TodayCost = sum(CostInBillingCurrency_d) by ResourceGroupName_s
      | join kind=inner (
          CostData_CL
          | where TimeGenerated > ago(8d) and TimeGenerated < ago(1d)
          | summarize AvgCost = avg(CostInBillingCurrency_d) by ResourceGroupName_s
        ) on ResourceGroupName_s
      | where TodayCost > AvgCost * 1.5 // 50% spike
      | project ResourceGroupName_s, TodayCost, AvgCost, Increase = (TodayCost - AvgCost) / AvgCost * 100
    QUERY
    time_aggregation_method = "Count"
    threshold               = 0
    operator                = "GreaterThan"
  }

  action {
    action_groups = [azurerm_monitor_action_group.cost_critical.id]
  }
}

The query compares today’s cost per resource group against the 7-day average. Anything with a 50%+ spike gets flagged. The _d and _s suffixes are Log Analytics type indicators (double and string) that get appended when you ingest CSV data into a custom table.

You can adjust the multiplier based on how noisy your environment is. I’ve found 1.5x works well for production subscriptions, while dev subscriptions might need 2x or higher because spend patterns are less predictable.

Budget automation module

All the budget code above is fine for a single subscription. But if you’re running landing zone vending, you need a reusable module:

modules/budget/main.tf
variable "name" { type = string }
variable "scope_id" { type = string }

variable "scope_type" {
  type = string
  validation {
    condition     = contains(["subscription", "resource_group"], var.scope_type)
    error_message = "Scope type must be 'subscription' or 'resource_group'."
  }
}

variable "amount" { type = number }
variable "alert_emails" { type = list(string) }
variable "action_group_id" { type = string }

variable "thresholds" {
  type = list(object({
    threshold      = number
    threshold_type = string # "Actual" or "Forecasted"
  }))
  validation {
    condition     = length(var.thresholds) <= 5
    error_message = "Azure supports a maximum of 5 notification thresholds per budget."
  }
  default = [
    { threshold = 50, threshold_type = "Forecasted" },
    { threshold = 80, threshold_type = "Actual" },
    { threshold = 100, threshold_type = "Actual" },
    { threshold = 120, threshold_type = "Actual" }
  ]
}

locals {
  notifications = [
    for t in var.thresholds : {
      enabled        = true
      threshold      = t.threshold
      operator       = "GreaterThan"
      threshold_type = t.threshold_type
      contact_emails = var.alert_emails
      contact_groups = t.threshold >= 100 ? [var.action_group_id] : []
    }
  ]
}

resource "azurerm_consumption_budget_subscription" "this" {
  count           = var.scope_type == "subscription" ? 1 : 0
  name            = var.name
  subscription_id = var.scope_id
  amount          = var.amount
  time_grain      = "Monthly"

  time_period {
    start_date = formatdate("YYYY-MM-01'T'00:00:00Z", timestamp())
  }

  dynamic "notification" {
    for_each = local.notifications
    content {
      enabled        = notification.value.enabled
      threshold      = notification.value.threshold
      operator       = notification.value.operator
      threshold_type = notification.value.threshold_type
      contact_emails = notification.value.contact_emails
      contact_groups = notification.value.contact_groups
    }
  }

  lifecycle {
    ignore_changes = [time_period]
  }
}

resource "azurerm_consumption_budget_resource_group" "this" {
  count             = var.scope_type == "resource_group" ? 1 : 0
  name              = var.name
  resource_group_id = var.scope_id
  amount            = var.amount
  time_grain        = "Monthly"

  time_period {
    start_date = formatdate("YYYY-MM-01'T'00:00:00Z", timestamp())
  }

  dynamic "notification" {
    for_each = local.notifications
    content {
      enabled        = notification.value.enabled
      threshold      = notification.value.threshold
      operator       = notification.value.operator
      threshold_type = notification.value.threshold_type
      contact_emails = notification.value.contact_emails
      contact_groups = notification.value.contact_groups
    }
  }

  lifecycle {
    ignore_changes = [time_period]
  }
}

The count trick with scope_type lets you use the same module for both subscription and resource group budgets. The thresholds variable has sensible defaults but can be overridden per landing zone if some workloads need different alert levels.
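The resource group variant works the same way; only scope_type and scope_id change. A sketch with illustrative names, reusing the workload variables and action group from earlier:

```hcl
module "workload_budget" {
  source          = "./modules/budget"
  name            = "budget-rg-${var.workload}"
  scope_type      = "resource_group"
  scope_id        = azurerm_resource_group.workload.id
  amount          = var.workload_budget
  alert_emails    = var.workload_owner_emails
  action_group_id = azurerm_monitor_action_group.cost_workload.id
}
```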

Using it in landing zone vending

module "landing_zone" {
  source = "./modules/landing-zone"
  # ... other config
}

module "landing_zone_budget" {
  source          = "./modules/budget"
  name            = "budget-${var.workload}-${var.environment}"
  scope_type      = "subscription"
  scope_id        = module.landing_zone.subscription_id
  amount          = var.monthly_budget
  alert_emails    = var.cost_owners
  action_group_id = azurerm_monitor_action_group.cost_critical.id
}

No subscription without cost controls. This should be step 4 in every landing zone vending process, right after governance policies, RBAC, and networking.

Where to go from here

The patterns in this post give you a solid foundation: budgets with multiple thresholds, tag enforcement through policy, anomaly detection, and a reusable module for landing zone vending. But cost management is an ongoing practice, not a one-time setup.

Once your budgets and tags are in place, the next steps are typically:

  • Build a cost dashboard in Power BI or Azure Workbooks that pulls from your daily exports. Give each team a view filtered to their cost center.
  • Add cost gates to your CI/CD pipelines. Tools like Infracost can estimate the cost impact of Terraform changes before they’re applied.
  • Schedule weekly cost reviews. Monthly reviews are post-mortems. Weekly reviews are still actionable. Even a 15-minute standup looking at the top 5 cost movers keeps teams honest.
  • Explore the Azure subnet calculator if you’re designing network architectures alongside your cost planning.

The FinOps Foundation’s FinOps Framework is worth reading if you want to formalize this into a practice across your organization.


Sources

  1. Microsoft, “Azure Cost Management,” Azure Documentation, https://learn.microsoft.com/azure/cost-management-billing/costs/overview-cost-management

  2. Microsoft, “Create and Manage Budgets,” Azure Documentation, https://learn.microsoft.com/azure/cost-management-billing/costs/tutorial-acm-create-budgets

  3. Microsoft, “Cost Anomaly Alerts,” Azure Documentation, https://learn.microsoft.com/azure/cost-management-billing/understand/analyze-unexpected-charges

  4. Microsoft, “Organize Resources with Tags,” Azure Documentation, https://learn.microsoft.com/azure/azure-resource-manager/management/tag-resources

  5. FinOps Foundation, “FinOps Framework,” https://www.finops.org/framework/

  6. Microsoft, “Cost Management in CAF,” CAF Documentation, https://learn.microsoft.com/azure/cloud-adoption-framework/govern/cost-management/