# Why cost management fails without budgets
I’ve worked in Azure environments where nobody knew the monthly bill until finance sent an angry email. No budgets, no alerts, no cost allocation. Just a shared credit card and hope.
Cloud costs are like water damage. By the time you notice, it’s already expensive.
Azure Cost Management gives you the tools to understand and control spend. But those tools only work if you set them up before the bill arrives. In my experience, most organizations don’t have budgets on their subscriptions. The ones that do often only alert at 100%, which is too late to do anything about it.
This post covers what I set up in every Azure environment: budgets with multiple thresholds, tag-based cost allocation enforced through policy, anomaly detection, and a reusable Terraform module that deploys all of it automatically with every landing zone.
- Set budgets on every subscription and resource group
- Use tags for cost allocation (mandatory via policy)
- Enable anomaly alerts for early warning
- Automate budget creation in landing zone vending
- Review costs weekly, not monthly
## Azure Cost Management overview
Here’s what Cost Management gives you out of the box:
| Feature | What it does | Scope |
|---|---|---|
| Cost Analysis | Visualize and break down spend | Management Group, Subscription, RG, Resource |
| Budgets | Set spend limits with alerts | Management Group, Subscription, RG |
| Alerts | Budget, anomaly, and credit threshold notifications | Budget-based, Subscription |
| Exports | Automated data export | To Storage Account |
| Advisor | Cost optimization recommendations | All resources |
The data flow is straightforward:
```
Azure Resources (Usage)
        │
        ▼
Cost Management API
        │
        ├── Cost Analysis (dashboards)
        ├── Budgets (alerts)
        ├── Exports (storage/Power BI)
        └── Advisor (recommendations)
```

One thing to keep in mind: cost data has an 8-24 hour ingestion delay. You won’t see today’s spend in real time. Budget alerts and anomaly detection work on the data as it becomes available, so there’s always a lag between resource usage and when an alert fires.
None of this costs extra. Cost Management is free for Azure resources (you only pay if you’re analyzing AWS costs through the same tool).
## Budget implementation
One important thing to understand: budget alerts are notifications only. They don’t stop anyone from deploying resources or spending more money. A budget at 100% doesn’t block deployments. It tells you the money is gone. If you need hard spending limits, you’ll need to combine budgets with Azure Policy (deny expensive SKUs) or custom automation that reacts to alerts.
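As a sketch of what such a hard guardrail could look like, here is a custom policy that denies a few large VM SKUs. The resource name and the SKU list are illustrative, not a recommendation; adapt them to your own cost profile:

```hcl
# Sketch: hard spending guardrail via Azure Policy.
# SKU list and names below are illustrative examples, not a recommendation.
resource "azurerm_policy_definition" "deny_expensive_vm_skus" {
  name         = "deny-expensive-vm-skus"
  policy_type  = "Custom"
  mode         = "Indexed"
  display_name = "Deny Expensive VM SKUs"

  policy_rule = jsonencode({
    if = {
      allOf = [
        {
          field  = "type"
          equals = "Microsoft.Compute/virtualMachines"
        },
        {
          field = "Microsoft.Compute/virtualMachines/sku.name"
          in    = ["Standard_M128ms", "Standard_ND96asr_v4", "Standard_HB120rs_v3"]
        }
      ]
    }
    then = { effect = "deny" }
  })
}
```

A deny policy like this is the only true hard stop; everything else in this post is detection and notification.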
One threshold isn’t enough. I use four:
- 50% forecasted: early warning that you’re trending high. You still have time to act.
- 80% actual: something needs attention. Investigate what’s driving the spend.
- 100% actual: budget hit. Action group fires, ticket gets created.
- 120% actual: budget exceeded. This should wake someone up.
The 50% forecasted alert is the one people skip, and it’s the most useful. Azure projects your spend based on current consumption patterns and warns you before you actually hit the threshold.
### Subscription budget
```hcl
resource "azurerm_consumption_budget_subscription" "main" {
  name            = "budget-${var.subscription_name}"
  subscription_id = data.azurerm_subscription.current.id

  amount     = var.monthly_budget
  time_grain = "Monthly"

  time_period {
    start_date = formatdate("YYYY-MM-01'T'00:00:00Z", timestamp())
    end_date   = "2030-12-31T00:00:00Z"
  }

  # Alert at 50% (forecast)
  notification {
    enabled        = true
    threshold      = 50
    operator       = "GreaterThan"
    threshold_type = "Forecasted"

    contact_emails = var.cost_alert_emails
  }

  # Alert at 80% (actual)
  notification {
    enabled        = true
    threshold      = 80
    operator       = "GreaterThan"
    threshold_type = "Actual"

    contact_emails = var.cost_alert_emails
  }

  # Alert at 100% (actual)
  notification {
    enabled        = true
    threshold      = 100
    operator       = "GreaterThan"
    threshold_type = "Actual"

    contact_emails = var.cost_alert_emails
    contact_groups = [
      azurerm_monitor_action_group.cost_critical.id
    ]
  }

  # Alert at 120% (actual) - budget exceeded
  notification {
    enabled        = true
    threshold      = 120
    operator       = "GreaterThan"
    threshold_type = "Actual"

    contact_emails = var.cost_alert_emails
    contact_groups = [
      azurerm_monitor_action_group.cost_critical.id
    ]
  }

  lifecycle {
    ignore_changes = [time_period]
  }
}
```

The `lifecycle` block is important. Without `ignore_changes` on `time_period`, Terraform would try to update the start date on every apply, since `timestamp()` changes each run.
Tip: If you’re on Terraform 1.5+, consider using `plantimestamp()` instead of `timestamp()`. It returns the same value throughout the entire plan, which makes plan output more predictable and avoids unnecessary diffs in other resources that reference the same timestamp.
Notice that the 100% and 120% thresholds include `contact_groups` in addition to email. This triggers the action group (defined in the Action group for cost alerts section below), which can create tickets, fire webhooks, or run automation. For the lower thresholds, email is enough.
### Resource group budget
For workload-specific budgets, you can scope to a resource group and optionally filter by resource type:
```hcl
resource "azurerm_consumption_budget_resource_group" "workload" {
  name              = "budget-${azurerm_resource_group.main.name}"
  resource_group_id = azurerm_resource_group.main.id

  amount     = var.workload_budget
  time_grain = "Monthly"

  time_period {
    start_date = formatdate("YYYY-MM-01'T'00:00:00Z", timestamp())
    end_date   = "2030-12-31T00:00:00Z"
  }

  notification {
    enabled        = true
    threshold      = 80
    operator       = "GreaterThan"
    threshold_type = "Actual"

    contact_emails = var.workload_owner_emails
  }

  notification {
    enabled        = true
    threshold      = 100
    operator       = "GreaterThan"
    threshold_type = "Actual"

    contact_emails = var.workload_owner_emails
    contact_groups = [azurerm_monitor_action_group.cost_workload.id]
  }

  # Optional: filter by specific resource types
  filter {
    dimension {
      name = "ResourceType"
      values = [
        "Microsoft.Compute/virtualMachines",
        "Microsoft.Storage/storageAccounts",
        "Microsoft.ContainerService/managedClusters"
      ]
    }
  }

  # Same timestamp() caveat as the subscription budget
  lifecycle {
    ignore_changes = [time_period]
  }
}
```

The filter is optional. Without it, the budget covers everything in the resource group. I use filters when I want a separate budget tracking just the compute or just the storage costs within a resource group, so teams can see which category is driving their spend.
### Action group for cost alerts
This is what turns a budget alert from “email nobody reads” into “ticket in ServiceNow and automation that reacts”:
```hcl
resource "azurerm_monitor_action_group" "cost_critical" {
  name                = "ag-cost-critical"
  resource_group_name = azurerm_resource_group.management.name
  short_name          = "costcrit"

  email_receiver {
    name          = "finance-team"
    email_address = "finance@company.com"
  }

  email_receiver {
    name          = "platform-team"
    email_address = "platform@company.com"
  }

  # Webhook to ticketing system
  webhook_receiver {
    name        = "servicenow"
    service_uri = var.servicenow_webhook_url
  }

  # Logic App for automated actions
  logic_app_receiver {
    name                    = "cost-automation"
    resource_id             = azurerm_logic_app_workflow.cost_automation.id
    callback_url            = azurerm_logic_app_trigger_http_request.cost.callback_url
    use_common_alert_schema = true
  }
}
```

The Logic App receiver is where it gets interesting. You can build automation that reacts to cost alerts: shut down dev VMs, scale down non-prod AKS clusters, or at minimum create an incident ticket with all the context attached.
## Cost allocation with tags
“Who spent $50,000 last month?” Without tags, you’ll never answer that question.
I’ve seen organizations with hundreds of subscriptions and no tagging standard. Cost reviews turn into detective work where nobody can figure out which team or project is responsible for the spike.
### Tag schema
Here’s the tagging standard I typically implement:
```
Required Tags:
├── cost-center: "CC-12345" (finance code)
├── owner: "team-name" or "user@company.com"
├── project: "project-name"
├── environment: "prod|staging|dev|sandbox"
└── application: "app-name"

Optional Tags:
├── created-by: "terraform|manual|pipeline"
├── created-date: "2026-01-15"
├── expiry-date: "2026-12-31" (for temp resources)
└── data-classification: "public|internal|confidential"
```

Don’t overlook `expiry-date`. I use it for project-specific resources, dev environments, and anything temporary. A scheduled query can find resources past their expiry date and flag them for cleanup. This single tag has saved more money in my environments than most optimization recommendations.
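That cleanup query can be a simple Azure Resource Graph query. A sketch, assuming `expiry-date` values use the ISO date format from the schema above:

```kusto
// Sketch: find resources whose expiry-date tag is in the past.
// Assumes expiry-date is formatted like "2026-12-31" as in the schema above.
resources
| where isnotempty(tags['expiry-date'])
| extend expiry = todatetime(tags['expiry-date'])
| where expiry < now()
| project name, resourceGroup, subscriptionId, expiry, owner = tostring(tags['owner'])
```

Run it on a schedule (Logic App, Automation runbook, or a pipeline job) and route the results to the resource owners from the `owner` tag.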
### Enforcing tags with policy
Tags are only useful if they’re consistent. That means policy enforcement. If you’re new to Azure Policy, my governance framework post covers the fundamentals. Here, we need two policies: one that denies resources without required tags, and another that inherits tags from the resource group down to child resources:
```hcl
# Require cost allocation tags
resource "azurerm_policy_definition" "require_cost_tags" {
  name         = "require-cost-allocation-tags"
  policy_type  = "Custom"
  mode         = "Indexed"
  display_name = "Require Cost Allocation Tags"

  metadata = jsonencode({ category = "Tags" })

  policy_rule = jsonencode({
    if = {
      anyOf = [
        { field = "tags['cost-center']", exists = "false" },
        { field = "tags['owner']", exists = "false" },
        { field = "tags['project']", exists = "false" }
      ]
    }
    then = { effect = "deny" }
  })
}

# Inherit tags from resource group
resource "azurerm_policy_definition" "inherit_tags" {
  name         = "inherit-tag-from-rg"
  policy_type  = "Custom"
  mode         = "Indexed"
  display_name = "Inherit Tag from Resource Group"

  parameters = jsonencode({
    tagName = {
      type = "String"
      metadata = {
        displayName = "Tag Name"
        description = "Name of the tag to inherit"
      }
    }
  })

  policy_rule = jsonencode({
    if = {
      allOf = [
        {
          field  = "[concat('tags[', parameters('tagName'), ']')]"
          exists = "false"
        },
        {
          value     = "[resourceGroup().tags[parameters('tagName')]]"
          notEquals = ""
        }
      ]
    }
    then = {
      effect = "modify"
      details = {
        roleDefinitionIds = [
          "/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c"
        ]
        operations = [
          {
            operation = "addOrReplace"
            field     = "[concat('tags[', parameters('tagName'), ']')]"
            value     = "[resourceGroup().tags[parameters('tagName')]]"
          }
        ]
      }
    }
  })
}
```

The tag inheritance policy is the one that saves people the most frustration. Teams set tags on the resource group, and the `modify` effect automatically copies them down to individual resources. No more “we tagged the RG but Cost Analysis shows untagged resources.”
### Policy initiative
Bundle the require and inherit policies into a single initiative. If you’re managing policies at scale, consider using EPAC (Enterprise Policy as Code) to version-control and deploy your policy definitions:
```hcl
resource "azurerm_policy_set_definition" "cost_tags" {
  name         = "cost-tags-initiative"
  policy_type  = "Custom"
  display_name = "Cost Allocation Tags Initiative"

  policy_definition_reference {
    policy_definition_id = azurerm_policy_definition.require_cost_tags.id
  }

  dynamic "policy_definition_reference" {
    for_each = ["cost-center", "owner", "project", "environment"]
    content {
      policy_definition_id = azurerm_policy_definition.inherit_tags.id
      parameter_values = jsonencode({
        tagName = { value = policy_definition_reference.value }
      })
    }
  }
}

resource "azurerm_management_group_policy_assignment" "cost_tags" {
  name                 = "cost-tags-assignment"
  management_group_id  = azurerm_management_group.landing_zones.id
  policy_definition_id = azurerm_policy_set_definition.cost_tags.id

  identity {
    type = "SystemAssigned"
  }

  location = "westeurope"
}
```

The `modify` effect in the tag inheritance policy requires a managed identity with write permissions on tags. Add a role assignment so the policy can actually apply changes:

```hcl
resource "azurerm_role_assignment" "cost_tags_tag_contributor" {
  scope                = azurerm_management_group.landing_zones.id
  role_definition_name = "Tag Contributor"
  principal_id         = azurerm_management_group_policy_assignment.cost_tags.identity[0].principal_id
}
```

Assign this at the Landing Zones management group and every subscription underneath gets consistent tagging. If you’re just getting started with tagging, consider using `audit` instead of `deny` first to understand your current state before you start blocking deployments.
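One way to stage that rollout is to parameterize the effect, so the same definition can be assigned as audit first and flipped to deny later without rewriting the policy. A sketch (the resource name is illustrative, and the rule is trimmed to two tags for brevity):

```hcl
# Sketch: same require-tags rule as above, but with the effect parameterized
# so an assignment can start at "Audit" and move to "Deny" once compliance
# data looks clean. Resource name is illustrative.
resource "azurerm_policy_definition" "require_cost_tags_staged" {
  name         = "require-cost-tags-staged"
  policy_type  = "Custom"
  mode         = "Indexed"
  display_name = "Require Cost Allocation Tags (staged rollout)"

  parameters = jsonencode({
    effect = {
      type          = "String"
      allowedValues = ["Audit", "Deny"]
      defaultValue  = "Audit"
    }
  })

  policy_rule = jsonencode({
    if = {
      anyOf = [
        { field = "tags['cost-center']", exists = "false" },
        { field = "tags['owner']", exists = "false" }
      ]
    }
    then = { effect = "[parameters('effect')]" }
  })
}
```

Switching from audit to deny then becomes a one-line parameter change on the assignment rather than a new policy definition.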
## Anomaly detection
Budgets catch predictable overspend. Anomaly detection catches the unexpected stuff: a developer who forgot to shut down a GPU VM over the weekend, an autoscaler that went haywire, or a storage account with runaway egress.
### Built-in anomaly alerts
Azure Cost Management has built-in anomaly detection. You can configure it via the portal under Cost Management > Cost alerts, or automate it with azapi:
```hcl
resource "azapi_resource" "cost_anomaly_alert" {
  type      = "Microsoft.CostManagement/scheduledActions@2023-11-01"
  name      = "cost-anomaly-alert"
  parent_id = data.azurerm_subscription.current.id

  body = jsonencode({
    kind = "InsightAlert"
    properties = {
      displayName = "Daily Cost Anomaly Alert"
      status      = "Enabled"
      viewId      = "/subscriptions/${data.azurerm_subscription.current.subscription_id}/providers/Microsoft.CostManagement/views/ms:DailyAnomalyByResourceGroup"
      schedule = {
        frequency  = "Daily"
        hourOfDay  = 8
        daysOfWeek = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
        startDate  = formatdate("YYYY-MM-DD'T'00:00:00Z", timestamp())
        endDate    = "2030-12-31T00:00:00Z"
      }
      notification = {
        to      = var.cost_alert_emails
        subject = "Azure Cost Anomaly Detected"
      }
    }
  })
}
```

This uses the `azapi` provider for scheduled actions. However, if you only need anomaly alerts (not scheduled reports), the `azurerm` provider now has a native resource that’s simpler to manage:

```hcl
resource "azurerm_cost_anomaly_alert" "main" {
  name            = "cost-anomaly-alert"
  display_name    = "Cost Anomaly Alert"
  email_subject   = "Azure Cost Anomaly Detected"
  email_addresses = var.cost_alert_emails
}
```

Use `azurerm_cost_anomaly_alert` when you can. Fall back to `azapi` only if you need scheduled cost reports or custom view IDs.
### Custom anomaly detection with Log Analytics
The built-in alerts are good for catching obvious spikes. For more control, export cost data to a storage account and build custom KQL queries:
```hcl
resource "azurerm_subscription_cost_management_export" "daily" {
  name                         = "daily-cost-export"
  subscription_id              = data.azurerm_subscription.current.id
  recurrence_type              = "Daily"
  recurrence_period_start_date = formatdate("YYYY-MM-DD'T'00:00:00Z", timestamp())
  recurrence_period_end_date   = "2030-12-31T00:00:00Z"

  export_data_storage_location {
    container_id     = azurerm_storage_container.cost_exports.resource_manager_id
    root_folder_path = "daily"
  }

  export_data_options {
    type       = "ActualCost"
    time_frame = "MonthToDate"
  }
}
```

The export above writes CSV files to a storage account. To query this data with KQL, you need an ingestion pipeline (e.g., Azure Data Factory or a Logic App) that loads the exported CSVs into a Log Analytics workspace custom table. Once that pipeline is running, you can alert on cost spikes per resource group:
```hcl
resource "azurerm_monitor_scheduled_query_rules_alert_v2" "cost_spike" {
  name                = "alert-cost-spike"
  resource_group_name = azurerm_resource_group.management.name
  location            = azurerm_resource_group.management.location

  evaluation_frequency = "P1D"
  window_duration      = "P1D"
  scopes               = [azurerm_log_analytics_workspace.platform.id]
  severity             = 2

  criteria {
    query = <<-QUERY
      // Custom cost data ingested from exports.
      // Field names depend on your ingestion pipeline. Azure cost exports use
      // CostInBillingCurrency and ResourceGroupName as column headers.
      CostData_CL
      | where TimeGenerated > ago(1d)
      | summarize TodayCost = sum(CostInBillingCurrency_d) by ResourceGroupName_s
      | join kind=inner (
          CostData_CL
          | where TimeGenerated > ago(8d) and TimeGenerated < ago(1d)
          // Average of daily totals, not of individual cost rows
          | summarize DailyCost = sum(CostInBillingCurrency_d) by ResourceGroupName_s, bin(TimeGenerated, 1d)
          | summarize AvgCost = avg(DailyCost) by ResourceGroupName_s
        ) on ResourceGroupName_s
      | where TodayCost > AvgCost * 1.5  // 50% spike
      | project ResourceGroupName_s, TodayCost, AvgCost, Increase = (TodayCost - AvgCost) / AvgCost * 100
    QUERY

    time_aggregation_method = "Count"
    threshold               = 0
    operator                = "GreaterThan"
  }

  action {
    action_groups = [azurerm_monitor_action_group.cost_critical.id]
  }
}
```

The query compares today’s cost per resource group against the average daily cost over the previous seven days. Anything with a 50%+ spike gets flagged. The `_d` and `_s` suffixes are Log Analytics type indicators (double and string) that get appended when you ingest CSV data into a custom table.
You can adjust the multiplier based on how noisy your environment is. I’ve found 1.5x works well for production subscriptions, while dev subscriptions might need 2x or higher because spend patterns are less predictable.
## Budget automation module
All the budget code above is fine for a single subscription. But if you’re running landing zone vending, you need a reusable module:
```hcl
variable "name" {
  type = string
}

variable "scope_id" {
  type = string
}

variable "scope_type" {
  type = string
  validation {
    condition     = contains(["subscription", "resource_group"], var.scope_type)
    error_message = "Scope type must be 'subscription' or 'resource_group'."
  }
}

variable "amount" {
  type = number
}

variable "alert_emails" {
  type = list(string)
}

variable "action_group_id" {
  type = string
}

variable "thresholds" {
  type = list(object({
    threshold      = number
    threshold_type = string # "Actual" or "Forecasted"
  }))
  validation {
    condition     = length(var.thresholds) <= 5
    error_message = "Azure supports a maximum of 5 notification thresholds per budget."
  }
  default = [
    { threshold = 50, threshold_type = "Forecasted" },
    { threshold = 80, threshold_type = "Actual" },
    { threshold = 100, threshold_type = "Actual" },
    { threshold = 120, threshold_type = "Actual" }
  ]
}

locals {
  notifications = [
    for idx, t in var.thresholds : {
      enabled        = true
      threshold      = t.threshold
      operator       = "GreaterThan"
      threshold_type = t.threshold_type
      contact_emails = var.alert_emails
      contact_groups = t.threshold >= 100 ? [var.action_group_id] : []
    }
  ]
}

resource "azurerm_consumption_budget_subscription" "this" {
  count           = var.scope_type == "subscription" ? 1 : 0
  name            = var.name
  subscription_id = var.scope_id

  amount     = var.amount
  time_grain = "Monthly"

  time_period {
    start_date = formatdate("YYYY-MM-01'T'00:00:00Z", timestamp())
  }

  dynamic "notification" {
    for_each = local.notifications
    content {
      enabled        = notification.value.enabled
      threshold      = notification.value.threshold
      operator       = notification.value.operator
      threshold_type = notification.value.threshold_type
      contact_emails = notification.value.contact_emails
      contact_groups = notification.value.contact_groups
    }
  }

  lifecycle {
    ignore_changes = [time_period]
  }
}

resource "azurerm_consumption_budget_resource_group" "this" {
  count             = var.scope_type == "resource_group" ? 1 : 0
  name              = var.name
  resource_group_id = var.scope_id

  amount     = var.amount
  time_grain = "Monthly"

  time_period {
    start_date = formatdate("YYYY-MM-01'T'00:00:00Z", timestamp())
  }

  dynamic "notification" {
    for_each = local.notifications
    content {
      enabled        = notification.value.enabled
      threshold      = notification.value.threshold
      operator       = notification.value.operator
      threshold_type = notification.value.threshold_type
      contact_emails = notification.value.contact_emails
      contact_groups = notification.value.contact_groups
    }
  }

  lifecycle {
    ignore_changes = [time_period]
  }
}
```

The `count` trick with `scope_type` lets you use the same module for both subscription and resource group budgets. The `thresholds` variable has sensible defaults but can be overridden per landing zone if some workloads need different alert levels.
### Using it in landing zone vending
```hcl
module "landing_zone" {
  source = "./modules/landing-zone"
  # ... other config
}

module "landing_zone_budget" {
  source = "./modules/budget"

  name            = "budget-${var.workload}-${var.environment}"
  scope_type      = "subscription"
  scope_id        = module.landing_zone.subscription_id
  amount          = var.monthly_budget
  alert_emails    = var.cost_owners
  action_group_id = azurerm_monitor_action_group.cost_critical.id
}
```

No subscription without cost controls. This should be step 4 in every landing zone vending process, right after governance policies, RBAC, and networking.
## Where to go from here
The patterns in this post give you a solid foundation: budgets with multiple thresholds, tag enforcement through policy, anomaly detection, and a reusable module for landing zone vending. But cost management is an ongoing practice, not a one-time setup.
Once your budgets and tags are in place, the next steps are typically:
- Build a cost dashboard in Power BI or Azure Workbooks that pulls from your daily exports. Give each team a view filtered to their cost center.
- Add cost gates to your CI/CD pipelines. Tools like Infracost can estimate the cost impact of Terraform changes before they’re applied.
- Schedule weekly cost reviews. Monthly reviews are post-mortems. Weekly reviews are still actionable. Even a 15-minute standup looking at the top 5 cost movers keeps teams honest.
- Explore the Azure subnet calculator if you’re designing network architectures alongside your cost planning.
The FinOps Foundation’s FinOps Framework is worth reading if you want to formalize this into a practice across your organization.
## Sources
- Microsoft, “Azure Cost Management,” Azure Documentation, https://learn.microsoft.com/azure/cost-management-billing/costs/overview-cost-management
- Microsoft, “Create and Manage Budgets,” Azure Documentation, https://learn.microsoft.com/azure/cost-management-billing/costs/tutorial-acm-create-budgets
- Microsoft, “Cost Anomaly Alerts,” Azure Documentation, https://learn.microsoft.com/azure/cost-management-billing/understand/analyze-unexpected-charges
- Microsoft, “Organize Resources with Tags,” Azure Documentation, https://learn.microsoft.com/azure/azure-resource-manager/management/tag-resources
- FinOps Foundation, “FinOps Framework,” https://www.finops.org/framework/
- Microsoft, “Cost Management in CAF,” CAF Documentation, https://learn.microsoft.com/azure/cloud-adoption-framework/govern/cost-management/