The problem with raw resources

I spent the last few days building three Terraform modules: a Key Vault, a Windows VM, and an Event Hub module. Three different resource types, three different use cases, but the same set of problems kept coming up.

Every time someone writes azurerm_key_vault directly in their Terraform config, they have to remember: enable purge protection, set the network ACL default to deny, use RBAC instead of legacy access policies, set the minimum TLS version, configure diagnostic settings. Miss any one of those and you’ve got a vault that technically works but doesn’t meet your security baseline.

Multiply that across every team deploying infrastructure, and you get drift. Not the “Terraform state doesn’t match reality” kind of drift. The “every vault is configured slightly differently and nobody knows which ones are actually secure” kind.

That’s why modules exist. Not as thin wrappers around a single resource, but as opinionated building blocks that encode your organization’s decisions about how infrastructure should be configured.

Security-first defaults

The most important design decision in all three modules was the same: make the secure option the default, and make the insecure option require explicit opt-in.

For the Key Vault module, that means:

variable "public_network_access_enabled" {
  type    = bool
  default = false
}

variable "rbac_authorization_enabled" {
  type    = bool
  default = true
}

variable "purge_protection_enabled" {
  type    = bool
  default = true
}

variable "soft_delete_retention_days" {
  type    = number
  default = 90
}

And in the Key Vault resource itself, the network ACL starts with deny-all:

network_acls {
  default_action             = "Deny"
  bypass                     = var.network_acls_bypass
  ip_rules                   = var.network_acls_ip_rules
  virtual_network_subnet_ids = var.network_acls_virtual_network_subnet_ids
}

Nobody has to remember to set these. They’re the defaults. If someone needs public access, they set public_network_access_enabled = true and that shows up in the code review as an explicit decision. The same pattern applies to the Windows VM module with encryption_at_host_enabled = true and identity_type = "SystemAssigned", and the Event Hub module with public_network_access_enabled = false and minimum_tls_version = "1.2".
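As a sketch, the corresponding defaults in the other two modules might look like this (the variable names match the ones mentioned above; the exact shape in your own modules may differ):

```hcl
# Windows VM module: secure-by-default compute settings.
variable "encryption_at_host_enabled" {
  type    = bool
  default = true
}

variable "identity_type" {
  type    = string
  default = "SystemAssigned"
}

# Event Hub module: locked-down namespace settings.
variable "public_network_access_enabled" {
  type    = bool
  default = false
}

variable "minimum_tls_version" {
  type    = string
  default = "1.2"
}
```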

The goal isn’t to prevent teams from making exceptions. It’s to make exceptions visible.

Input validation that fails early

Terraform’s validation blocks and precondition lifecycle rules are underused. Most modules I’ve seen either have no validation or validate things the Azure API would already reject. The useful validations are the ones that catch logical errors before the plan even runs.

Here’s an example from the Event Hub module. Auto-inflate only works on the Standard SKU, and the maximum throughput units need to be at least as high as the base capacity:

resource "azurerm_eventhub_namespace" "this" {
  # ...

  lifecycle {
    precondition {
      condition     = !var.auto_inflate_enabled || var.sku == "Standard"
      error_message = "auto_inflate_enabled can only be true when sku is 'Standard'."
    }

    precondition {
      condition     = !var.auto_inflate_enabled || var.maximum_throughput_units >= var.capacity
      error_message = "maximum_throughput_units must be >= capacity when auto-inflate is enabled."
    }
  }
}

For the Key Vault, I enforce mutual exclusivity between RBAC and legacy access policies. You can use one or the other, not both:

lifecycle {
  precondition {
    condition     = !var.rbac_authorization_enabled || length(var.access_policies) == 0
    error_message = "access_policies must be empty when rbac_authorization_enabled is true."
  }

  precondition {
    condition     = var.rbac_authorization_enabled || length(var.role_assignments) == 0
    error_message = "role_assignments must be empty when rbac_authorization_enabled is false."
  }
}

Without these, someone would get a confusing Azure API error 20 minutes into an apply. With them, they get a clear message at plan time.

Variable-level validation catches simpler things like naming rules. Key Vault names, for example, have strict constraints:

variable "name" {
  type        = string
  description = "The name of the Key Vault."

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,22}[a-z0-9]$", var.name))
    error_message = "Name must be 3-24 characters: lowercase letters, digits, and hyphens, starting with a letter and ending with a letter or digit."
  }
}

Conditional resources with count and for_each

Not every deployment needs a private endpoint or diagnostic settings. But when they do, the module should handle it cleanly. I use count for single optional resources and for_each for collections.

Private endpoints follow the same pattern across all three modules:

resource "azurerm_private_endpoint" "this" {
  count = var.private_endpoint != null ? 1 : 0

  name                = "pe-${var.name}"
  location            = var.location
  resource_group_name = local.resource_group_name
  subnet_id           = var.private_endpoint.subnet_id

  private_service_connection {
    name                           = "psc-${var.name}"
    private_connection_resource_id = azurerm_key_vault.this.id
    subresource_names              = ["vault"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [var.private_endpoint.private_dns_zone_id]
  }
}

var.private_endpoint is an object that defaults to null. When it's null, the resource isn't created. When it's provided, the caller passes the subnet and DNS zone IDs, and the module handles the rest. The naming convention (pe-, psc-) is consistent across modules.
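The corresponding variable can be declared as a nullable object; a sketch, with the attribute names used above:

```hcl
variable "private_endpoint" {
  type = object({
    subnet_id           = string
    private_dns_zone_id = string
  })
  default     = null
  description = "When set, a private endpoint is created in the given subnet."
}
```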

For the Event Hub module, Event Hubs themselves are a collection. I use a map variable with for_each:

variable "event_hubs" {
  type = map(object({
    partition_count   = optional(number, 4)
    message_retention = optional(number, 7)
    consumer_groups   = optional(list(string), [])
    authorization_rules = optional(list(object({
      name   = string
      send   = optional(bool, false)
      listen = optional(bool, false)
      manage = optional(bool, false)
    })), [])
  }))
  default = {}
}

The optional() function with defaults means callers only need to specify what they want to change. Creating two Event Hubs with different configurations looks like this:

event_hubs = {
  "hub-ingest" = {
    partition_count   = 8
    message_retention = 7
    consumer_groups   = ["processor", "analytics"]
    authorization_rules = [
      { name = "sender", send = true },
      { name = "reader", listen = true }
    ]
  }
  "hub-telemetry" = {
    partition_count = 4
    consumer_groups = ["dashboard"]
  }
}

The tricky part is consumer groups and authorization rules, which are nested under each Event Hub. Terraform’s for_each doesn’t handle nested structures directly, so I flatten them in locals:

locals {
  consumer_groups = {
    for item in flatten([
      for hub_name, hub in var.event_hubs : [
        for cg_name in hub.consumer_groups : {
          key      = "${hub_name}-${cg_name}"
          hub_name = hub_name
          cg_name  = cg_name
        }
      ]
    ]) : item.key => item
  }
}

This creates a flat map keyed by hub_name-cg_name, which for_each can iterate over. Same pattern for authorization rules. It’s verbose, but it gives you stable resource addresses that won’t shift when you add or remove hubs.
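The flattened map then feeds straight into for_each. A sketch of the consumer group resource (attribute names follow the azurerm provider; var.resource_group_name is an assumed module input):

```hcl
resource "azurerm_eventhub_consumer_group" "this" {
  for_each = local.consumer_groups

  name                = each.value.cg_name
  # Look up the parent hub by the hub_name carried through the flattened map.
  eventhub_name       = azurerm_eventhub.this[each.value.hub_name].name
  namespace_name      = azurerm_eventhub_namespace.this.name
  resource_group_name = var.resource_group_name
}
```

Because each key is "${hub_name}-${cg_name}", removing one hub only removes that hub's consumer group addresses; the others stay stable.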

The Windows VM as a module boundary example

The VM module is the most complex of the three because VMs touch everything: compute, storage, networking, identity, monitoring, and optionally Active Directory.

The question I kept asking was: what belongs in the module, and what should the caller handle? I landed on this boundary:

Inside the module: The VM itself, NICs (including multi-NIC), data disks, managed identity, domain join extension, Azure Monitor Agent, diagnostic settings, and data collection rule association.

Outside the module: The subnet, NSG, Log Analytics workspace, data collection rules, and the availability set. The caller creates these and passes IDs to the module.

The reasoning is that network infrastructure and monitoring infrastructure are shared across VMs. If the module created its own subnet or workspace, you’d end up with one per VM, which is wrong.
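In practice, the caller wires the shared infrastructure into the module by ID. A hypothetical call might look like this (the module source path and some input names are illustrative, not the module's actual interface):

```hcl
module "app_vm" {
  source = "./modules/windows-vm"

  name                = "vm-app-01"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name

  # Shared infrastructure created outside the module, passed in by ID.
  subnet_id                  = azurerm_subnet.workload.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
  data_collection_rule_id    = azurerm_monitor_data_collection_rule.vm.id
}
```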

One pattern that came up specifically for the VM was the associate_default_nsg flag:

variable "associate_default_nsg" {
  type        = bool
  default     = false
  description = "Associate the NSG with the default NIC."
}

This exists because network_security_group_id might not be known at plan time (if the NSG is being created in the same apply). Terraform needs to know count and for_each keys at plan time, so I separated the "should we associate" decision from the "what's the NSG ID" value. Without this, you'd get "count depends on resource attributes that cannot be determined until apply" errors.
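A sketch of how the flag decouples the count from the unknown ID (resource and attribute names follow the azurerm provider):

```hcl
resource "azurerm_network_interface_security_group_association" "this" {
  # The boolean is known at plan time even when the NSG ID is not,
  # so Terraform can always resolve the count.
  count = var.associate_default_nsg ? 1 : 0

  network_interface_id      = azurerm_network_interface.this[0].id
  network_security_group_id = var.network_security_group_id
}
```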

Testing with Terratest

Every module has a Go test using Terratest that does three things: init, apply, and apply again (idempotence check). The second apply is the one that catches issues. If Terraform wants to change anything on the second apply, your module has a configuration that doesn’t settle to a stable state.

package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
)

func TestTerraformModule(t *testing.T) {
	t.Parallel()

	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: ".",
	})

	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApplyAndIdempotent(t, terraformOptions)
}

The test infrastructure (VNet, subnets, DNS zones, Log Analytics) lives in tests/main.tf alongside the module call. This means the test exercises the module with all optional features enabled: private endpoints, diagnostic settings, RBAC, and monitoring.

Tests run in GitHub Actions on every pull request. Azure authentication uses federated credentials (OpenID Connect), so there are no stored secrets in the CI pipeline. The whole thing takes about 15-20 minutes per module, mostly waiting for Azure to provision and destroy resources.

Consistent patterns across modules

The three modules share a set of conventions that make them predictable:

Naming: pe-{name} for private endpoints, psc-{name} for private service connections, diag-{name} for diagnostic settings. When you look at resources in the portal, you can immediately tell what created them.

File layout: main.tf for the primary resource, then feature-specific files: network.tf, diagnostics.tf, rbac.tf. Variables and outputs in their own files. This matters when modules grow.

Diagnostic settings: Same structure everywhere. Enable with a boolean, provide a workspace ID. Precondition validates the workspace ID is provided when diagnostics are enabled. The module handles which log categories to send.

Pre-commit hooks: All three modules use the same .pre-commit-config.yaml with terraform fmt, terraform-docs (auto-generates README), trivy (security scanning), tflint, and commitizen for conventional commits.

Semantic versioning: Commitizen + semantic-release means version bumps are automatic based on commit messages. feat: → minor bump, fix: → patch, feat!: → major.
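The diagnostic settings convention above can be sketched like this (log and metric category names vary by resource type; the Key Vault categories here are examples):

```hcl
resource "azurerm_monitor_diagnostic_setting" "this" {
  count = var.diagnostic_settings_enabled ? 1 : 0

  name                       = "diag-${var.name}"
  target_resource_id         = azurerm_key_vault.this.id
  log_analytics_workspace_id = var.log_analytics_workspace_id

  # The module decides which categories to send;
  # callers only supply the workspace ID.
  enabled_log {
    category = "AuditEvent"
  }

  metric {
    category = "AllMetrics"
  }
}
```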

What I’d do differently

If I were starting over, I’d change a few things:

  1. Start with the test. Write tests/main.tf first, defining how you want the module to be called. This forces you to think about the interface before the implementation. I did it the other way around and had to refactor the variable structure twice.

  2. Fewer optional features in v0.1. I added private endpoints, diagnostics, and RBAC all in the first version. Shipping the core resource first and adding optional features incrementally would have been faster and easier to review.

  3. Document the “why” in variable descriptions. My variable descriptions say what the variable does. They should say why the default is what it is. "Encryption at host requires feature registration on the subscription" is more useful than "Enable encryption at host."

These modules plug into a broader landing zone setup. If you’re building something similar, the cloud foundation guide covers the management group and subscription structure, and the governance framework post explains how Azure Policy fits alongside Terraform modules for enforcement. For managing policy across these landing zones, see the EPAC series.

