Everyone wants to jump straight into deploying workloads in Azure. Spin up some VMs, create a few App Services, maybe throw in a Kubernetes cluster. The boring foundation work gets skipped because it doesn’t feel productive. Then six months later you’re dealing with security incidents, cost overruns, and a management nightmare that takes weeks to untangle.

I’ve seen this pattern play out multiple times. Building a proper cloud foundation is not exciting, but it saves you from serious problems down the road. Here’s how to actually do it right.

Why This Matters

Your cloud foundation is the difference between a manageable Azure environment and complete chaos. Without proper governance, every team does their own thing. Without proper connectivity, you end up with a rat’s nest of network configurations. Without proper identity management, you’re playing security whack-a-mole.

The foundation includes governance policies, network topology, identity and access management, security baselines, and management tooling. Get these right at the start and everything else becomes easier. Skip them and you’ll spend months retrofitting basic controls while trying not to break production workloads.

Microsoft has the Cloud Adoption Framework and Azure Landing Zones as references. They’re comprehensive but can be overwhelming. You don’t need to implement everything on day one. Focus on the essentials that prevent the most common problems.

Management Group Hierarchy

Start with your management group structure. This is the top level of your Azure organization hierarchy. Management groups contain subscriptions and let you apply policies and access controls at scale.

A typical structure looks like this:

Tenant Root Group
├── Platform
│ ├── Management
│ ├── Connectivity
│ └── Identity
├── Landing Zones
│ ├── Production
│ └── Non-Production
└── Sandbox

The Platform management group holds your shared infrastructure. Management subscription runs monitoring and automation. Connectivity subscription has your networking hub. Identity subscription handles domain controllers or other identity infrastructure if you need it.

Landing Zones is where actual workloads live. Split by environment at minimum. Some organizations go deeper with business unit or application subdivisions. Don’t over-engineer this. Start simple and refine as needed.

Sandbox is for experimentation. Looser policies, isolated from everything else. Developers need a place to break things without impacting production.
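Policies and role assignments set on a management group are inherited by everything beneath it. A rough way to picture that is to walk from a group up to the root. This is plain Python, not an Azure API; the `PARENT` map just mirrors the tree above and `scope_chain` is a made-up helper name.

```python
# Hypothetical model of the management group tree shown above.
# Each group maps to its parent; the root has no parent.
PARENT = {
    "Tenant Root Group": None,
    "Platform": "Tenant Root Group",
    "Management": "Platform",
    "Connectivity": "Platform",
    "Identity": "Platform",
    "Landing Zones": "Tenant Root Group",
    "Production": "Landing Zones",
    "Non-Production": "Landing Zones",
    "Sandbox": "Tenant Root Group",
}


def scope_chain(group: str) -> list[str]:
    """Return the group and all of its ancestors, nearest first."""
    chain = []
    current = group
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


# Anything assigned to a group in this chain applies to Production.
print(scope_chain("Production"))
```

A policy assigned at Landing Zones reaches Production and Non-Production but never touches Sandbox, which is exactly why Sandbox sits in its own branch.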

Subscription Strategy

Subscriptions are your primary isolation boundary in Azure. Resource limits, billing boundaries, and access control all work at the subscription level. Plan this carefully because moving resources between subscriptions later is painful.

Start with at least these subscriptions:

Management Subscription: Log Analytics workspace, automation accounts, backup vaults. Everything that monitors and manages your other subscriptions lives here.

Connectivity Subscription: Hub VNet, VPN or ExpressRoute gateways, Azure Firewall, DDoS protection. Your central networking infrastructure.

Production Landing Zone: Production workloads. Could be multiple subscriptions if you need separation by application or business unit.

Non-Production Landing Zone: Development, testing, staging environments. Keep this separated from production for cost tracking and security isolation.

Sandbox Subscription: Individual sandbox subscriptions for teams that need freedom to experiment without breaking things or running up huge bills.

Each subscription needs a clear owner and purpose. Document this. When someone asks “which subscription should this go in?” the answer should be obvious.

Azure Policy for Governance

Azure Policy enforces rules across your environment. This is how you prevent people from doing stupid things accidentally, like deploying resources in the wrong region or creating public IP addresses where they shouldn’t exist.

Essential policies to implement:

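{
  "properties": {
    "displayName": "Allowed Locations",
    "policyType": "BuiltIn",
    "mode": "Indexed",
    "description": "Restrict resource deployment to approved regions",
    "parameters": {
      "listOfAllowedLocations": {
        "type": "Array",
        "metadata": {
          "description": "The list of allowed locations for resources"
        }
      }
    },
    "policyRule": {
      "if": {
        "allOf": [
          { "field": "location", "notIn": "[parameters('listOfAllowedLocations')]" },
          { "field": "location", "notEquals": "global" }
        ]
      },
      "then": {
        "effect": "deny"
      }
    }
  }
}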

Apply this at the root management group level. Forces all resources to deploy in your approved regions. Prevents surprise bills from accidentally deploying expensive resources in regions with higher pricing.
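If the policy logic feels abstract, it helps to think of it as a simple membership check. Here is a small Python sketch of that idea. It is not how Azure evaluates policy internally; the function name and the example regions are my own, and the exemption for "global" resources is an assumption about how you would want region-less resources treated.

```python
def evaluate_location(resource_location: str, allowed: list[str]) -> str:
    """Mimic a location restriction: deny anything outside the allowed list."""
    normalized = resource_location.lower()
    # Some resources are not tied to a region and report "global".
    # This sketch assumes those should be exempt from the check.
    if normalized == "global":
        return "allow"
    allowed_set = {region.lower() for region in allowed}
    return "allow" if normalized in allowed_set else "deny"


ALLOWED = ["westeurope", "northeurope"]  # example values, not a recommendation

print(evaluate_location("westeurope", ALLOWED))
print(evaluate_location("eastus", ALLOWED))
```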

Other critical policies:

Require tags: Force resource tagging for cost allocation. Make CostCenter, Environment, and Owner mandatory. Your finance team will thank you.

Restrict VM SKUs: Prevent someone from spinning up massive VMs that cost thousands per month. Whitelist acceptable SKU sizes.

Require encryption: Force encryption at rest for storage accounts, SQL databases, and managed disks. Make it impossible to create unencrypted resources.

Block public endpoints: For production subscriptions, block creation of resources with public IP addresses unless explicitly approved. Force everything through your network hub.

Enforce naming conventions: Use Azure Policy to enforce naming standards. Makes resources easier to identify and manage.
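Tagging and naming rules are easy to prototype before you encode them as policy. A minimal Python sketch, assuming the three mandatory tags from above and a made-up naming pattern of `<type>-<environment>-<workload>` (your convention will differ):

```python
import re

REQUIRED_TAGS = {"CostCenter", "Environment", "Owner"}  # tags named in the text

# Hypothetical convention: <type>-<environment>-<workload>, e.g. "rg-prod-web".
NAME_PATTERN = re.compile(r"[a-z]{2,5}-(prod|nonprod|sandbox)-[a-z0-9]+")


def missing_tags(tags: dict[str, str]) -> set[str]:
    """Return required tags that are absent or have an empty value."""
    return {tag for tag in REQUIRED_TAGS if not tags.get(tag)}


def name_is_valid(name: str) -> bool:
    """True if the whole name matches the convention."""
    return NAME_PATTERN.fullmatch(name) is not None


print(missing_tags({"CostCenter": "1234", "Owner": ""}))
print(name_is_valid("rg-prod-web"))
```

Running checks like these against an export of existing resources gives you a feel for how much cleanup is waiting before you enforce anything.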

Start with a few critical policies in Audit mode. See what would be blocked, communicate with teams, then switch to Deny mode. Rolling out policies in Deny mode from day one tends to create friction and shadow IT workarounds.
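The Audit-then-Deny rollout is worth spelling out, because the only thing that changes is what happens to a non-compliant request. A toy Python model follows. `PolicyAssignment` here is a hypothetical stand-in, not an Azure SDK class.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyAssignment:
    """Hypothetical stand-in for a policy assignment with a switchable effect."""
    name: str
    effect: str  # "audit" or "deny"
    violations: list[str] = field(default_factory=list)

    def evaluate(self, resource_name: str, compliant: bool) -> bool:
        """Return True if the deployment may proceed."""
        if compliant:
            return True
        # Non-compliant resources are always recorded...
        self.violations.append(resource_name)
        # ...but only blocked once the effect is switched to deny.
        return self.effect != "deny"


policy = PolicyAssignment("require-tags", effect="audit")
print(policy.evaluate("vm-untagged", compliant=False))  # proceeds, but is recorded
policy.effect = "deny"
print(policy.evaluate("vm-untagged-2", compliant=False))  # now blocked
```

The `violations` list collected during the audit phase is your communication plan: those are the teams to talk to before flipping the switch.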

Hub and Spoke Network Topology

Your network architecture matters. The hub and spoke topology is the standard pattern for Azure. Hub VNet contains shared services like firewalls, VPN gateways, and DNS. Spoke VNets contain workloads and peer to the hub.

Hub VNet subnets:

Hub VNet (10.0.0.0/16)
├── GatewaySubnet (10.0.0.0/24) - VPN/ExpressRoute gateway
├── AzureFirewallSubnet (10.0.1.0/24) - Azure Firewall
├── Management (10.0.2.0/24) - Jump boxes, management tools
└── SharedServices (10.0.3.0/24) - DNS, NTP, etc

Spoke VNets:

Production Spoke (10.1.0.0/16)
├── Web Tier (10.1.1.0/24)
├── App Tier (10.1.2.0/24)
└── Data Tier (10.1.3.0/24)

Non-Prod Spoke (10.2.0.0/16)
├── Web Tier (10.2.1.0/24)
├── App Tier (10.2.2.0/24)
└── Data Tier (10.2.3.0/24)

VNet peering is not transitive, so spokes can’t talk to each other directly. Route all spoke-to-spoke and outbound traffic through the hub: put Azure Firewall in the hub and force traffic through it using User Defined Routes. Gives you centralized visibility and control.
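User Defined Routes work because Azure picks the route with the longest matching prefix. A small Python sketch of that selection, with a hypothetical route table for the production spoke. The firewall address 10.0.1.4 is an assumed private IP inside AzureFirewallSubnet, and the next-hop strings are just labels for illustration.

```python
import ipaddress

# Hypothetical route table attached to a production spoke subnet.
ROUTES = [
    ("10.1.0.0/16", "VnetLocal"),                 # traffic that stays inside the spoke
    ("0.0.0.0/0", "VirtualAppliance:10.0.1.4"),   # everything else goes to the firewall
]


def next_hop(destination: str) -> str:
    """Pick the next hop of the route with the longest matching prefix."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in ROUTES
        if dest in ipaddress.ip_network(prefix)
    ]
    _, hop = max(matches, key=lambda match: match[0].prefixlen)
    return hop


print(next_hop("10.1.2.10"))  # same spoke
print(next_hop("10.2.1.5"))   # non-prod spoke, so via the firewall
```

Traffic to the other spoke and to the internet both land on the firewall, which is the centralized control point the topology is built around.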

Use Azure Virtual WAN if you need to connect multiple regions or have complex connectivity requirements. Standard hub and spoke works fine for most scenarios.

Plan your IP address space carefully. Use RFC 1918 private address space and make sure it doesn’t overlap with on-premises networks if you have hybrid connectivity. Running out of IP space later is a nightmare to fix.
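Overlap checks are cheap to automate and painful to skip. A minimal sketch using Python’s standard `ipaddress` module. The hub and spoke ranges come from the plan above; the on-premises range is invented for the example.

```python
import ipaddress
from itertools import combinations

# Address plan from the text, plus a made-up on-premises range for illustration.
PLAN = {
    "hub": "10.0.0.0/16",
    "prod-spoke": "10.1.0.0/16",
    "nonprod-spoke": "10.2.0.0/16",
    "on-prem (assumed)": "192.168.0.0/16",
}

RFC1918 = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def find_overlaps(plan: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named ranges whose address space overlaps."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for a, b in combinations(nets, 2)
        if nets[a].overlaps(nets[b])
    ]


def is_rfc1918(cidr: str) -> bool:
    """True if the range sits entirely inside one RFC 1918 block."""
    net = ipaddress.ip_network(cidr)
    return any(net.subnet_of(block) for block in RFC1918)


print(find_overlaps(PLAN))
print(all(is_rfc1918(cidr) for cidr in PLAN.values()))
```

Run something like this whenever a new spoke is proposed, before anyone deploys a VNet.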

Identity and Access Management

Identity is your security perimeter in the cloud. Azure AD (now Microsoft Entra ID) is the foundation. Set this up properly from the start.

Enable MFA for everyone. No exceptions. Azure AD Conditional Access makes this manageable. Require MFA for all admin accounts immediately. Roll it out to regular users on a reasonable timeline.

Use Azure AD Groups for access control. Never assign permissions directly to users. Create groups, assign permissions to groups, add users to groups. Makes access reviews and offboarding way easier.

Implement Privileged Identity Management (PIM). Time-bound access to admin roles. Nobody should have permanent Global Administrator access. Activate it when needed and it expires automatically. Reduces the blast radius when credentials get compromised.

Service principals and managed identities for automation. Don’t use personal accounts for service-to-service authentication. Managed identities eliminate the need to manage credentials entirely. Use them wherever possible.
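To make the PIM idea concrete: access is off by default, switched on for a fixed window, and lapses by itself. A toy Python model of that behavior. This is not the PIM API; the class and its four-hour window are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


class TimeBoundRole:
    """Toy model of just-in-time role activation; not the PIM API."""

    def __init__(self, role: str, max_duration: timedelta):
        self.role = role
        self.max_duration = max_duration
        self.expires_at = None  # no standing access by default

    def activate(self, now: datetime) -> None:
        """Grant access for a fixed window starting at `now`."""
        self.expires_at = now + self.max_duration

    def is_active(self, now: datetime) -> bool:
        """Access holds only inside the activation window."""
        return self.expires_at is not None and now < self.expires_at


start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
admin = TimeBoundRole("Global Administrator", timedelta(hours=4))
admin.activate(start)
print(admin.is_active(start + timedelta(hours=1)))
print(admin.is_active(start + timedelta(hours=5)))
```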

Role assignments follow the principle of least privilege:

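# Bad - broad role at subscription scope, assigned directly to a user
New-AzRoleAssignment -SignInName user@company.com -RoleDefinitionName "Contributor" -Scope "/subscriptions/xxx"
# Good - specific role and scope, assigned to a group (the group name is a placeholder)
$group = Get-AzADGroup -DisplayName "grp-prod-web-vm-operators"
New-AzRoleAssignment -ObjectId $group.Id -RoleDefinitionName "Virtual Machine Contributor" -Scope "/subscriptions/xxx/resourceGroups/rg-prod-web"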

Create custom roles when built-in roles are too permissive. Built-in roles often grant more permissions than needed.
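Scope matters as much as the role itself, because assignments are inherited by everything below the scope. A short Python sketch of that inheritance check, using the same placeholder subscription ID (`xxx`) as the commands above. `assignment_covers` is a hypothetical helper, and the VM names are invented.

```python
def assignment_covers(assignment_scope: str, resource_id: str) -> bool:
    """True if a role assigned at assignment_scope is inherited by resource_id."""
    scope = assignment_scope.rstrip("/").lower()
    target = resource_id.rstrip("/").lower()
    # Match the scope itself or anything nested under it, on a path boundary.
    return target == scope or target.startswith(scope + "/")


BROAD = "/subscriptions/xxx"
NARROW = "/subscriptions/xxx/resourceGroups/rg-prod-web"

web_vm = NARROW + "/providers/Microsoft.Compute/virtualMachines/vm-web-01"
db_vm = "/subscriptions/xxx/resourceGroups/rg-prod-db/providers/Microsoft.Compute/virtualMachines/vm-db-01"

print(assignment_covers(BROAD, db_vm))   # subscription scope reaches everything
print(assignment_covers(NARROW, db_vm))  # resource group scope does not
```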

Security Baseline

Security isn’t something you add later. Build it into the foundation.

Microsoft Defender for Cloud (formerly Azure Security Center): Enable the paid Defender plans (the Standard tier) on all subscriptions. Continuous security assessment and threat protection. Worth every penny.

Microsoft Sentinel (formerly Azure Sentinel): SIEM and SOAR solution. Collects logs from everything, analyzes them, automates responses to threats. Set this up in your management subscription.

Network Security Groups (NSGs): Apply NSGs to every subnet. Default deny inbound from internet. Explicit allow rules for required traffic only.

Azure Key Vault: Centralized secrets management. Application passwords, certificates, encryption keys all go here. Never hardcode secrets in application code or config files.

Diagnostic Settings: Send diagnostic logs to Log Analytics workspace. Every resource that generates logs should send them to your central logging. You can’t investigate incidents without logs.

Update Management: Automated patching for VMs. Schedule maintenance windows, automatically deploy security updates. Unpatched VMs are one of the most common ways environments get compromised.

Azure Backup: Backup policies for everything critical. VMs, databases, file shares. Test restore procedures regularly. Backups you haven’t tested aren’t backups.
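The NSG advice above (default deny inbound, explicit allows only) boils down to priority-ordered matching, where the lowest priority number wins. A toy sketch in Python, not the real NSG engine: the `Rule` fields and `evaluate_inbound` are made up for illustration and match only on a source label and a port.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    priority: int  # lower number is evaluated first
    source: str    # "Internet", "VirtualNetwork", or "*" in this toy model
    port: int
    action: str    # "Allow" or "Deny"


def evaluate_inbound(rules: list[Rule], source: str, port: int) -> str:
    """Return the action of the first matching rule, by ascending priority."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.source in (source, "*") and rule.port == port:
            return rule.action
    return "Deny"  # nothing matched, so fall back to deny


RULES = [
    Rule(100, "Internet", 443, "Allow"),  # the only explicit allow
]

print(evaluate_inbound(RULES, "Internet", 443))
print(evaluate_inbound(RULES, "Internet", 3389))
```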

Security configuration as code:

# Enable Microsoft Defender for Cloud
Set-AzSecurityPricing -Name "VirtualMachines" -PricingTier "Standard"
Set-AzSecurityPricing -Name "SqlServers" -PricingTier "Standard"
Set-AzSecurityPricing -Name "AppServices" -PricingTier "Standard"
Set-AzSecurityPricing -Name "StorageAccounts" -PricingTier "Standard"
# Configure Security Center auto-provisioning
Set-AzSecurityAutoProvisioningSetting -Name "default" -EnableAutoProvision

Deploy this to all subscriptions. Automate it so new subscriptions get configured automatically.

Management and Monitoring

You need visibility into what’s happening in your environment. Azure Monitor provides this.

Log Analytics Workspace: Central log repository. Create one workspace in your management subscription. Configure retention based on compliance requirements. 90 days minimum, longer for regulated industries.

VM insights (formerly Azure Monitor for VMs): Performance metrics, dependency mapping, health monitoring for all VMs. Shows what’s connecting to what, helps troubleshoot performance issues.

Application Insights: Application performance monitoring. Instrument your applications to send telemetry here. Track requests, dependencies, exceptions, custom metrics.

Alerts: Proactive notification when things go wrong. Alert on resource health, performance metrics, security events. Too many alerts and people ignore them; too few and you miss critical issues. Tune carefully.

Workbooks: Custom dashboards in Azure Monitor. Build views that show the metrics that matter for your environment. Share them with the team.

Cost Management: Budget alerts, cost analysis, recommendations. Set budgets at the subscription and resource group level. Get alerted before you blow past your budget, not after.
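"Alerted before, not after" means setting thresholds below 100 percent of the budget. A tiny Python sketch of the arithmetic. The 50/80/100 percent thresholds are my assumption, not an Azure default, and the function name is made up.

```python
def crossed_thresholds(
    spend: float,
    budget: float,
    thresholds: tuple[float, ...] = (0.5, 0.8, 1.0),
) -> list[float]:
    """Return the budget fractions that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]


# 4,200 spent against a 5,000 budget is 84 percent of the budget.
print(crossed_thresholds(4200, 5000))
```

With these numbers the 50 and 80 percent alerts have fired and there is still room to react before the 100 percent mark.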

Automation is critical for consistent management:

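# Deploy diagnostic settings to all resources in the current subscription
$resources = Get-AzResource
$workspaceId = "/subscriptions/xxx/resourceGroups/rg-management/providers/Microsoft.OperationalInsights/workspaces/law-central"
foreach ($resource in $resources) {
    try {
        Set-AzDiagnosticSetting -ResourceId $resource.ResourceId `
            -WorkspaceId $workspaceId `
            -Enabled $true `
            -Name "send-to-law" `
            -ErrorAction Stop
    }
    catch {
        # Not every resource type supports diagnostic settings; log and continue
        Write-Warning "Skipped $($resource.ResourceId): $($_.Exception.Message)"
    }
}

This pushes logs from every resource that supports diagnostic settings to your central workspace; unsupported resource types are skipped with a warning. Newer Az.Monitor releases replace Set-AzDiagnosticSetting with New-AzDiagnosticSetting, so check which version you have installed. Run this regularly to catch new resources.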

Deployment Automation

Infrastructure as Code is non-negotiable. Bicep or Terraform, pick one and use it consistently. Clicking through the portal to deploy resources doesn’t scale and isn’t repeatable.

Store your IaC templates in Git. Use Azure Pipelines or GitHub Actions for deployment. Every infrastructure change goes through source control and automated deployment. Makes changes trackable and reversible.

Example Bicep for deploying a spoke VNet with proper configuration:

param location string = resourceGroup().location
param vnetName string
param addressPrefix string
param subnets array
param hubVnetId string

resource vnet 'Microsoft.Network/virtualNetworks@2023-05-01' = {
  name: vnetName
  location: location
  properties: {
    addressSpace: {
      addressPrefixes: [
        addressPrefix
      ]
    }
    subnets: [for subnet in subnets: {
      name: subnet.name
      properties: {
        addressPrefix: subnet.addressPrefix
        // Assumes an NSG named '<subnet>-nsg' already exists in this resource group
        networkSecurityGroup: {
          id: resourceId('Microsoft.Network/networkSecurityGroups', '${subnet.name}-nsg')
        }
      }
    }]
  }
}

// Spoke-to-hub peering only; the hub needs a matching peering back to this VNet
resource peering 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2023-05-01' = {
  parent: vnet
  name: 'peer-to-hub'
  properties: {
    remoteVirtualNetwork: {
      id: hubVnetId
    }
    allowVirtualNetworkAccess: true
    allowForwardedTraffic: true
    allowGatewayTransit: false
    // Requires a gateway in the hub and allowGatewayTransit enabled on the hub-side peering
    useRemoteGateways: true
  }
}

A template like this deploys a spoke VNet with peering to the hub and attaches NSGs to each subnet. It assumes the NSGs already exist and that the hub-side peering is created separately. Repeatable and consistent.
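The `subnets` parameter is an array of objects with `name` and `addressPrefix`. Rather than typing those by hand, you can generate them. A minimal Python sketch, assuming /24 subnets and skipping the first block so the output lines up with the spoke layout shown earlier. `build_subnets` and the tier names are my own.

```python
import ipaddress
import json


def build_subnets(
    spoke_prefix: str,
    tiers: list[str],
    new_prefix: int = 24,
    start_index: int = 1,
) -> list[dict[str, str]]:
    """Carve consecutive subnets out of the spoke and pair them with tier names."""
    spoke = ipaddress.ip_network(spoke_prefix)
    candidates = list(spoke.subnets(new_prefix=new_prefix))
    chosen = candidates[start_index:start_index + len(tiers)]
    if len(chosen) < len(tiers):
        raise ValueError("spoke address space is too small for the requested tiers")
    return [
        {"name": name, "addressPrefix": str(net)}
        for name, net in zip(tiers, chosen)
    ]


# Value for the template's `subnets` parameter, printed as JSON.
print(json.dumps(build_subnets("10.1.0.0/16", ["web", "app", "data"]), indent=2))
```

With the template above, these names would expect NSGs called `web-nsg`, `app-nsg`, and `data-nsg` to exist already.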

Documentation

Document your foundation architecture. Not 100-page Word documents that nobody reads. Practical documentation that helps people understand how things work.

What to document:

  • Management group and subscription structure with purpose of each
  • Network topology with IP address allocations
  • RBAC model and how to request access
  • Naming conventions and tagging standards
  • Deployment process for new workloads
  • Security requirements and policies
  • Monitoring and alerting configuration
  • Incident response procedures

Keep documentation in your Git repository with your IaC templates. Update it when you change the infrastructure. Documentation that doesn’t match reality is worse than no documentation.

The Implementation Reality

You won’t implement all of this on day one. That’s fine. Prioritize based on risk and impact.

Week one priorities:

  1. Management group structure
  2. Initial subscriptions
  3. Basic Azure Policy for location restrictions and tagging
  4. Hub VNet with Azure Firewall
  5. Azure AD groups for access control
  6. Log Analytics workspace
  7. Microsoft Defender for Cloud enabled

Everything else can be iterative. Get the foundation laid, then build on it. The key is having a plan and working toward it systematically.

Common Mistakes to Avoid

Starting without a plan. Deploying a few resources and figuring it out later creates technical debt that’s expensive to fix.

Over-engineering for future scale. Build for current needs with room to grow. Don’t architect for Google-scale when you’re running 50 VMs.

Skipping automation. Manual processes don’t scale and lead to configuration drift. Automate from the start.

Weak RBAC. Giving everyone Contributor access because it’s easier creates security problems. Take the time to set up proper roles.

Ignoring costs. Azure costs add up fast without governance. Budget alerts and policies to restrict expensive resources are essential.

No monitoring strategy. You need to know when things break before your users do. Set up monitoring early.

Moving Forward

A solid cloud foundation isn’t glamorous work, but it’s necessary. It prevents the chaos that makes Azure environments unmaintainable. Invest the time upfront and you’ll spend less time firefighting later.

Start with the basics, automate everything, iterate and improve. Your future self will appreciate having done this properly.

The foundation work never really ends. New services launch, requirements change, threats evolve. Build a foundation that’s flexible enough to adapt while maintaining the core principles of security, governance, and manageability.

Get the boring stuff right and the exciting stuff becomes possible.