
Cloud computing promised to eliminate wasteful capital expenditure. In practice, many organisations have simply replaced predictable CAPEX with unpredictable—and often shocking—OPEX. Cloud bills grow 20–30% year-over-year for the average enterprise, and Gartner estimates that organisations waste 35% of their cloud spend on idle or over-provisioned resources. This guide provides a systematic approach to reclaiming that waste.
Understanding Cloud Cost Drivers
Before optimising, you need to understand what is driving your bill:
Compute (60–70% of typical cloud bills)
- Virtual machine instances running at low utilisation
- Instances left running 24/7 when they are only needed during business hours
- Over-provisioned instance sizes chosen for "safety margin" and never reviewed
Storage (15–25%)
- Old snapshots, backups, and AMIs never cleaned up
- Data stored in high-performance tiers that does not require fast access
- Unattached EBS volumes (AWS) or orphaned managed disks (Azure) from deleted VMs
Data Transfer / Egress (5–15%)
- Data transferred out of the cloud to the internet or between regions
- Poorly architected applications making unnecessary cross-region calls
- Static content served directly from origin instead of through a CDN
Managed Services
- Databases, Kubernetes clusters, and load balancers running at low utilisation
- Redundant services deployed in multiple regions without genuine HA need
The FinOps Framework
FinOps (Financial Operations) is the cultural and organisational practice of bringing financial accountability to cloud spending. It involves three functions working together:
- Engineering: Build cost-efficient architectures; understand the cost impact of technical decisions
- Finance: Forecast cloud costs accurately; budget for cloud as OPEX
- Business: Prioritise investments; make trade-off decisions between cost and performance
FinOps Lifecycle
- Inform: Full visibility into who is spending what on which resources
- Optimise: Identify and eliminate waste; right-size resources
- Operate: Continuous cost management embedded into engineering workflows
Cost Optimisation Tactics
1. Rightsizing Compute
Most organisations provision instances based on peak projected demand and never revisit them. In reality:
- 40–60% of cloud VMs run at < 20% CPU utilisation consistently
- Memory is similarly over-provisioned
Action:
- Review CPU and memory utilisation metrics over a 14–30 day period
- Downsize instances where average utilisation is below 40% and peak is below 60%
- Use AWS Compute Optimizer or Azure Advisor for automated rightsizing recommendations
- Do not resize production without a test period in staging
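The screening rule above can be sketched as a small function (the thresholds and sampling window are assumptions to tune per workload, not a provider recommendation):

```python
from statistics import mean

def recommend_downsize(cpu_samples, avg_threshold=40.0, peak_threshold=60.0):
    """Return True if an instance is a downsizing candidate.

    cpu_samples: CPU utilisation percentages sampled over a 14-30 day
    window. Rule of thumb from the text: average below 40% AND peak
    below 60%.
    """
    if not cpu_samples:
        return False  # no data: never resize blind
    return mean(cpu_samples) < avg_threshold and max(cpu_samples) < peak_threshold
```

A VM with a low average but a high burst peak correctly fails the check, which is why the rule looks at both statistics rather than the average alone.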
2. Reserved Instances and Savings Plans
On-demand pricing is the most expensive way to run cloud infrastructure. For workloads with predictable usage, commit to reserved capacity:
| Commitment | Discount vs On-Demand |
|---|---|
| 1-year Reserved Instance (no upfront) | 30–40% |
| 1-year Reserved Instance (all upfront) | 40–50% |
| 3-year Reserved Instance (all upfront) | 50–65% |
| AWS Savings Plans (compute) | 20–66% |
| Azure Hybrid Benefit (Windows Server) | Up to 40% additional |
Strategy: Reserve your stable baseline capacity (running 24/7) and use on-demand or Spot for variable workloads.
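A minimal sketch of the baseline-reservation maths, assuming an illustrative 40% RI discount (actual discounts vary by term, payment option, and instance family, per the table above):

```python
def blended_hourly_cost(hourly_counts, od_rate, ri_discount=0.40):
    """Cost of covering a fluctuating fleet with RIs for the stable
    baseline and on-demand for the peaks.

    hourly_counts: instances needed in each hour of a sample window.
    od_rate: on-demand price per instance-hour.
    """
    baseline = min(hourly_counts)            # always-on capacity -> reserve it
    ri_rate = od_rate * (1 - ri_discount)    # discounted reserved rate
    ri_cost = baseline * ri_rate * len(hourly_counts)
    od_cost = sum(max(n - baseline, 0) for n in hourly_counts) * od_rate
    return ri_cost + od_cost
```

Reserving above the baseline risks paying for committed capacity that sits idle; reserving below it leaves guaranteed savings on the table.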
3. Spot / Preemptible Instances
AWS Spot Instances and Azure Spot VMs use spare cloud capacity at 60–90% discount vs on-demand pricing. They can be interrupted at short notice (2 minutes on AWS; as little as 30 seconds on Azure), making them suitable for:
- Batch processing and data analytics
- CI/CD build agents
- Stateless, fault-tolerant application tiers with auto-scaling
- Development and test environments
Not suitable for: Stateful databases, synchronous customer-facing APIs, anything without graceful shutdown handling.
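The "graceful shutdown handling" requirement is the crux. The interruption notice actually arrives via the instance metadata service; many batch and orchestration setups translate it into a SIGTERM to the workload, which is what this sketch assumes:

```python
import signal

interrupted = False

def _drain(signum, frame):
    # Mark the worker as draining: finish the current job, take no new ones.
    global interrupted
    interrupted = True

# Assumption: the orchestrator delivers the spot interruption notice
# to the worker process as SIGTERM.
signal.signal(signal.SIGTERM, _drain)

def process_batch(jobs):
    """Process jobs until an interruption is signalled; unfinished jobs
    are left in the queue for another worker to pick up."""
    done = []
    for job in jobs:
        if interrupted:
            break
        done.append(job)  # placeholder for the real unit of work
    return done
```

The pattern is stop-pulling-new-work rather than abort: the notice window is short, so each unit of work must be small enough to finish or safe to re-queue.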
4. Auto-Scaling
Implement auto-scaling so you pay for compute only when you need it:
- Scale out during peak demand; scale in (and stop billing) during off-peak periods
- For dev/test environments: schedule automatic shutdown outside business hours (saves roughly 70% of instance-hours on hours-based billing)
- Use predictive auto-scaling (available on AWS and Azure) for workloads with predictable traffic patterns
Quick win: Identify all non-production environments and schedule automatic shutdown 18:00–08:00 on weekdays and all weekend. Instances then run only 50 of 168 weekly hours, so for a 50-instance dev environment this alone cuts compute cost by roughly 70%.
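The arithmetic behind this quick win, as a sketch (an 18:00–08:00 weekday shutdown plus full weekends leaves instances running 10 hours a day, five days a week):

```python
def scheduled_savings(on_hours_per_weekday=10, weekdays=5):
    """Fraction of weekly instance-hours saved by an off-hours
    shutdown schedule. Note: stopped instances may still incur
    storage and reserved-IP charges, so total savings are slightly lower."""
    week_hours = 24 * 7                        # 168 hours in a week
    running = on_hours_per_weekday * weekdays  # 50 hours actually running
    return 1 - running / week_hours
```

Tightening the schedule (e.g. 9 running hours per weekday) pushes the saved fraction higher still; the function makes that trade-off easy to quote in a business case.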
5. Storage Optimisation
Snapshot and backup hygiene:
- Implement lifecycle policies to automatically delete EBS snapshots / Azure disk snapshots older than 30 days (adjust to your retention policy)
- Audit orphaned volumes (unattached disks) monthly and delete if no longer needed
- Clean up old AMIs (AWS) and custom images (Azure) systematically
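A sketch of the retention filter at the heart of such a lifecycle policy, assuming your inventory tooling yields (id, creation-time) pairs rather than calling a cloud API directly:

```python
from datetime import datetime, timedelta, timezone

def expired_snapshots(snapshots, retention_days=30, now=None):
    """Select snapshot IDs older than the retention window.

    snapshots: iterable of (snapshot_id, created_at) pairs with
    timezone-aware datetimes. retention_days should match your
    organisation's retention policy, not a hard-coded 30.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [sid for sid, created in snapshots if created < cutoff]
```

In practice the managed equivalents (e.g. lifecycle policies in the provider's backup tooling) are preferable to hand-rolled deletion scripts, but the selection logic is the same.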
Storage tiering:
- Move infrequently accessed data to cheaper tiers: AWS S3 Intelligent-Tiering, Azure Cool/Archive Blob Storage
- Enable S3 Intelligent-Tiering for any bucket where access patterns are uncertain—it automatically moves objects between tiers based on access frequency, with no retrieval fees (a small per-object monitoring charge applies)
Target storage costs by access frequency:
- Accessed daily → SSD/hot tier
- Accessed monthly → standard
- Accessed quarterly → cold
- Accessed rarely → archive (80–95% cheaper than SSD)
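That ladder can be expressed as a simple tier-selection helper (the tier names are generic labels from the text, not provider SKUs; the day thresholds are assumptions to adjust):

```python
def recommend_tier(days_between_accesses):
    """Map observed access frequency to a storage tier, following the
    daily / monthly / quarterly / rarely ladder above."""
    if days_between_accesses <= 1:
        return "hot"       # SSD / hot tier
    if days_between_accesses <= 30:
        return "standard"
    if days_between_accesses <= 90:
        return "cold"
    return "archive"       # cheapest, but slow and costly to retrieve
```

Retrieval latency and per-GB retrieval fees rise as you move down the ladder, so frequency of access, not just cost per GB, should drive the choice.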
6. Tagging and Cost Allocation
Without tagging, cloud bills are a black box. Implement a mandatory tagging policy:
Required tags for every resource:
- `Environment`: production / staging / development / test
- `Owner`: team or individual responsible
- `CostCentre`: department budget code
- `Application`: application or project name
- `Expiry`: for ephemeral resources (auto-delete after date)
Enforce tagging via AWS Service Control Policies or Azure Policy (resources without required tags cannot be created).
Use tag-based cost allocation reports to show each team their actual cloud spend and hold them accountable.
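A minimal compliance check for this policy, useful in CI or a scheduled audit job (a sketch; `Expiry` is conditional on the resource being ephemeral, so it is enforced separately):

```python
# The four unconditionally required tags from the policy above.
REQUIRED_TAGS = {"Environment", "Owner", "CostCentre", "Application"}

def missing_tags(tags):
    """Return the required tag keys a resource lacks.

    tags: mapping of tag key -> value as exported by your inventory
    tooling. An empty result means the resource is compliant.
    """
    return REQUIRED_TAGS - set(tags)
```

The preventive enforcement (SCPs / Azure Policy) stops new untagged resources; a check like this is for sweeping up the existing estate.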
7. Eliminate Zombie Resources
Zombie resources are idle or abandoned cloud assets still generating charges:
- Idle load balancers with no healthy targets
- Elastic IPs (AWS) not associated with running instances (charged when idle)
- Empty S3 buckets with replication enabled
- Stopped VMs (still charged for storage and reserved IPs)
- Unused NAT Gateways
Tools: AWS Cost Explorer, Azure Cost Management + Billing, CloudHealth, and Apptio Cloudability all provide idle resource reports.
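The detection rules behind those reports are simple predicates over an inventory. A sketch over a normalised record format (the field names are assumptions about your own tooling, not a cloud API):

```python
def find_zombies(inventory):
    """Flag idle resources from a normalised inventory of dicts.

    Rules mirror the list above: load balancers with no healthy
    targets, unassociated elastic IPs, and unattached volumes.
    """
    zombies = []
    for r in inventory:
        if r["type"] == "load_balancer" and r.get("healthy_targets", 0) == 0:
            zombies.append(r["id"])
        elif r["type"] == "elastic_ip" and not r.get("associated", False):
            zombies.append(r["id"])
        elif r["type"] == "volume" and not r.get("attached", False):
            zombies.append(r["id"])
    return zombies
```

Flag first, delete later: a human review step between detection and deletion avoids destroying something that was idle for a legitimate reason (e.g. a disaster-recovery standby).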
8. Architect for Cost
Cost efficiency should be a first-class architectural requirement:
- Serverless (AWS Lambda, Azure Functions): Pay only for execution time; no idle cost. Ideal for event-driven, intermittent workloads
- Containers on managed Kubernetes (EKS, AKS): Higher density than VMs; bin-packing reduces per-workload cost
- CDN for static content: CloudFront/Azure CDN is dramatically cheaper than serving static assets from compute instances
- Regional architecture review: Data transfer between AWS regions is charged; unnecessary cross-region calls add up
Building a FinOps Practice
Immediate Actions (Week 1)
- Enable Cost Explorer (AWS) or Cost Analysis (Azure) and review last 90 days of spend by service, region, and tag
- Identify top 10 most expensive resources — investigate utilisation
- Schedule auto-shutdown for all non-production environments
30-Day Actions
- Complete rightsizing analysis; implement recommendations for 5+ largest instances
- Purchase Reserved Instances for stable production workloads
- Implement tagging policy and enforce via Policy-as-Code
- Clean up orphaned volumes, old snapshots, and idle load balancers
90-Day Actions
- Establish monthly FinOps review cadence with engineering and finance
- Integrate cloud cost reporting into engineering team dashboards
- Set budgets and anomaly detection alerts in cloud cost management tools
- Develop storage lifecycle policies across all environments
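The idea behind anomaly alerts can be sketched as a simple threshold on daily spend (managed tools use more sophisticated, seasonality-aware models, but the principle is the same):

```python
from statistics import mean, stdev

def spend_anomalies(daily_spend, sigma=3.0):
    """Return indices of days whose spend exceeds mean + sigma * stddev
    of the series -- a basic outlier rule for cost spikes."""
    if len(daily_spend) < 2:
        return []  # not enough history to establish a baseline
    mu, sd = mean(daily_spend), stdev(daily_spend)
    threshold = mu + sigma * sd
    return [i for i, spend in enumerate(daily_spend) if spend > threshold]
```

A spike flagged on day one of the billing cycle is worth far more than the same finding in a month-end review, which is why alerting belongs in the 90-day plan rather than only in retrospectives.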
Cloud cost optimisation is not a one-time project—it is a continuous discipline. Organisations that embed FinOps practices consistently reduce cloud spend by 20–35% without sacrificing performance or reliability.
