Guide · 9 min read

The hidden economics of cloud over-provisioning

Most load-balanced services quietly run at 15–25% utilization. Here's the queuing math that explains why, and exactly how to right-size without risking your latency SLO.

If you run a load-balanced web service — an API behind an ALB, a fleet of pods behind a Kubernetes Service, a set of VMs behind a cloud load balancer — there is a good chance you are overpaying by 40% or more, every single month, and your CPU graphs are actively hiding it from you. This isn't a story about reserved instances or spot pricing. It's about a more basic question almost nobody answers correctly: how many replicas does this service actually need?

Why “CPU looks fine” is the most expensive sentence in infra

The default way teams size a service is feedback by fear. Something fell over once, so replicas got bumped from 4 to 6. Then a launch was coming, so 6 became 8. The launch passed; the 8 stayed. Each bump felt responsible. Nobody ever bumped down, because down has a downside you can imagine (an outage) and up has a downside you can't see (a line item buried in a five-figure cloud bill).

The result is a fleet sitting at 18% average CPU, and a dashboard that says, reassuringly, “CPU looks fine.” It is fine. That's the problem. A service at 18% utilization is a service where 82% of what you provisioned is doing nothing but costing money, 730 hours a month.

The checkout-lane model

Forget servers for a second. Picture a supermarket. Requests are shoppers arriving at the front. Replicas are checkout lanes. Each shopper takes some time to check out — that's your per-request service time (your average latency). Queuing theory, the century-old branch of math behind call centers and toll booths, gives us one beautifully simple number called the offered load:

offered load (Erlangs) = arrival rate × service time

example:
  40 requests/sec × 0.055 sec/request = 2.2

That 2.2 is the number that matters. It means that, on average, only 2.2 checkout lanes are ever actually busy at the same time. If you're running 8 lanes for that traffic, six of them are, on average, standing empty. The offered load is the honest floor of your workload, and it's usually a fraction of your replica count.
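Here is that arithmetic as a minimal Python sketch. The function name `offered_load` and the variable names are illustrative choices, not part of any library; the inputs are the example figures above.

```python
def offered_load(arrival_rate_rps: float, service_time_s: float) -> float:
    """Offered load in Erlangs: the average number of replicas busy at once."""
    return arrival_rate_rps * service_time_s


current_replicas = 8
load = offered_load(40, 0.055)          # 40 requests/sec, 55 ms per request
idle = current_replicas - load          # lanes standing empty, on average

print(f"offered load: {load:.1f} Erlangs")
print(f"idle on average: {idle:.1f} of {current_replicas} replicas")
```

For the example traffic this prints an offered load of 2.2 Erlangs and 5.8 idle replicas out of 8.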

So why not run exactly 2.2 replicas?

Because averages lie about bursts. Shoppers don't arrive in a smooth trickle; they clump. If you provision exactly at the average, the first clump forms a queue, and queue waiting time behaves non-linearly: it rises gently as utilization climbs toward 70%, then shoots toward infinity as you approach 100%. This is the single most important and least-understood fact in capacity planning. The relevant formula, Erlang-C, estimates the probability an arriving request has to wait at all, and from it the mean queue delay:

mean queue wait ≈ P(wait) / (replicas × service_rate − arrival_rate)

where:
  service_rate = 1 ÷ service_time (requests/sec one replica can complete)
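If you want to see the shape of that curve yourself, the formula can be sketched in a few lines of Python. This assumes the textbook M/M/c model behind Erlang-C (Poisson arrivals, exponentially distributed service times), which real traffic only approximates. The function names `erlang_c` and `mean_queue_wait` are illustrative, and P(wait) is computed through the standard Erlang-B recurrence to avoid large factorials.

```python
import math


def erlang_c(replicas: int, offered_load: float) -> float:
    """Probability that an arriving request has to queue (Erlang-C)."""
    if offered_load >= replicas:
        return 1.0  # unstable: every request waits and the queue keeps growing
    # Erlang-B via the standard recurrence B(k) = a*B(k-1) / (k + a*B(k-1)).
    b = 1.0
    for k in range(1, replicas + 1):
        b = offered_load * b / (k + offered_load * b)
    rho = offered_load / replicas  # average utilization per replica
    # Convert the Erlang-B blocking probability into the Erlang-C wait probability.
    return b / (1.0 - rho * (1.0 - b))


def mean_queue_wait(replicas: int, arrival_rate: float, service_time: float) -> float:
    """Mean time (seconds) a request spends queued before service starts."""
    load = arrival_rate * service_time
    if load >= replicas:
        return math.inf
    service_rate = 1.0 / service_time
    return erlang_c(replicas, load) / (replicas * service_rate - arrival_rate)


# Same traffic as the example: 40 requests/sec at 55 ms each.
for n in (3, 4, 8):
    wait_ms = 1000 * mean_queue_wait(n, 40, 0.055)
    print(f"{n} replicas: utilization {2.2 / n:.0%}, mean queue wait {wait_ms:.2f} ms")
```

Under these assumptions, dropping from 4 replicas to 3 multiplies the mean queue wait several times over, while going from 4 to 8 buys almost nothing.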

The practical takeaway from that math is a band, not a point. Run a load-balanced service at 55–70% average utilization. Below 55%, you're burning money on idle lanes. Above ~80%, your p95 latency starts ballooning and you're one traffic spike from an incident. The healthy fleet size is offered load ÷ 0.65, rounded up, then floored at whatever HA minimum you need (usually 2 or 3 so a single node failure can't take you down).

For our example: 2.2 ÷ 0.65 ≈ 3.4 → 4 replicas. Not 8. If each replica is a t3.large at ~$0.083/hr, that's 4 × $0.083 × 730 ≈ $242/mo instead of 8 × $0.083 × 730 ≈ $485/mo. That's about $242 a month, roughly $2,900 a year, reclaimed from one service, with utilization moving to a healthier 55% (2.2 ÷ 4) and your latency SLO still intact.
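The sizing rule and the cost arithmetic can be sketched together. The names `target_replicas` and `monthly_cost` are illustrative, the default HA minimum of 2 is an assumption you should replace with your own, and the hourly rate is the approximate figure used in this article, not a quoted price.

```python
import math

HOURS_PER_MONTH = 730


def target_replicas(offered_load: float,
                    target_utilization: float = 0.65,
                    ha_minimum: int = 2) -> int:
    """Offered load divided by target utilization, rounded up, floored at the HA minimum."""
    return max(ha_minimum, math.ceil(offered_load / target_utilization))


def monthly_cost(replicas: int, cost_per_hour: float) -> float:
    """On-demand cost of running `replicas` instances for a full month."""
    return replicas * cost_per_hour * HOURS_PER_MONTH


load = 40 * 0.055                      # offered load from the example, 2.2 Erlangs
current, rate = 8, 0.083               # current fleet and approximate hourly rate
sized = target_replicas(load)

saving = monthly_cost(current, rate) - monthly_cost(sized, rate)
print(f"right-sized fleet: {sized} replicas at {load / sized:.0%} utilization")
print(f"monthly saving: ${saving:.2f}, yearly: ${12 * saving:.2f}")
```

Note how the HA floor only matters for small services: with an offered load of 0.4, the division gives 1 replica and the floor lifts it to 2.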

The five numbers you need

Right-sizing a service correctly requires exactly five inputs, all of which your existing dashboards already show:

Current replica count — how many instances are behind the load balancer right now.
Cost per replica per hour — the on-demand rate for one instance.
Average request rate — requests/sec hitting the load balancer.
Average service time — how long one request takes (your mean latency).
Your p95 latency target — the SLO you refuse to breach.

From those five, the safe replica count is fully determined. You don't need cloud credentials, an agent, or a month of metrics history to get the first answer. You need arithmetic. (You do want continuous monitoring afterward, because traffic drifts and today's right-size becomes next quarter's overspend, but that's step two.)

A step-by-step right-sizing playbook

1. Compute offered load: arrival_rate × service_time.
2. Target fleet size: ceil(offered_load ÷ 0.65), then floor at your HA minimum.
3. Sanity-check the p95 at that size with Erlang-C (or let a tool do it). If it breaches your SLO, add one replica and re-check.
4. Set your autoscaler's min replicas to that number and its target to 65% CPU, rather than a low static threshold that pins you to the floor.
5. Roll the change during low traffic. Watch p95 across one full peak cycle. Bank the savings.
6. Re-audit after any traffic-pattern change or instance migration. Utilization economics shift underneath you silently.
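The first three steps of the playbook can be sketched as one loop. Treat this as an illustration under stated assumptions, not a definitive sizing tool: it assumes the M/M/c model (Poisson arrivals, exponential service times), under which the response-time tail has a closed form, and it finds the p95 by bisection. All function names (`erlang_c`, `response_tail`, `estimate_p95`, `right_size`) and the `max_replicas` safety cap are hypothetical choices for this sketch.

```python
import math


def erlang_c(replicas: int, load: float) -> float:
    """Probability an arriving request queues; requires load < replicas."""
    b = 1.0
    for k in range(1, replicas + 1):
        b = load * b / (k + load * b)
    return b / (1.0 - (load / replicas) * (1.0 - b))


def response_tail(t, replicas, arrival_rate, service_time):
    """P(queue wait + service time > t) for an M/M/c queue."""
    mu = 1.0 / service_time
    theta = replicas * mu - arrival_rate          # spare capacity, requests/sec
    p_wait = erlang_c(replicas, arrival_rate * service_time)
    if math.isclose(theta, mu):                   # wait and service share one rate
        queued = (1.0 + mu * t) * math.exp(-mu * t)
    else:                                         # sum of two different exponentials
        queued = (theta * math.exp(-mu * t) - mu * math.exp(-theta * t)) / (theta - mu)
    return (1.0 - p_wait) * math.exp(-mu * t) + p_wait * queued


def estimate_p95(replicas, arrival_rate, service_time):
    """Smallest t (seconds) with P(response > t) <= 5%, found by bisection."""
    if arrival_rate * service_time >= replicas:
        return math.inf
    lo, hi = 0.0, service_time
    while response_tail(hi, replicas, arrival_rate, service_time) > 0.05:
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if response_tail(mid, replicas, arrival_rate, service_time) > 0.05:
            lo = mid
        else:
            hi = mid
    return hi


def right_size(arrival_rate, service_time, p95_target,
               ha_minimum=2, target_utilization=0.65, max_replicas=1000):
    """Steps 1-3: offered load, target fleet size, then add replicas until p95 fits."""
    # Under this model, no fleet size can beat the p95 of the service time itself.
    if p95_target <= -service_time * math.log(0.05):
        raise ValueError("p95 target is below the p95 of the service time alone")
    load = arrival_rate * service_time                                 # step 1
    replicas = max(ha_minimum, math.ceil(load / target_utilization))   # step 2
    while estimate_p95(replicas, arrival_rate, service_time) > p95_target:  # step 3
        replicas += 1
        if replicas > max_replicas:
            raise RuntimeError("no fleet size under max_replicas meets the target")
    return replicas


print(right_size(40, 0.055, p95_target=0.25))
```

For the example traffic and a 250 ms p95 target, the loop settles on the 4 replicas computed earlier; a tighter target makes it add replicas one at a time, exactly as step 3 describes. Measured p95 on your own service is still the final check, since real latency distributions rarely match the model.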

The one case where spending more saves money

Right-sizing isn't always “cut replicas.” If your service is already at 75%+ average utilization, you're in the danger zone where a normal Tuesday spike pushes you past 90% and your queue delay — and your p95 — explodes. Here the math tells you to add capacity, and it's the rare infra spend that pays for itself by preventing the SLA penalties, pages, and churn that overload causes. The same five numbers tell you which situation you're in.
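A quick way to tell which situation you're in is to compute current utilization from the same inputs and compare it against the thresholds in this guide. The sketch below is illustrative: the function name `classify` and the label strings are made up for this example, and the "watch" label for the gap between 70% and 75% is an assumption, since the guide does not name that range.

```python
def classify(replicas: int, arrival_rate: float, service_time: float) -> tuple[float, str]:
    """Return (average utilization, verdict) for the current fleet."""
    utilization = arrival_rate * service_time / replicas
    if utilization < 0.55:
        verdict = "over_provisioned"   # paying for idle lanes: consider cutting
    elif utilization <= 0.70:
        verdict = "healthy"            # inside the 55-70% band
    elif utilization < 0.75:
        verdict = "watch"              # assumption: above the band, not yet critical
    else:
        verdict = "danger"             # 75%+: add capacity before the next spike
    return utilization, verdict


for fleet, rps in ((8, 40), (4, 45), (2, 30)):
    utilization, verdict = classify(fleet, rps, 0.055)
    print(f"{fleet} replicas at {rps} req/s: {utilization:.0%} -> {verdict}")
```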

Stop guessing — get the number

You can do all of this by hand. Or you can paste your five numbers into ScaleSaver and get the right replica count, the dollar figure you're wasting, the estimated p95 at the new size, and a plain-English explanation you can forward to your team — in about thirty seconds, free, no signup.

See what one service is costing you in idle capacity.

Run the free audit →

ScaleSaver is built for solo founders and small SaaS teams who'd rather spend the $2,900/yr on growth than on idle servers. See pricing →