We Lost 3 Hours of Production Deployments Because of One Silent Node Provisioning Failure

By KP | TZoneLabs | DevOps & Cloud Engineering

We were in the middle of a production scaling event when everything went quiet — in the worst way possible.
No crash. No alert. No obvious Kubernetes error. Just pods stuck in Pending, GitHub Actions
deployment jobs timing out, and the entire team staring at dashboards that were still showing green.
It took us 3 hours to find the root cause. The fix took 15 minutes.

This post walks you through exactly what happened, every command we ran, where we went wrong,
and what we put in place so it never silently fails again.

What Happened

Our application was experiencing a spike in traffic. The Kubernetes Cluster Autoscaler kicked in
and requested new worker nodes from the cloud provider. From Kubernetes’ perspective, everything looked normal:

The control plane was healthy
Existing nodes were running fine
Autoscaler logs showed scaling requests being made

But the new nodes never joined the cluster.

Meanwhile, our CI/CD pipelines kept triggering deployments. New pods were scheduled but had nowhere to run.
They sat in Pending indefinitely. Pipelines timed out. Engineers started investigating —
in all the wrong places.

Hour 1: Looking in the Wrong Places

When pods go Pending, the first instinct is to look inside Kubernetes.
That’s where we wasted an hour.

Step 1 — Check pod status

kubectl get pods --all-namespaces | grep -i pending

We had a growing list of pods in Pending. So we described one:

kubectl describe pod <pod-name> -n <namespace>

The event section showed:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----                -------
  Warning  FailedScheduling  5m    default-scheduler  0/3 nodes are available: 3 Insufficient cpu.

So the scheduler couldn’t find a node with enough CPU. Our first assumption: the pods had
too-high resource requests, or we had a resource leak.

Step 2 — Check node resource usage

kubectl top nodes

NAME                        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-1-100.ec2.internal   1850m       92%    6Gi             85%
ip-10-0-1-101.ec2.internal   1780m       89%    5.8Gi           82%
ip-10-0-1-102.ec2.internal   1900m       95%    7Gi             90%

Nodes were saturated. But we’d already seen the autoscaler requesting new nodes. So where were they?
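If you would rather have a pass/fail signal than a table to eyeball, a small filter over that output does the job. This is a sketch, not something we ran during the incident: `saturated_nodes` is a made-up helper name, and the threshold is whatever you choose to pass in.

```shell
#!/usr/bin/env bash
# saturated_nodes: read `kubectl top nodes` output on stdin and print the
# nodes whose CPU% is at or above a threshold (hypothetical helper; the
# default of 85 is an illustrative value, not a recommendation).
saturated_nodes() {
  local threshold="${1:-85}"
  awk -v t="$threshold" '
    NR == 1 { next }            # skip the header row
    {
      pct = $3                  # third column is CPU%, e.g. "92%"
      sub(/%/, "", pct)
      if (pct + 0 >= t) print $1, $3
    }'
}

# Usage (requires metrics-server in the cluster):
#   kubectl top nodes | saturated_nodes 90
```

Keep in mind that `kubectl top` reports actual usage, while the scheduler decides on resource requests, so treat this as a hint rather than proof of why a pod is unschedulable.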

Step 3 — Check autoscaler logs

kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100

The autoscaler logs showed it had requested scale-up events:

I0612 08:14:22.112233  1 scale_up.go:453] Scale-up: setting group NodeGroup/aws:///us-east-1a/ng-xxxxx size to 5
I0612 08:14:22.115344  1 factory.go:33] Event(v1.ObjectReference...): type: 'Normal' reason: 'ScaledUpGroup'

Scale-up was requested. But 20 minutes later, no new nodes.

Step 4 — Check node status

kubectl get nodes -o wide

NAME                         STATUS   ROLES    AGE   VERSION   INTERNAL-IP
ip-10-0-1-100.ec2.internal   Ready    <none>   3d    v1.28.0   10.0.1.100
ip-10-0-1-101.ec2.internal   Ready    <none>   3d    v1.28.0   10.0.1.101
ip-10-0-1-102.ec2.internal   Ready    <none>   3d    v1.28.0   10.0.1.102

Only 3 nodes — the same 3 that were there before the scale event. Nothing new had joined.

Step 5 — Check cluster events

kubectl get events -n kube-system --sort-by='.lastTimestamp' | tail -30

Nothing obvious. Some FailedScheduling warnings but no provisioning errors inside Kubernetes.
That’s when we should have stopped looking inside the cluster. We didn’t — not for another 40 minutes.

We checked Helm values, deployment manifests, network policies, pod disruption budgets. All clean.

Hour 2: Moving Up the Stack

We finally shifted focus outside Kubernetes and started looking at the cloud layer.

Step 6 — Check the Auto Scaling Group in AWS

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names <your-asg-name> \
  --query 'AutoScalingGroups[*].{Min:MinSize,Max:MaxSize,Desired:DesiredCapacity,Instances:Instances[*].InstanceId}' \
  --output table

The Desired capacity had been updated to 5 (autoscaler had done its job), but only 3 instances were listed.
Something was preventing the new instances from launching.
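The desired-versus-actual comparison is easy to script so nobody has to read a table under pressure. A minimal sketch, assuming you feed it the two numbers from the CLI; `asg_gap` and `ASG_NAME` are hypothetical names, not part of our tooling.

```shell
#!/usr/bin/env bash
# asg_gap: report how many instances an Auto Scaling group is short of its
# desired capacity. Takes the two numbers as arguments so the arithmetic can
# be checked without AWS access (hypothetical helper name).
asg_gap() {
  local desired="$1" actual="$2"
  local gap=$(( desired - actual ))
  if [ "$gap" -gt 0 ]; then
    echo "ASG is $gap instance(s) short: desired=$desired actual=$actual"
    return 1
  fi
  echo "ASG at desired capacity: desired=$desired actual=$actual"
}

# Usage (ASG_NAME is a placeholder you set yourself):
#   read -r desired actual < <(aws autoscaling describe-auto-scaling-groups \
#     --auto-scaling-group-names "$ASG_NAME" \
#     --query 'AutoScalingGroups[0].[DesiredCapacity, length(Instances)]' \
#     --output text)
#   asg_gap "$desired" "$actual"
```

A gap right after a scale-up request is normal. A gap that persists for many minutes is the signal worth acting on.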

Step 7 — Check Auto Scaling Group activity history

aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name <your-asg-name> \
  --max-items 10 \
  --output table

This is where we found the first real clue:

⚠️ StatusCode: Failed
StatusMessage: We currently do not have sufficient capacity for the instance type you requested…

⚠️ StatusCode: Failed
StatusMessage: The requested configuration is currently not supported…
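Different failure messages call for different responses, so it can help to bucket them. The sketch below is an illustration only: `classify_launch_failure` is a hypothetical helper, and the substrings it matches (including the "vCPU limit" wording for quota errors) are assumptions you should check against the messages your own account returns.

```shell
#!/usr/bin/env bash
# classify_launch_failure: map an ASG StatusMessage to a coarse category.
# The substrings are assumptions based on commonly seen EC2 error text;
# verify them against real messages before relying on the result.
classify_launch_failure() {
  local msg
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"vcpu limit"*|*"quota"*)   echo "quota" ;;
    *"sufficient capacity"*)    echo "capacity" ;;
    *"not supported"*)          echo "configuration" ;;
    *)                          echo "unknown" ;;
  esac
}

# Usage: list only the failed activities, then classify each message.
# ASG_NAME is a placeholder you set yourself.
#   aws autoscaling describe-scaling-activities \
#     --auto-scaling-group-name "$ASG_NAME" \
#     --query "Activities[?StatusCode=='Failed'].StatusMessage" \
#     --output text | tr '\t' '\n' | while IFS= read -r m; do
#       classify_launch_failure "$m"
#     done
```

A "capacity" result points at the provider running short of that instance type in that zone, while "quota" points at your own account limits. The two need different fixes.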

Step 8 — Check EC2 instance launch errors

aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=<your-asg-name>" \
           "Name=instance-state-name,Values=pending,running" \
  --query 'Reservations[*].Instances[*].{ID:InstanceId,State:State.Name,LaunchTime:LaunchTime}' \
  --output table

No new instances at all. Nothing even in pending state.

Step 9 — Check service quotas

aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query 'Quotas[?QuotaName==`Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances`]' \
  --output table

# Check current usage vs quota
aws cloudwatch get-metric-statistics \
  --namespace AWS/Usage \
  --metric-name ResourceCount \
  --dimensions Name=Type,Value=Resource Name=Resource,Value=vCPU Name=Service,Value=EC2 Name=Class,Value=Standard/OnDemand \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Maximum \
  --output table

We were sitting at 97% of our vCPU quota. The new instances were trying to launch but
hitting the account-level quota limit. The Auto Scaling group recorded the failures in its
activity history, but nothing surfaced them to Kubernetes or to our alerting.
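Turning the two numbers from those commands into a percentage is simple arithmetic, sketched here so it can be dropped into a cron job or pipeline step. `quota_utilization` is a hypothetical helper, it assumes a positive limit, and the values in the usage comment are illustrative rather than our real quota.

```shell
#!/usr/bin/env bash
# quota_utilization: print used/limit as a whole-number percentage (rounded
# down) and return non-zero when it reaches the alert threshold.
# Hypothetical helper; awk is used because the CLI may return values like
# "256.0" that shell integer arithmetic rejects. Assumes limit > 0.
quota_utilization() {
  local used="$1" limit="$2" threshold="${3:-80}"
  local pct
  pct=$(awk -v u="$used" -v l="$limit" 'BEGIN { printf "%d", (u * 100) / l }')
  echo "${pct}%"
  [ "$pct" -lt "$threshold" ]
}

# Usage with illustrative numbers:
#   quota_utilization 97 100 || echo "vCPU quota above threshold"
```

The third argument overrides the default threshold of 80, which matches the alert level discussed later in this post.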

Step 10 — Check VM extension installation logs via AWS Systems Manager

aws ssm list-command-invocations \
  --filters key=Status,value=Failed \
  --details \
  --query 'CommandInvocations[*].{Instance:InstanceId,Command:CommandId,Status:Status,Output:CommandPlugins[0].Output}' \
  --output table

On the instances that did manage to launch (from a previous attempt), a VM extension —
specifically a monitoring agent — was failing to install. This caused the node bootstrap process to hang,
so nodes never registered with the Kubernetes control plane.

🔴 Root cause confirmed: two compounding failures.

vCPU quota at the AWS account level was nearly exhausted — new instances failed to launch

VM extension installation failure caused the bootstrap script to hang on instances that did launch,
preventing them from joining the cluster

Neither failure produced an alert inside Kubernetes. The control plane was completely unaware.

Hour 3: The Fix

Immediate Fix — Request Quota Increase

aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --desired-value 512

While waiting for the quota increase (which can take time), we manually terminated a few underutilized
instances in other environments to free up vCPU headroom. That got new nodes launching.

Fix the VM Extension Failure

# SSH into a partially joined node and check bootstrap logs
journalctl -u kubelet --no-pager -n 100

# Check cloud-init logs
tail -n 50 /var/log/cloud-init-output.log

The agent install script was hitting a private endpoint that wasn’t reachable from the new subnet
due to a missing VPC endpoint. We updated the security group rules and the node came up cleanly.
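One way to stop a hung installer from blocking a node is to give non-critical bootstrap steps a time limit. The following is a sketch under assumptions, not the change we made: `run_optional_step` is a hypothetical wrapper, the script path is a placeholder, and whether your monitoring agent counts as optional is a policy decision for your team.

```shell
#!/usr/bin/env bash
# run_optional_step: run a non-critical bootstrap step with a time limit so a
# hung installer cannot block the node from joining the cluster.
# Hypothetical helper; relies on `timeout` from GNU coreutils.
run_optional_step() {
  local limit="$1"; shift
  if timeout "$limit" "$@"; then
    echo "optional step succeeded: $*"
  else
    # Exit code 124 means the time limit was hit; anything else is a failure
    # of the step itself. Either way, bootstrap carries on.
    echo "WARNING: optional step failed or timed out (exit $?): $*" >&2
  fi
  return 0
}

# Usage in user data, before the kubelet bootstrap runs. The path below is a
# placeholder, not the agent we were installing:
#   run_optional_step 120 /opt/bootstrap/install-monitoring-agent.sh
```

The trade-off is that a node can now join without the agent, so the warning on stderr needs to land somewhere you actually look.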

Force Re-registration of Stuck Nodes

# Delete nodes whose STATUS does not start with Ready (NotReady/Unknown).
# xargs -r: do nothing when no node matches.
kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ {print $1}' | xargs -r kubectl delete node

# Verify new nodes join
watch kubectl get nodes

Within 10 minutes of the fixes, 5 new nodes joined, pending pods were scheduled,
and deployments resumed.

What We Put in Place After

1. Grafana Dashboard for AWS Quota Utilization

Use CloudWatch metrics via the CloudWatch datasource in Grafana:

{
  "metrics": [
    [ "AWS/Usage", "ResourceCount",
      "Type", "Resource",
      "Resource", "vCPU",
      "Service", "EC2",
      "Class", "Standard/OnDemand" ]
  ],
  "period": 300,
  "stat": "Maximum",
  "title": "EC2 vCPU Quota Utilization"
}

Set an alert at 80% utilization — not 97%.

2. Alert on Node Provisioning Failures via CloudWatch

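# Alarm when the group stays below its desired capacity.
# Requires group metrics to be enabled on the ASG first:
#   aws autoscaling enable-metrics-collection --auto-scaling-group-name <your-asg-name> --granularity 1Minute
aws cloudwatch put-metric-alarm \
  --alarm-name "ASG-Launch-Failures" \
  --alarm-description "Alert when Auto Scaling Group stays below desired capacity" \
  --metrics '[
    {"Id":"gap","Expression":"desired - inservice","Label":"UnfilledCapacity","ReturnData":true},
    {"Id":"desired","ReturnData":false,"MetricStat":{"Period":300,"Stat":"Maximum","Metric":{
      "Namespace":"AWS/AutoScaling","MetricName":"GroupDesiredCapacity",
      "Dimensions":[{"Name":"AutoScalingGroupName","Value":"<your-asg-name>"}]}}},
    {"Id":"inservice","ReturnData":false,"MetricStat":{"Period":300,"Stat":"Minimum","Metric":{
      "Namespace":"AWS/AutoScaling","MetricName":"GroupInServiceInstances",
      "Dimensions":[{"Name":"AutoScalingGroupName","Value":"<your-asg-name>"}]}}}
  ]' \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 2 \
  --alarm-actions <your-sns-topic-arn>
# Two 5-minute periods, so a normal scale-up does not trip the alarm.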

3. Monitor Pending Pods by Reason, Not Just Count

kubectl get pods --all-namespaces --field-selector=status.phase=Pending \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="PodScheduled")].reason}{"\n"}{end}'

A PodScheduled reason of Unschedulable for more than 5 minutes should trigger an alert.
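The "more than 5 minutes" part needs a timestamp as well as a reason. One possible sketch: extend the jsonpath with the condition's `lastTransitionTime` and filter on age. `stale_unschedulable` is a hypothetical helper, and it assumes GNU `date` for parsing the timestamps.

```shell
#!/usr/bin/env bash
# stale_unschedulable: read "namespace<TAB>pod<TAB>reason<TAB>timestamp" lines
# on stdin and print pods that have been Unschedulable longer than max_age
# seconds. Hypothetical helper; needs GNU date for `-d`.
stale_unschedulable() {
  local max_age="${1:-300}"
  local now="${2:-$(date -u +%s)}"   # overridable so the logic can be tested
  local ns pod reason ts since age
  while IFS=$'\t' read -r ns pod reason ts; do
    [ "$reason" = "Unschedulable" ] || continue
    since=$(date -u -d "$ts" +%s) || continue   # skip unparseable timestamps
    age=$(( now - since ))
    if [ "$age" -gt "$max_age" ]; then
      echo "$ns/$pod unschedulable for ${age}s"
    fi
  done
  return 0
}

# Usage: same query as above, with lastTransitionTime added as a 4th column.
#   kubectl get pods -A --field-selector=status.phase=Pending \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="PodScheduled")].reason}{"\t"}{.status.conditions[?(@.type=="PodScheduled")].lastTransitionTime}{"\n"}{end}' \
#     | stale_unschedulable 300
```

Any output at all is a candidate alert condition. How you route it (cron plus a webhook, a metrics exporter, and so on) depends on what you already run.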

4. Pre-Deployment Capacity Check in the Pipeline

Add this as a step in your GitHub Actions workflow before deploying:

- name: Check cluster capacity
  run: |
    PENDING=$(kubectl get pods --all-namespaces --field-selector=status.phase=Pending \
      --no-headers | wc -l)

    if [ "$PENDING" -gt 5 ]; then
      echo "ERROR: $PENDING pods are pending. Cluster may not have capacity."
      echo "Run: kubectl get pods -A | grep Pending"
      exit 1
    fi

    echo "Cluster capacity check passed. Pending pods: $PENDING"

5. Monitor Node Registration Time

New nodes should join within a defined window. Add this check:

# List nodes currently in NotReady state. AGE is time since the node object was
# created, not time spent NotReady, so treat it as an upper bound.
kubectl get nodes --no-headers | grep NotReady | while read -r name status roles age version; do
  echo "Node $name is NotReady (node age: $age)"
done

6. IaC Quota Guardrail with Terraform

resource "aws_autoscaling_group" "eks_nodes" {
  max_size = var.asg_max_size

  lifecycle {
    precondition {
      condition     = var.asg_max_size * var.instance_vcpus <= var.vcpu_quota_limit * 0.8
      error_message = "ASG max size would exceed 80% of vCPU quota. Increase quota first."
    }
  }
}
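To sanity-check the numbers before running a plan, the same inequality can be evaluated in the shell. This is only a mirror of the precondition above for quick arithmetic: `asg_within_quota` is a hypothetical helper, and the values in the usage comment are made up.

```shell
#!/usr/bin/env bash
# asg_within_quota: evaluate max_size * instance_vcpus <= quota * 0.8, the
# same condition as the Terraform precondition. Hypothetical helper; prints
# the two sides of the inequality and returns 0 only when it holds.
asg_within_quota() {
  local max_size="$1" instance_vcpus="$2" quota="$3"
  awk -v n="$max_size" -v c="$instance_vcpus" -v q="$quota" 'BEGIN {
    need = n * c
    allowed = q * 0.8
    printf "needs %d vCPUs, allowed %.1f\n", need, allowed
    exit !(need <= allowed)
  }'
}

# Usage with made-up values (10 nodes, 4 vCPUs each, quota of 64):
#   asg_within_quota 10 4 64 || echo "raise the quota before raising max_size"
```

Note that both this check and the precondition look at a single Auto Scaling group. Other instances in the account draw on the same quota, so the real headroom can be smaller than the check suggests.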

7. Updated Runbook — Cross-Layer Debugging Checklist

When pods are stuck in Pending, follow this layered approach:

Layer 1 — Kubernetes
  kubectl describe pod <pod> → check Events section
  kubectl get events -n <ns> --sort-by='.lastTimestamp'
  kubectl top nodes

Layer 2 — Cluster Autoscaler
  kubectl logs -n kube-system -l app=cluster-autoscaler --tail=500 | grep -i error
  kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

Layer 3 — AWS Auto Scaling Group
  aws autoscaling describe-scaling-activities --auto-scaling-group-name <asg>
  Check: capacity errors, configuration errors, quota errors

Layer 4 — EC2 & Quotas
  aws service-quotas list-service-quotas --service-code ec2
  aws cloudwatch get-metric-statistics (vCPU usage vs quota)

Layer 5 — Node Bootstrap
  journalctl -u kubelet (on the node via SSM)
  cat /var/log/cloud-init-output.log
  Check: VPC endpoints, security groups, agent install scripts

Key Lessons

The control plane being healthy does not mean compute is healthy.
Kubernetes only knows about nodes that have successfully joined. It has no visibility
into provisioning failures at the cloud layer.

The Cluster Autoscaler does not know why instances failed to launch.
It requested the scale-up and moved on. It is not responsible for cloud-side provisioning errors.

A green dashboard at the wrong layer is dangerous.
Your Kubernetes dashboard, your pipeline UI, and your APM tool can all look healthy
while the underlying platform is failing.

Silent failures are more expensive than noisy ones.
A loud error is fixed in minutes. A silent failure burns 3 hours of engineering time
before anyone finds it.

Cross-layer observability is not optional.
You need alerts at every layer: cloud quota → ASG activity → node provisioning →
node readiness → pod scheduling → pipeline queue time. One missing layer is one
incident waiting to happen.

Summary

Layer	What Happened	Tool to Check
AWS vCPU Quota	Hit account limit — no new EC2 instances	`aws service-quotas` + CloudWatch
EC2 Auto Scaling	Launch requests silently failed	`aws autoscaling describe-scaling-activities`
VM Extension	Bootstrap script hung — nodes never joined	SSM Session Manager + cloud-init logs
Kubernetes	Scheduler couldn’t place pods — no nodes available	`kubectl describe pod` + `kubectl get events`
CI/CD Pipeline	Jobs timed out waiting for capacity	Pipeline logs + queue-time monitoring

Reading the table top to bottom, the problem started in the first row (AWS quota). Kubernetes only
saw the effect in the fourth row, and that is where we started debugging. That’s 3 lost hours.

What Do You Think?

Have you ever debugged an incident where every individual layer looked healthy,
but the end-to-end service was failing?

Drop your war story in the comments — what was the hidden layer nobody was watching?

Tags:
#Kubernetes #DevOps #AWS #SRE
#IncidentManagement #Observability #EKS #ClusterAutoscaler