By KP | TZoneLabs | DevOps & Cloud Engineering
We were in the middle of a production scaling event when everything went quiet — in the worst way possible.
No crash. No alert. No obvious Kubernetes error. Just pods stuck in Pending, GitHub Actions
deployment jobs timing out, and the entire team staring at dashboards that were still showing green.
It took us 3 hours to find the root cause. The fix took 15 minutes.
This post walks you through exactly what happened, every command we ran, where we went wrong,
and what we put in place so it never silently fails again.
What Happened
Our application was experiencing a spike in traffic. The Kubernetes Cluster Autoscaler kicked in
and requested new worker nodes from the cloud provider. From Kubernetes’ perspective, everything looked normal:
- The control plane was healthy
- Existing nodes were running fine
- Autoscaler logs showed scaling requests being made
But the new nodes never joined the cluster.
Meanwhile, our CI/CD pipelines kept triggering deployments. New pods were scheduled but had nowhere to run.
They sat in Pending indefinitely. Pipelines timed out. Engineers started investigating —
in all the wrong places.
Hour 1: Looking in the Wrong Places
When pods go Pending, the first instinct is to look inside Kubernetes.
That’s where we wasted an hour.
Step 1 — Check pod status
kubectl get pods --all-namespaces | grep -i pending
We had a growing list of pods in Pending. So we described one:
kubectl describe pod <pod-name> -n <namespace>
The event section showed:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 5m default-scheduler 0/3 nodes are available: 3 Insufficient cpu.
So the scheduler couldn’t find a node with enough CPU. Our first assumption: the pods had
too-high resource requests, or we had a resource leak.
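With dozens of pending pods, describing them one at a time is slow. Here is a minimal sketch of how the scheduler messages could be tallied so the dominant cause stands out. The function name `summarize_scheduling_failures` and the "Insufficient memory" sample line are our own illustration and were not part of the incident.

```shell
# Count identical FailedScheduling messages, most frequent first.
# Reads one event message per line on stdin; prints "<count> <message>".
summarize_scheduling_failures() {
  sort | uniq -c | sort -rn |
    awk '{ n = $1; sub(/^[ \t]*[0-9]+[ \t]+/, ""); print n, $0 }'
}

# Canned input so the sketch runs without a cluster.
printf '%s\n' \
  '0/3 nodes are available: 3 Insufficient cpu.' \
  '0/3 nodes are available: 3 Insufficient cpu.' \
  '0/3 nodes are available: 3 Insufficient memory.' |
  summarize_scheduling_failures

# Against a live cluster the input could come from:
#   kubectl get events -A --field-selector reason=FailedScheduling \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}'
```

If one message accounts for nearly every pending pod, the cause is more likely cluster-wide capacity than a single misconfigured workload.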
Step 2 — Check node resource usage
kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-0-1-100.ec2.internal 1850m 92% 6Gi 85%
ip-10-0-1-101.ec2.internal 1780m 89% 5.8Gi 82%
ip-10-0-1-102.ec2.internal 1900m 95% 7Gi 90%
Nodes were saturated. But we’d already seen the autoscaler requesting new nodes. So where were they?
Step 3 — Check autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100
The autoscaler logs showed it had requested scale-up events:
I0612 08:14:22.112233 1 scale_up.go:453] Scale-up: setting group NodeGroup/aws:///us-east-1a/ng-xxxxx size to 5
I0612 08:14:22.115344 1 factory.go:33] Event(v1.ObjectReference...): type: 'Normal' reason: 'ScaledUpGroup'
Scale-up was requested. But 20 minutes later, no new nodes.
Step 4 — Check node status
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP
ip-10-0-1-100.ec2.internal Ready <none> 3d v1.28.0 10.0.1.100
ip-10-0-1-101.ec2.internal Ready <none> 3d v1.28.0 10.0.1.101
ip-10-0-1-102.ec2.internal Ready <none> 3d v1.28.0 10.0.1.102
Only 3 nodes — the same 3 that were there before the scale event. Nothing new had joined.
Step 5 — Check cluster events
kubectl get events -n kube-system --sort-by='.lastTimestamp' | tail -30
Nothing obvious. Some FailedScheduling warnings but no provisioning errors inside Kubernetes.
That’s when we should have stopped looking inside the cluster. We didn’t — not for another 40 minutes.
We checked Helm values, deployment manifests, network policies, pod disruption budgets. All clean.
Hour 2: Moving Up the Stack
We finally shifted focus outside Kubernetes and started looking at the cloud layer.
Step 6 — Check the Auto Scaling Group in AWS
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names <your-asg-name> \
--query 'AutoScalingGroups[*].{Min:MinSize,Max:MaxSize,Desired:DesiredCapacity,Instances:Instances[*].InstanceId}' \
--output table
The Desired capacity had been updated to 5 (autoscaler had done its job), but only 3 instances were listed.
Something was preventing the new instances from launching.
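The gap between desired capacity and the number of launched instances is the signal we missed for two hours. A small sketch of that comparison follows, assuming the two numbers have already been fetched. The helper name `asg_shortfall` is hypothetical.

```shell
# Report how many instances the ASG has requested but not launched.
# $1 = desired capacity, $2 = number of instances currently in the group.
# Returns 1 when there is a shortfall, 0 otherwise.
asg_shortfall() {
  desired=$1
  actual=$2
  gap=$(( desired - actual ))
  if [ "$gap" -gt 0 ]; then
    echo "shortfall: $gap instance(s) requested but not launched"
    return 1
  fi
  echo "ok: desired capacity is met"
}

# Numbers from this incident: desired 5, only 3 instances present.
asg_shortfall 5 3 || true

# The inputs could be read from the AWS CLI, for example:
#   aws autoscaling describe-auto-scaling-groups \
#     --auto-scaling-group-names <your-asg-name> \
#     --query 'AutoScalingGroups[0].[DesiredCapacity,length(Instances)]' \
#     --output text
```

A shortfall that persists for more than a few minutes is worth alerting on, because it is visible well before pods start timing out.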
Step 7 — Check Auto Scaling Group activity history
aws autoscaling describe-scaling-activities \
--auto-scaling-group-name <your-asg-name> \
--max-items 10 \
--output table
This is where we found the first real clue:
⚠️ StatusCode: Failed
StatusMessage: We currently do not have sufficient capacity for the instance type you requested…
⚠️ StatusCode: Failed
StatusMessage: The requested configuration is currently not supported…
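Failed activities fall into a few broad classes, and each one points to a different fix. The sketch below sorts a StatusMessage into one of them. The function name `classify_asg_failure` is hypothetical, and the match patterns are assumptions based on common EC2 error wording, so they may need adjusting for your account.

```shell
# Map an ASG activity StatusMessage to a coarse failure class.
# Patterns are assumptions, not an exhaustive list of EC2 errors.
classify_asg_failure() {
  case "$1" in
    *"sufficient capacity"*|*InsufficientInstanceCapacity*)
      echo "capacity" ;;
    *"not supported"*|*Unsupported*)
      echo "configuration" ;;
    *VcpuLimitExceeded*|*"vCPU limit"*|*InstanceLimitExceeded*)
      echo "quota" ;;
    *)
      echo "unknown" ;;
  esac
}

# The two messages we saw in the activity history:
classify_asg_failure "We currently do not have sufficient capacity for the instance type you requested"
classify_asg_failure "The requested configuration is currently not supported"
```

A capacity failure usually calls for more instance types or Availability Zones, a configuration failure for a launch template review, and a quota failure for a limit increase.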
Step 8 — Check EC2 instance launch errors
aws ec2 describe-instances \
--filters "Name=tag:aws:autoscaling:groupName,Values=<your-asg-name>" \
"Name=instance-state-name,Values=pending,running" \
--query 'Reservations[*].Instances[*].{ID:InstanceId,State:State.Name,LaunchTime:LaunchTime}' \
--output table
No new instances at all. Nothing even in pending state.
Step 9 — Check service quotas
aws service-quotas list-service-quotas \
--service-code ec2 \
--query 'Quotas[?QuotaName==`Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances`]' \
--output table
# Check current usage vs quota
aws cloudwatch get-metric-statistics \
--namespace AWS/Usage \
--metric-name ResourceCount \
--dimensions Name=Type,Value=Resource Name=Resource,Value=vCPU Name=Service,Value=EC2 Name=Class,Value=Standard/OnDemand \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Maximum \
--output table
We were sitting at 97% of our vCPU quota. The new instances were trying to launch but
hitting the account-level quota limit. The EC2 Auto Scaling group silently failed to launch them.
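The two commands above return usage and limit separately, so the percentage still has to be worked out. Here is a minimal sketch of that arithmetic. The helper names `quota_pct` and `quota_over_threshold` and the sample figures (485 of 500 vCPUs) are hypothetical, chosen only to land on the 97% we observed.

```shell
# Print quota utilization as a percentage with one decimal place.
# $1 = current usage (vCPUs), $2 = quota limit (vCPUs).
quota_pct() {
  awk -v used="$1" -v limit="$2" 'BEGIN {
    if (limit <= 0) { print "invalid limit" > "/dev/stderr"; exit 2 }
    printf "%.1f\n", used * 100 / limit
  }'
}

# Exit status 0 when utilization is at or above the threshold percentage.
# $1 = usage, $2 = limit, $3 = threshold in percent.
quota_over_threshold() {
  pct=$(quota_pct "$1" "$2") || return 2
  awk -v p="$pct" -v t="$3" 'BEGIN { exit !(p >= t) }'
}

# Hypothetical figures: 485 of 500 vCPUs in use.
quota_pct 485 500
if quota_over_threshold 485 500 80; then
  echo "WARN: vCPU usage is at or above 80% of quota"
fi
```

Running a check like this on a schedule would have flagged the problem days before the scale-up, when fixing it was still a routine ticket.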
Step 10 — Check VM extension installation logs via AWS Systems Manager
aws ssm list-command-invocations \
--filters key=Status,value=Failed \
--details \
--query 'CommandInvocations[*].{Instance:InstanceId,Command:CommandId,Status:Status,Output:CommandPlugins[0].Output}' \
--output table
On the instances that did manage to launch (from a previous attempt), a VM extension —
specifically a monitoring agent — was failing to install. This caused the node bootstrap process to hang,
so nodes never registered with the Kubernetes control plane.
🔴 Root cause confirmed: two compounding failures.
- vCPU quota at the AWS account level was nearly exhausted — new instances failed to launch
- VM extension installation failure caused the bootstrap script to hang on instances that did launch,
preventing them from joining the cluster
Neither failure produced an alert inside Kubernetes. The control plane was completely unaware.
Hour 3: The Fix
Immediate Fix — Request Quota Increase
aws service-quotas request-service-quota-increase \
--service-code ec2 \
--quota-code L-1216C47A \
--desired-value 512
While waiting for the quota increase (which can take time), we manually terminated a few underutilized
instances in other environments to free up vCPU headroom. That got new nodes launching.
Fix the VM Extension Failure
# SSH into a partially joined node and check bootstrap logs
journalctl -u kubelet --no-pager -n 100
# Check cloud-init logs
tail -n 50 /var/log/cloud-init-output.log
The agent install script was hitting a private endpoint that wasn’t reachable from the new subnet
due to a missing VPC endpoint. We updated the security group rules and the node came up cleanly.
Force Re-registration of Stuck Nodes
# Delete nodes whose STATUS is NotReady (this also covers an Unknown Ready condition)
kubectl get nodes --no-headers | awk '$2 ~ /^NotReady/ {print $1}' | xargs -r kubectl delete node
# Verify new nodes join
watch kubectl get nodes
Within 10 minutes of the fixes, 5 new nodes joined, pending pods were scheduled,
and deployments resumed.
What We Put in Place After
1. Grafana Dashboard for AWS Quota Utilization
Use CloudWatch metrics via the CloudWatch datasource in Grafana:
{
"metrics": [
[ "AWS/Usage", "ResourceCount",
"Type", "Resource",
"Resource", "vCPU",
"Service", "EC2",
"Class", "Standard/OnDemand" ]
],
"period": 300,
"stat": "Maximum",
"title": "EC2 vCPU Quota Utilization"
}
Set an alert at 80% utilization — not 97%.
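2. Alert on Node Provisioning Failures via EventBridge
As far as we can tell, Auto Scaling does not publish a failed-launch metric in the AWS/AutoScaling namespace, so a CloudWatch alarm on one would never fire. Match the launch-failure event in EventBridge instead:
# Match failed instance launches for this Auto Scaling group
aws events put-rule \
--name "ASG-Launch-Failures" \
--description "Alert when Auto Scaling Group fails to launch instances" \
--event-pattern '{"source":["aws.autoscaling"],"detail-type":["EC2 Instance Launch Unsuccessful"],"detail":{"AutoScalingGroupName":["<your-asg-name>"]}}'
# Send matching events to SNS (the topic policy must allow events.amazonaws.com to publish)
aws events put-targets \
--rule "ASG-Launch-Failures" \
--targets "Id"="1","Arn"="<your-sns-topic-arn>"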
3. Monitor Pending Pods by Reason, Not Just Count
kubectl get pods --all-namespaces --field-selector=status.phase=Pending \
-o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.conditions[?(@.type=="PodScheduled")].reason}{"\n"}{end}'
A PodScheduled reason of Unschedulable for more than 5 minutes should trigger an alert.
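The five-minute rule needs the time of the condition change as well as the reason. Below is a sketch of how that filter could look, assuming GNU date and an extra `lastTransitionTime` column in the input. The function name `stale_unschedulable` and the sample pod names are hypothetical.

```shell
# Print pods whose PodScheduled condition has been Unschedulable for longer
# than a threshold. Reads "<namespace> <pod> <reason> <lastTransitionTime>"
# lines on stdin. $1 = current time (epoch seconds), $2 = threshold (seconds).
# Assumes GNU date for parsing RFC 3339 timestamps.
stale_unschedulable() {
  now=$1
  limit=$2
  while read -r ns pod reason since; do
    # Skip rows with a different or missing reason.
    [ "$reason" = "Unschedulable" ] || continue
    started=$(date -u -d "$since" +%s) || continue
    age=$(( now - started ))
    if [ "$age" -gt "$limit" ]; then
      echo "$ns/$pod unschedulable for ${age}s"
    fi
  done
}

# Canned input with a fixed "now" so the sketch runs without a cluster.
now=$(date -u -d '2024-06-12T08:20:00Z' +%s)
printf '%s\n' \
  'prod api-7d9f Unschedulable 2024-06-12T08:10:00Z' \
  'prod web-5c4b Unschedulable 2024-06-12T08:18:00Z' |
  stale_unschedulable "$now" 300

# Live input: append this to the jsonpath template before {"\n"}:
#   {" "}{.status.conditions[?(@.type=="PodScheduled")].lastTransitionTime}
```

Only the first sample pod is reported, since the second has been unschedulable for two minutes and is still inside the window.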
4. Pre-Deployment Capacity Check in the Pipeline
Add this as a step in your GitHub Actions workflow before deploying:
- name: Check cluster capacity
run: |
PENDING=$(kubectl get pods --all-namespaces --field-selector=status.phase=Pending \
--no-headers | wc -l)
if [ "$PENDING" -gt 5 ]; then
echo "ERROR: $PENDING pods are pending. Cluster may not have capacity."
echo "Run: kubectl get pods -A | grep Pending"
exit 1
fi
echo "Cluster capacity check passed. Pending pods: $PENDING"
5. Monitor Node Registration Time
New nodes should join within a defined window. Add this check:
# List NotReady nodes with their age (AGE is time since the node was created, not time spent NotReady)
kubectl get nodes --no-headers | grep NotReady | while read name status roles age version; do
echo "Node $name is NotReady (node age: $age)"
done
6. IaC Quota Guardrail with Terraform
resource "aws_autoscaling_group" "eks_nodes" {
max_size = var.asg_max_size
lifecycle {
precondition {
condition = var.asg_max_size * var.instance_vcpus <= var.vcpu_quota_limit * 0.8
error_message = "ASG max size would exceed 80% of vCPU quota. Increase quota first."
}
}
}
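The precondition is plain arithmetic, so it is easy to sanity-check outside Terraform before changing a variable. Here is a sketch of the same inequality. The function name `guardrail_ok` and all numbers are hypothetical.

```shell
# Return 0 when max_size * vcpus_per_instance fits within 80% of the quota.
# $1 = ASG max size, $2 = vCPUs per instance, $3 = vCPU quota limit.
guardrail_ok() {
  awk -v n="$1" -v v="$2" -v q="$3" 'BEGIN { exit !(n * v <= q * 0.8) }'
}

# 10 nodes x 4 vCPUs = 40, against 80% of 64 = 51.2.
if guardrail_ok 10 4 64; then echo "within guardrail"; else echo "exceeds guardrail"; fi

# 20 nodes x 4 vCPUs = 80, which is above 51.2.
if guardrail_ok 20 4 64; then echo "within guardrail"; else echo "exceeds guardrail"; fi
```

Note that this check, like the Terraform precondition, only considers one Auto Scaling group. Other instances in the same account and region draw on the same quota, so the real headroom may be smaller.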
7. Updated Runbook — Cross-Layer Debugging Checklist
When pods are stuck in Pending, follow this layered approach:
Layer 1 — Kubernetes
kubectl describe pod <pod> → check Events section
kubectl get events -n <ns> --sort-by='.lastTimestamp'
kubectl top nodes
Layer 2 — Cluster Autoscaler
kubectl logs -n kube-system -l app=cluster-autoscaler | grep -i error
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
Layer 3 — AWS Auto Scaling Group
aws autoscaling describe-scaling-activities --auto-scaling-group-name <asg>
Check: capacity errors, configuration errors, quota errors
Layer 4 — EC2 & Quotas
aws service-quotas list-service-quotas --service-code ec2
aws cloudwatch get-metric-statistics (vCPU usage vs quota)
Layer 5 — Node Bootstrap
journalctl -u kubelet (on the node via SSM)
cat /var/log/cloud-init-output.log
Check: VPC endpoints, security groups, agent install scripts
Key Lessons
- The control plane being healthy does not mean compute is healthy.
Kubernetes only knows about nodes that have successfully joined. It has no visibility
into provisioning failures at the cloud layer.
- The Cluster Autoscaler does not know why instances failed to launch.
It requested the scale-up and moved on. It is not responsible for cloud-side provisioning errors.
- A green dashboard at the wrong layer is dangerous.
Your Kubernetes dashboard, your pipeline UI, and your APM tool can all look healthy
while the underlying platform is failing.
- Silent failures are more expensive than noisy ones.
A loud error is fixed in minutes. A silent failure burns 3 hours of engineering time
before anyone finds it.
- Cross-layer observability is not optional.
You need alerts at every layer: cloud quota → ASG activity → node provisioning →
node readiness → pod scheduling → pipeline queue time. One missing layer is one
incident waiting to happen.
Summary
| Layer | What Happened | Tool to Check |
|---|---|---|
| AWS vCPU Quota | Hit account limit — no new EC2 instances | aws service-quotas + CloudWatch |
| EC2 Auto Scaling | Launch requests silently failed | aws autoscaling describe-scaling-activities |
| VM Extension | Bootstrap script hung — nodes never joined | SSM Session Manager + cloud-init logs |
| Kubernetes | Scheduler couldn’t place pods — no nodes available | kubectl describe pod + kubectl get events |
| CI/CD Pipeline | Jobs timed out waiting for capacity | Pipeline logs + queue-time monitoring |
The problem started in the first row of this table (AWS quota). Kubernetes only saw the effect in the fourth row,
and that is where we started debugging. That’s 3 lost hours.
What Do You Think?
Have you ever debugged an incident where every individual layer looked healthy,
but the end-to-end service was failing?
Drop your war story in the comments — what was the hidden layer nobody was watching?
Tags:
#Kubernetes #DevOps #AWS #SRE
#IncidentManagement #Observability #EKS #ClusterAutoscaler