During a Production Failure, the Real Issue Is Often Not Where the Error Is Showing

By KP  |  TZoneLabs  |  DevOps & Cloud Engineering

Here is something nobody tells you when you start in DevOps: the error message is almost never the problem.
It is just the messenger. A pod crashes — you blame the application. An API times out — you blame the network.
A deployment fails — you blame the pipeline. But after 10 years of production incidents, I have learned that
the layer where the failure shows up and the layer where the failure starts are usually
two completely different things.

This post covers how to debug across the full system path — from user request all the way down to database,
IAM, and CI/CD — with real commands for every layer, two production incident walkthroughs, and a cross-layer
runbook you can use during your next outage.

The Core Problem: Every Layer Can Look Healthy

Modern cloud platforms are made of many independent layers. The problem is that each layer reports
its own health independently. None of them can see the full picture.

Consider these real examples:

What You See              | What Actually Caused It
Pod keeps crashing        | Database ran out of connections
API response time spiked  | DNS resolution failure causing retries
Deployment failed         | Cloud subscription hit a quota limit
Pipeline looks broken     | Kubernetes cluster has no schedulable capacity
Application is “down”     | Expired TLS certificate on the ingress
Service returns 503       | Load balancer target group health check misconfigured
Pods not starting         | ImagePullBackOff due to IAM role missing ECR permissions

In every case, the dashboard at the layer where the symptom shows up looks “almost healthy.”
You need to trace the full path to find where it actually broke.

The Full System Path

Before we look at commands, understand the chain every user request travels through:

User Request
    ↓
DNS Resolution
    ↓
CDN / WAF (CloudFront, Cloudflare, AWS WAF)
    ↓
Load Balancer (ALB / NLB)
    ↓
Ingress Controller / API Gateway
    ↓
Service Mesh (Istio / Linkerd) — if applicable
    ↓
Kubernetes Service
    ↓
Pod (Application Container)
    ↓
Secrets & Config (ConfigMaps, Secrets, SSM Parameter Store)
    ↓
Database / Cache / Queue (RDS, ElastiCache, SQS)
    ↓
IAM & Network Path (Roles, Security Groups, VPC)
    ↓
CI/CD & Rollback (GitHub Actions, ArgoCD, Helm)
    ↓
Observability & Alerting (Prometheus, Grafana, CloudWatch)

When a failure happens, most engineers jump straight to the pod logs. That works maybe 30% of the time.
The other 70% — the real cause is somewhere else in this chain.

Layer-by-Layer Debugging Guide

Layer 1: DNS Resolution

DNS failures are silent killers. They cause timeouts and connection errors that look like application bugs.

# Spawn a debug pod with network tools
kubectl run debug-dns --image=busybox --restart=Never --rm -it -- sh

# Inside the pod — resolve your service
nslookup my-service.default.svc.cluster.local
nslookup google.com

# Check DNS config inside the pod
cat /etc/resolv.conf

# Exit the debug pod, then check that the CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
kubectl logs -n kube-system -l k8s-app=kube-dns | grep -i "error\|fail\|timeout"

# Check CoreDNS configmap for misconfiguration
kubectl get configmap coredns -n kube-system -o yaml

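⚠️ Common Issue: DNS resolution loops. If CoreDNS forwards to /etc/resolv.conf and the
node’s copy points to a loopback resolver (for example 127.0.0.53 from systemd-resolved), CoreDNS ends up
forwarding queries to itself. With the loop plugin enabled in the Corefile, CoreDNS detects this, logs a
"Loop ... detected" error, and exits, so its pods land in CrashLoopBackOff. Without the plugin, the loop
shows up as intermittent DNS failures under load.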

# Check DNS from outside the cluster
dig api.yourdomain.com
nslookup api.yourdomain.com 8.8.8.8

# Check TTL (second column of each answer line, in seconds): is the record stale?
dig +noall +answer api.yourdomain.com
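
dig prints the TTL as the second column of each answer line rather than under a "TTL" label, so it helps to extract it explicitly. This is a small sketch; `min_ttl` is a hypothetical helper name, and it assumes the input comes from `dig +noall +answer`.

```shell
# Read `dig +noall +answer <name>` output on stdin and print the lowest TTL.
# Answer lines look like: api.example.com. 300 IN A 203.0.113.10
min_ttl() {
  awk 'NF >= 5 && $2 ~ /^[0-9]+$/ {
         if (min == "" || $2 + 0 < min) min = $2 + 0
       }
       END { if (min != "") print min }'
}

# Usage:
#   dig +noall +answer api.yourdomain.com | min_ttl
```

Running it twice a few seconds apart shows whether you are reading a cached answer: a cached record's TTL counts down between queries.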

Layer 2: CDN / WAF

# List CloudFront distributions with their status and first origin
aws cloudfront list-distributions \
  --query 'DistributionList.Items[*].{ID:Id,Domain:DomainName,Status:Status,Origins:Origins.Items[0].DomainName}' \
  --output table

# Check for WAF blocked requests (web ACLs with CLOUDFRONT scope are queried in us-east-1)
aws wafv2 get-sampled-requests \
  --region us-east-1 \
  --web-acl-arn <your-web-acl-arn> \
  --rule-metric-name <rule-name> \
  --scope CLOUDFRONT \
  --time-window StartTime=$(date -u -d '1 hour ago' +%s),EndTime=$(date -u +%s) \
  --max-items 100

⚠️ Note: WAF rules can silently block legitimate traffic during
deployments — especially if your new pods have a different IP range or your load test triggers rate limits.
Always check WAF metrics when you see a sudden drop in successful requests.

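# Check the CloudFront 5xx error rate (CloudFront metrics are published in us-east-1)
aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace AWS/CloudFront \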
  --metric-name 5xxErrorRate \
  --dimensions Name=DistributionId,Value=<your-dist-id> Name=Region,Value=Global \
  --start-time $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average \
  --output table

Layer 3: Load Balancer (ALB / NLB)

This layer is where silent failures are extremely common. The load balancer looks healthy but the targets
behind it are not.

# List all target groups
aws elbv2 describe-target-groups --output table

# Check health of targets in a specific group
aws elbv2 describe-target-health \
  --target-group-arn <your-target-group-arn> \
  --output table

# Get detailed health reason
aws elbv2 describe-target-health \
  --target-group-arn <your-target-group-arn> \
  --query 'TargetHealthDescriptions[*].{Target:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Description:TargetHealth.Description}' \
  --output table

Reason                     | What It Means                                               | Fix
Target.FailedHealthChecks  | Health check connection failed or response was malformed    | Check /health endpoint on the pod
Target.NotRegistered       | Pod not registered with target group                        | Check AWS Load Balancer Controller logs
Target.Timeout             | Health check timing out                                     | Increase health check timeout or fix slow startup
Elb.InternalError          | Health checks failed because of an internal ALB error       | Check ALB access logs in S3

An unexpected status code from the health endpoint is reported separately, as Target.ResponseCodeMismatch.

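
During an incident it is quicker to see only the targets that are not healthy. The filter below is a sketch under assumptions: `unhealthy_targets` is a hypothetical helper, and it expects tab-separated rows of target ID, state, and reason, which is what the `--query` and `--output text` combination in the usage comment should produce.

```shell
# Read tab-separated "target-id<TAB>state<TAB>reason" rows on stdin and print
# only the targets that are not healthy.
# Exit status 0 means at least one unhealthy target was printed, 1 means none.
unhealthy_targets() {
  awk -F '\t' 'NF >= 2 && $2 != "healthy" { print $1 ": " $2 " (" $3 ")"; bad = 1 }
               END { exit (bad ? 0 : 1) }'
}

# Usage:
#   aws elbv2 describe-target-health --target-group-arn <your-target-group-arn> \
#     --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
#     --output text | unhealthy_targets
```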
# Download recent ALB logs from S3 (the date part of the prefix is yyyy/mm/dd)
aws s3 cp s3://<your-alb-logs-bucket>/AWSLogs/<account-id>/elasticloadbalancing/<region>/<date>/ \
  /tmp/alb-logs/ --recursive

# Parse logs for 5xx errors (the files are gzip-compressed; field 9 is elb_status_code)
gunzip -c /tmp/alb-logs/*.log.gz | awk '$9 >= 500' | head -20
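
A 5xx in the ALB log can come from the load balancer itself or from the target behind it, and that distinction tells you which layer to look at next. The sketch below assumes the standard ALB access log layout, where field 9 is elb_status_code and field 10 is target_status_code; `classify_5xx` is a hypothetical helper name.

```shell
# Read ALB access log lines on stdin and split 5xx responses by origin.
# A "-" in field 10 means the load balancer answered without a target response.
classify_5xx() {
  awk '$9 ~ /^5[0-9][0-9]$/ {
         if ($10 == "-") lb++; else target++
       }
       END { printf "lb_generated=%d target_generated=%d\n", lb, target }'
}

# Usage (ALB delivers access logs gzip-compressed):
#   gunzip -c /tmp/alb-logs/*.log.gz | classify_5xx
```

A high lb_generated count points at target registration, health checks, or connectivity; a high target_generated count points at the application or its dependencies.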

Layer 4: Ingress Controller

# Check Nginx Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

# Filter for upstream errors
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx | \
  grep -i "error\|upstream\|connect() failed\|no live upstreams"

# Check Ingress rules
kubectl get ingress --all-namespaces
kubectl describe ingress <ingress-name> -n <namespace>

Check TLS certificate expiry — a very common silent failure:

# Check certificate expiry on your domain
echo | openssl s_client -servername api.yourdomain.com \
  -connect api.yourdomain.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# Check cert-manager certificates inside the cluster
kubectl get certificate --all-namespaces
kubectl describe certificate <cert-name> -n <namespace>

# Check certificate expiry date directly
kubectl get secret <tls-secret-name> -n <namespace> -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -enddate

🔴 Critical: TLS certificate expiry causes a complete service outage
with no warning inside Kubernetes. The pod is running, the ingress is running, everything looks fine —
but every HTTPS request fails. Set up cert-manager with automatic renewal and always alert at 30 days
before expiry.
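
The 30-day threshold can also be checked by hand from the `openssl x509 -noout -enddate` output. This is a minimal sketch that assumes GNU date (for the `-d` option); `days_until_expiry` is a hypothetical helper name.

```shell
# Convert an `openssl x509 -noout -enddate` line (notAfter=...) into whole
# days remaining. The optional second argument is the reference time in epoch
# seconds and defaults to the current time.
days_until_expiry() {
  not_after="${1#notAfter=}"
  now="${2:-$(date -u +%s)}"
  expiry=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Usage:
#   end=$(echo | openssl s_client -servername api.yourdomain.com \
#     -connect api.yourdomain.com:443 2>/dev/null | openssl x509 -noout -enddate)
#   [ "$(days_until_expiry "$end")" -lt 30 ] && echo "certificate expires in under 30 days"
```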

Layer 5: Service Mesh (Istio)

If you are running Istio, a misconfigured VirtualService or DestinationRule can silently route traffic
to the wrong place or drop it entirely.

# Check all sidecars are synced
istioctl proxy-status

# Check Istio config for a specific namespace
istioctl analyze -n <namespace>

# Check Envoy proxy logs for a specific pod
kubectl logs <pod-name> -n <namespace> -c istio-proxy | tail -50

# Check VirtualService and DestinationRule
kubectl get virtualservice --all-namespaces
kubectl get destinationrule --all-namespaces
kubectl describe virtualservice <vs-name> -n <namespace>

# Check mTLS policy
kubectl get peerauthentication --all-namespaces

⚠️ Common Istio Trap: You enable STRICT mTLS globally but a Job or
CronJob runs without a sidecar. Its plaintext requests to meshed services are rejected by the destination
sidecar. The destination pod's logs show nothing because the connection never reached the application.
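
To find workloads that could hit this trap, list the pods that have no istio-proxy container. The filter below is a sketch: `pods_without_sidecar` is a hypothetical helper, and it assumes the sidecar appears under `.spec.containers` (clusters that run Istio's native sidecar mode place it under init containers instead, so adjust the jsonpath there).

```shell
# Read "namespace/pod<TAB>container names" rows on stdin and print the pods
# that have no container named istio-proxy.
pods_without_sidecar() {
  awk -F '\t' 'NF > 0 {
         n = split($2, names, " "); has = 0
         for (i = 1; i <= n; i++) if (names[i] == "istio-proxy") has = 1
         if (!has) print $1
       }'
}

# Usage:
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' \
#     | pods_without_sidecar
```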

Layer 6: Kubernetes Service and Pod

# Check service endpoints — are they populated?
kubectl get endpoints <service-name> -n <namespace>

# Compare service selector with pod labels
kubectl get service <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels

# Full pod description
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl top pod <pod-name> -n <namespace>

# Check for OOMKilled pods
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(any(.status.containerStatuses[]?; .lastState.terminated.reason == "OOMKilled")) | {name:.metadata.name,namespace:.metadata.namespace}'

# Check pod restart count
kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount' | tail -20

# Check probe failures
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Liveness\|Readiness"
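
Empty endpoints usually mean the Service selector does not match the pod labels. The comparison can be done by eye, or with a small check like the sketch below. `selector_matches` is a hypothetical helper; it assumes both arguments are comma-separated key=value lists, which is the format `--show-labels` prints.

```shell
# Return 0 when every key=value pair in the selector (first argument) is
# present in the pod's label list (second argument), 1 otherwise.
selector_matches() {
  selector="$1"
  labels=",$2,"
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;
      *) IFS=$old_ifs; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}

# Usage (values are illustrative):
#   selector_matches "app=web,tier=api" "app=web,pod-template-hash=abc12,tier=api" \
#     || echo "selector does not match this pod"
```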

Layer 7: Secrets and ConfigMaps

# Check if secrets exist
kubectl get secret --all-namespaces | grep <secret-name>

# Verify the secret reached the pod as an environment variable (this prints the value)
kubectl exec <pod-name> -n <namespace> -- env | grep -i <env-var-name>

# If using External Secrets Operator
kubectl get externalsecret --all-namespaces
kubectl describe externalsecret <secret-name> -n <namespace>

# Check when the SSM parameter last changed (metadata only; the value is not printed)
aws ssm get-parameter --name "/myapp/prod/db-password" \
  --query 'Parameter.{Name:Name,LastModified:LastModifiedDate,Version:Version}' \
  --output table

⚠️ Note: Secret rotation is one of the most dangerous silent failures.
The secret rotates but your application keeps using the value it read at startup. Nothing changes until a
pod restarts, picks up the new secret, and either works or breaks, often long after the rotation itself.
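
One way to tell whether a running pod still holds the old value, without pasting secrets into a terminal or an incident channel, is to compare short fingerprints. This is a sketch only: `secret_fingerprint` is a hypothetical helper, and the environment variable name and secret key in the usage comment are placeholders.

```shell
# Print the first 12 hex characters of the SHA-256 digest of a value, so two
# copies of a secret can be compared without showing the secret itself.
secret_fingerprint() {
  printf '%s' "$1" | sha256sum | cut -c 1-12
}

# Usage (DB_PASSWORD and the "password" key are placeholders):
#   in_pod=$(kubectl exec <pod-name> -n <namespace> -- printenv DB_PASSWORD)
#   in_secret=$(kubectl get secret <secret-name> -n <namespace> \
#     -o jsonpath='{.data.password}' | base64 -d)
#   echo "pod:    $(secret_fingerprint "$in_pod")"
#   echo "secret: $(secret_fingerprint "$in_secret")"
```

Different fingerprints mean the pod is running with a value that no longer matches the Secret object.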

Layer 8: Database / Cache / Queue

# Test DB connectivity from inside the cluster
kubectl run db-debug --image=postgres:15 --restart=Never --rm -it \
  --env="PGPASSWORD=<your-password>" \
  -- psql -h <db-host> -U <user> -d <database> -c "SELECT 1;"

# Check RDS connection count (near max_connections = problem)
aws cloudwatch get-metric-statistics --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=<your-db-id> \
  --start-time $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 --statistics Maximum --output table

# Check max_connections on RDS
aws rds describe-db-parameters --db-parameter-group-name <param-group> \
  --query 'Parameters[?ParameterName==`max_connections`]' --output table

# Check SQS queue depth — is a consumer stuck?
aws sqs get-queue-attributes --queue-url <your-queue-url> \
  --attribute-names ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible \
  --output table

Layer 9: IAM and Network Path

# Check the IAM role on the pod's service account
kubectl get serviceaccount <sa-name> -n <namespace> -o yaml | grep eks.amazonaws.com

# Test IAM permissions from inside the pod
kubectl exec <pod-name> -n <namespace> -- aws sts get-caller-identity
kubectl exec <pod-name> -n <namespace> -- aws s3 ls s3://<your-bucket>/ 2>&1 | head -5

# Check security group inbound rules
aws ec2 describe-security-groups --group-ids <sg-id> \
  --query 'SecurityGroups[*].IpPermissions' --output table

# Check VPC routing
aws ec2 describe-route-tables --filters "Name=vpc-id,Values=<your-vpc-id>" \
  --query 'RouteTables[*].{ID:RouteTableId,Routes:Routes[*].{Dest:DestinationCidrBlock,GW:GatewayId,State:State}}' \
  --output table

# Check NAT Gateway port allocation errors
aws cloudwatch get-metric-statistics --namespace AWS/NATGateway \
  --metric-name ErrorPortAllocation \
  --dimensions Name=NatGatewayId,Value=<nat-gw-id> \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 --statistics Sum --output table
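
When security groups and routes look right, VPC Flow Logs show what was actually rejected. The summary below is a sketch that assumes the default flow log format, where dstaddr is field 5, dstport is field 7, and action is field 13; `rejects_by_destination` is a hypothetical helper name.

```shell
# Read VPC Flow Log records (default format) on stdin and count REJECT
# entries per destination address and port, busiest destination first.
rejects_by_destination() {
  awk '$13 == "REJECT" { count[$5 ":" $7]++ }
       END { for (d in count) print count[d], d }' | sort -rn
}

# Usage (the file name is illustrative; feed it exported flow log records):
#   rejects_by_destination < exported-flow-logs.txt
```

A burst of rejects toward one database or cache port is a strong hint that a security group or network ACL changed.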

Layer 10: CI/CD Pipeline

# Check GitHub Actions in-progress runs
curl -H "Authorization: token <your-token>" \
  "https://api.github.com/repos/<owner>/<repo>/actions/runs?status=in_progress&per_page=10" | \
  jq '.workflow_runs[] | {name:.name,status:.status,created_at:.created_at}'

# Check ArgoCD sync status
argocd app list
argocd app get <app-name> | grep -A5 "Health Status\|Sync Status"

# Check Helm release history
helm list --all-namespaces
helm history <release-name> -n <namespace>

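# Roll back a bad release to the previous revision
helm rollback <release-name> -n <namespace>

# Or roll back to a specific revision from the history (here, revision 3)
helm rollback <release-name> 3 -n <namespace>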

Two Real Incidents — Same Principle

Incident 1: “Everything Inside Kubernetes Looks Healthy”

Symptom: New pods couldn’t schedule. Pipelines timing out.

What we checked first (wrong): Pod logs, Helm values, deployment manifests, autoscaler logs.

What was actually wrong: The AWS account hit a vCPU quota limit. New EC2 instances couldn’t launch.
The Cluster Autoscaler requested scale-up, but the failed instance launches never surfaced inside the cluster.

The command that found it:

aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name <asg-name> \
  --max-items 5 \
  --output table

Relevant fields from the output:

StatusCode: Failed
StatusMessage: Your account is currently opted into the Launch Template restriction...
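
Scanning the activity table by eye is easy to get wrong under pressure, so it can help to print only the failures. This is a sketch under assumptions: `failed_scaling_activities` is a hypothetical helper, and it expects tab-separated rows of start time, status code, and status message, as requested by the `--query` in the usage comment.

```shell
# Read tab-separated "start-time<TAB>status-code<TAB>status-message" rows on
# stdin and print only the failed scaling activities.
# Exit status 0 means at least one failure was printed, 1 means none.
failed_scaling_activities() {
  awk -F '\t' '$2 == "Failed" { print $1 "  " $3; n++ }
               END { exit (n ? 0 : 1) }'
}

# Usage:
#   aws autoscaling describe-scaling-activities --auto-scaling-group-name <asg-name> \
#     --query 'Activities[*].[StartTime,StatusCode,StatusMessage]' --output text \
#     | failed_scaling_activities
```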

Fix: Requested quota increase, manually freed vCPU headroom from dev environments,
new nodes joined within 10 minutes.

🔴 Time lost: 3 hours. Fix time: 15 minutes.

Incident 2: “The Application Is Slow”

Symptom: P99 latency spiked from 200ms to 4s. No errors — just slow.

What the team checked first (wrong): Application code, recent deployments, GC tuning, thread pool config.

What was actually wrong: The RDS instance had 485 out of 500 max_connections in use.
New requests were queuing inside the connection pool. Application threads were blocked waiting for a DB connection.
The app was completely healthy — it was just starved of database connections.

The command that found it:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=prod-db \
  --start-time $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 --statistics Maximum --output table

Fix:

  • Immediately restarted pods with the highest connection counts (freed connections)
  • Added PgBouncer as a connection pooler in front of RDS
  • Capped the connection pool size per pod in the application config
  • Set up RDS Proxy for proper connection pooling long-term

🔴 Time lost: 2 hours. Fix time: 20 minutes.
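
The 485 out of 500 figure is the kind of number worth turning into an explicit check. The sketch below is illustrative: `connection_headroom` is a hypothetical helper, and the default warning threshold of 80% is an assumption, not a value from this incident.

```shell
# Print connection usage as a whole percentage. Returns 0 while usage is
# below the warning threshold (third argument, default 80), 1 otherwise.
connection_headroom() {
  used="$1"
  max="$2"
  warn_pct="${3:-80}"
  pct=$(( used * 100 / max ))
  echo "${pct}% of max_connections in use"
  [ "$pct" -lt "$warn_pct" ]
}

# Usage with the numbers from this incident:
#   connection_headroom 485 500 || echo "database is close to connection exhaustion"
```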

Cross-Layer Observability Setup

Prometheus Alerting Rules for Cross-Layer Signals

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cross-layer-alerts
  namespace: monitoring
spec:
  groups:
    - name: cross-layer
      rules:
        - alert: PodsPendingTooLong
          expr: kube_pod_status_phase{phase="Pending"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} stuck in Pending"
            description: "Check: kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}"

        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} restarting frequently"

        - alert: NodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} is not ready"

        - alert: CertificateExpiringSoon
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 30 * 24 * 60 * 60
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.name }} expires in less than 30 days"

CloudWatch Composite Alarm for Full-Stack Health

aws cloudwatch put-composite-alarm \
  --alarm-name "ProductionFullStackHealth" \
  --alarm-description "Fires if any layer of the production stack is unhealthy" \
  --alarm-rule "ALARM(ALB-5xx-Rate) OR ALARM(RDS-HighConnections) OR ALARM(ASG-LaunchFailures) OR ALARM(EKS-NodeNotReady)" \
  --alarm-actions <your-sns-topic-arn>

Grafana Dashboard — What to Monitor Per Layer

Panel 1:  DNS         — CoreDNS query error rate
Panel 2:  CDN         — CloudFront 5xx error rate + cache hit ratio
Panel 3:  LB          — ALB target healthy host count + 5xx rate
Panel 4:  Kubernetes  — Pending pods by reason + node count
Panel 5:  Application — Pod restart rate + OOMKilled count
Panel 6:  Database    — RDS connections + query latency
Panel 7:  Cache       — ElastiCache evictions + connection count
Panel 8:  Queue       — SQS message depth + consumer lag
Panel 9:  IAM/Network — VPC flow log drops (CloudWatch Logs Insights)
Panel 10: CI/CD       — Pipeline failure rate + deployment frequency

The Cross-Layer Debugging Runbook

Print this. Pin it to your team’s incident channel. Use it every time.

STEP 1 — DNS
  kubectl run debug --image=busybox --restart=Never --rm -it -- nslookup <service>
  kubectl logs -n kube-system -l k8s-app=kube-dns | grep -i error

STEP 2 — CDN / WAF
  aws cloudfront list-distributions
  aws wafv2 get-sampled-requests → look for blocked traffic

STEP 3 — Load Balancer
  aws elbv2 describe-target-health --target-group-arn <arn>
  Check: unhealthy targets, health check failures

STEP 4 — Ingress / TLS
  kubectl describe ingress -n <namespace>
  kubectl get certificate -A → check Ready=True and expiry date
  openssl s_client -connect <domain>:443 → check certificate dates

STEP 5 — Service Mesh (if Istio)
  istioctl proxy-status
  istioctl analyze -n <namespace>

STEP 6 — Kubernetes Service and Pod
  kubectl get endpoints <service> -n <ns> → are they populated?
  kubectl describe pod <pod> → check Events
  kubectl logs <pod> --previous → check if it crashed

STEP 7 — Secrets and Config
  kubectl get externalsecret -A → check sync status
  kubectl exec <pod> -- env | grep <key> → verify secret is mounted

STEP 8 — Database / Cache / Queue
  aws cloudwatch → RDS DatabaseConnections (near max_connections?)
  aws cloudwatch → ElastiCache CurrConnections
  aws sqs get-queue-attributes → is message depth growing?

STEP 9 — IAM and Network
  kubectl exec <pod> -- aws sts get-caller-identity
  aws ec2 describe-security-groups → check inbound rules
  VPC Flow Logs → look for REJECT entries

STEP 10 — CI/CD
  Check pipeline queue time (GitHub Actions / ArgoCD)
  helm history <release> → did a recent deployment fail silently?
  helm rollback <release> -n <namespace>
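
The runbook order can also be enforced by a small wrapper that runs one check per layer and stops at the first failure. This is a minimal sketch, not a finished tool: `first_failing_layer` and the check function names in the usage comment are hypothetical, and each check is assumed to return non-zero when its layer looks unhealthy.

```shell
# Run the named check functions in order and stop at the first one that fails.
# Prints the name of the failing check and returns 1, or prints
# "all checks passed" and returns 0.
first_failing_layer() {
  for check in "$@"; do
    if ! "$check" >/dev/null 2>&1; then
      echo "$check"
      return 1
    fi
  done
  echo "all checks passed"
}

# Usage (define one function per runbook step, then list them top-down):
#   check_dns()  { nslookup <service>; }
#   check_pods() { kubectl get endpoints <service> -n <ns>; }
#   first_failing_layer check_dns check_pods
```

Keeping the checks as separate functions means the order stays the same as the runbook, and a new layer is one more function name on the command line.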

My Rule During Any Incident

Every time I start debugging a production failure, I follow the same five steps:

  1. Do not stop at the error message.
    The error message tells you where the failure surfaced. It rarely tells you where it started.
  2. Trace the path.
    Walk the full chain from user to database. Ask: at which layer did the request fail? Check each one.
  3. Check the last change.
    What was deployed in the last 24 hours? What changed in infra? Secret rotation? Config update?
    IAM policy change? In my experience, the large majority of incidents trace back to a recent change.

    # Check recent Kubernetes events
    kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -30
    
    # Check recent Helm deployments
    helm history <release> -n <namespace>
    
    # Check recent AWS config changes
    aws configservice get-resource-config-history \
      --resource-type AWS::EKS::Cluster \
      --resource-id <cluster-name> \
      --limit 5
  4. Validate the dependency.
    Is the database reachable? Is the cache responding? Is the queue draining? Is the IAM role valid?
    Don’t assume — test.
  5. Confirm the capacity.
    Is there enough: CPU? Memory? DB connections? vCPU quota? Pods? Nodes? NAT Gateway ports?
    Pick the one that’s relevant and check it explicitly.

Key Takeaways

The best DevOps teams are not the ones that react the fastest. They are the ones that have built the
right signals at every layer so they never spend 3 hours debugging the wrong thing.

Build observability across every layer of the chain — not just the application.

Silent failures are dangerous precisely because no individual layer screams for help. The application
layer looks fine. The cluster looks fine. The pipeline looks fine. Only when you trace the full path does
the gap become visible.

Your goal is not to have a dashboard for everything. Your goal is to have the right alert fire at the
right layer before a user ever notices.

What Is the Sneakiest Hidden Failure You Have Debugged?

After 10 years of production incidents, I have learned that the most expensive bugs are not the loud
ones. They are the ones where every dashboard looks “almost healthy.”

Drop your hidden failure story in the comments — what layer was nobody watching?


Tags:
#Kubernetes   #DevOps   #AWS   #SRE  
#Observability   #IncidentManagement   #CloudArchitecture  
#Debugging   #ProductionSupport   #EKS   #Istio  
#Prometheus   #Grafana
