Kubernetes Network Troubleshooting Guide
Kubernetes Network Troubleshooting Guide
Network issues are the #1 problem in production Kubernetes. This guide covers real-world scenarios and debugging techniques.
Common Network Issues Overview
- Service Discovery - “Can’t connect to service”
- DNS Resolution - “Cannot resolve hostname”
- Network Policies - “Connection blocked”
- Load Balancer - “External access not working”
- Pod-to-Pod - “Pods can’t talk to each other”
Issue 1: Service Discovery Problems
Symptoms:
- “Connection refused”
- “No such host”
- “Unable to connect to service”
Step 1: Verify Service Exists
# Check if service exists
kubectl get svc -n healthcare-dev
# Check service details
kubectl describe svc healthcare-demo-dev -n healthcare-dev
Step 2: Check Endpoints
# Are there any pods backing the service?
kubectl get endpoints healthcare-demo-dev -n healthcare-dev
# Output should show IP addresses:
# NAME ENDPOINTS AGE
# healthcare-demo-dev 10.244.1.5:80,10.244.1.5:443 5m
**If ENDPOINTS is empty or
Step 3: Verify Selector Matches
# Check service selector
kubectl get svc healthcare-demo-dev -n healthcare-dev -o yaml | grep -A5 selector
# Check if pods have matching labels
kubectl get pods -n healthcare-dev --show-labels
Solution:
# Service selector must match pod labels exactly
selector:
app.kubernetes.io/name: healthcare-demo
app.kubernetes.io/instance: healthcare-demo-dev
Issue 2: DNS Resolution Failures
Symptoms:
- “Name or service not known”
- “Temporary failure in name resolution”
- “Could not resolve host”
Step 1: Test DNS from Inside a Pod
# Launch a debug pod
kubectl run -it --rm debug --image=busybox --restart=Never -n healthcare-dev -- sh
# Inside the pod, test DNS
nslookup healthcare-demo-dev
nslookup healthcare-demo-dev.healthcare-dev.svc.cluster.local
nslookup kubernetes.default
Step 2: Check CoreDNS
# Check if CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns
Step 3: Test Different DNS Formats
# Short name (works only within same namespace)
curl http://healthcare-demo-dev
# Namespace qualified
curl http://healthcare-demo-dev.healthcare-dev
# Fully qualified domain name (FQDN)
curl http://healthcare-demo-dev.healthcare-dev.svc.cluster.local
Common DNS Issues:
- Wrong service name - It’s case-sensitive!
- Wrong namespace - Services in different namespaces need namespace in URL
- CoreDNS down - Check kube-system namespace
Issue 3: Connection Timeouts
Symptoms:
- Connection hangs
- “Operation timed out”
- Works sometimes, fails sometimes
Step 1: Test Connectivity
# From your local machine
kubectl port-forward svc/healthcare-demo-dev 8080:80 -n healthcare-dev
# Test
curl -v http://localhost:8080
# From inside cluster
kubectl run test-curl --rm -it --image=curlimages/curl -- sh
curl -v http://healthcare-demo-dev.healthcare-dev:80
Step 2: Check Network Policies
# List network policies
kubectl get networkpolicies -n healthcare-dev
# If any exist, check if they're blocking traffic
kubectl describe networkpolicy <policy-name> -n healthcare-dev
Step 3: Trace the Network Path
# Get pod IPs
kubectl get pods -n healthcare-dev -o wide
# Test direct pod-to-pod connectivity
kubectl exec -it <source-pod> -n healthcare-dev -- curl <target-pod-ip>:80
Issue 4: Intermittent Connection Failures
Symptoms:
- Works 50% of the time
- Random “connection refused”
- Load balancing issues
Debugging Steps:
# Check if all pod replicas are healthy
kubectl get pods -n healthcare-dev
kubectl describe pods -n healthcare-dev | grep -A10 "Conditions:"
# Watch endpoints in real-time
kubectl get endpoints healthcare-demo-dev -n healthcare-dev -w
# Test multiple times to see pattern
for i in {1..10}; do
kubectl exec -it <test-pod> -- curl -s -o /dev/null -w "%{http_code}\n" http://healthcare-demo-dev
done
Common Causes:
- Failing health checks - Pods removed from service
- Resource limits - Pods getting OOMKilled
- Network policies - Partial blocking
Issue 5: External Access Problems
Symptoms:
- Can’t access from outside cluster
- LoadBalancer stuck in “Pending”
- Ingress not working
For LoadBalancer Services:
# Check LoadBalancer status
kubectl get svc -n healthcare-dev
# If EXTERNAL-IP is <pending>
kubectl describe svc healthcare-demo-dev -n healthcare-dev
# Look for events explaining why
For Ingress:
# Check ingress
kubectl get ingress -n healthcare-dev
kubectl describe ingress <ingress-name> -n healthcare-dev
# Check ingress controller
kubectl get pods -n ingress-nginx
kubectl logs -n ingress-nginx <ingress-controller-pod>
Network Debugging Tools
1. Network Debugging Pod
# Create a debug pod with network tools
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- bash
# Tools available:
# - ping, curl, wget
# - dig, nslookup, host
# - tcpdump, netstat, ss
# - iptables, ipvsadm
2. TCP/UDP Connectivity Test
# Test TCP port
kubectl exec -it <pod> -- nc -zv <target-host> <port>
# Test with timeout
kubectl exec -it <pod> -- timeout 5 bash -c "cat < /dev/null > /dev/tcp/<host>/<port>"
3. DNS Debugging
# Detailed DNS query
kubectl exec -it <pod> -- dig +search +short healthcare-demo-dev
# Show full DNS resolution
kubectl exec -it <pod> -- dig +trace healthcare-demo-dev.healthcare-dev.svc.cluster.local
4. Service Discovery Debug
# See how service discovery works
kubectl exec -it <pod> -- env | grep -i service
# Each service creates env vars:
# HEALTHCARE_DEMO_DEV_SERVICE_HOST=10.96.123.45
# HEALTHCARE_DEMO_DEV_SERVICE_PORT=80
Real-World Debugging Workflow
Scenario: “MySQL Connection Refused”
# 1. Check if MySQL is running
kubectl get pods -n healthcare-dev
# healthcare-demo-dev-xxx 3/4 Running (MySQL container not ready)
# 2. Check MySQL logs
kubectl logs <pod> -n healthcare-dev -c mysql
# 3. Test connection from PHP container
kubectl exec -it <pod> -n healthcare-dev -c php-fpm -- mysql -h 127.0.0.1 -u root -p
# 4. Check if it's a localhost issue (common in multi-container pods)
kubectl exec -it <pod> -n healthcare-dev -c php-fpm -- nc -zv 127.0.0.1 3306
kubectl exec -it <pod> -n healthcare-dev -c php-fpm -- nc -zv localhost 3306
Scenario: “Service Unavailable from Another Namespace”
# 1. Try FQDN
kubectl run test --rm -it --image=busybox -n other-namespace -- wget -O- http://healthcare-demo-dev.healthcare-dev.svc.cluster.local
# 2. Check NetworkPolicies
kubectl get networkpolicies --all-namespaces
# 3. Test from same namespace (to isolate issue)
kubectl run test --rm -it --image=busybox -n healthcare-dev -- wget -O- http://healthcare-demo-dev
Quick Network Debugging Checklist
- Service exists? →
kubectl get svc - Endpoints populated? →
kubectl get endpoints - Labels match? →
kubectl get pods --show-labels - DNS working? →
nslookupfrom inside pod - Port correct? → Check service and pod ports
- Network policy? →
kubectl get networkpolicies - Firewall/Security groups? → Cloud provider specific
Common Fixes
Fix 1: Service Selector Mismatch
# Check pod labels
kubectl get pods --show-labels
# Update service selector to match
kubectl edit svc healthcare-demo-dev
Fix 2: DNS Not Resolving
# Restart CoreDNS
kubectl rollout restart deployment coredns -n kube-system
# Or delete pods to force restart
kubectl delete pods -n kube-system -l k8s-app=kube-dns
Fix 3: Connection Refused
# Usually means:
# 1. Service port wrong
# 2. Container not listening on expected port
# 3. Application crashed
# Verify port
kubectl get svc healthcare-demo-dev -o yaml | grep port
kubectl exec -it <pod> -- netstat -tlnp
Interview Answer Template
When asked about network issues, structure your answer:
- Identify the layer
- “Is this a DNS issue, service discovery, or network policy?”
- Gather information
- “I’ll check services, endpoints, and pod status”
- Test connectivity
- “I’ll test from inside a pod to isolate the issue”
- Check configurations
- “I’ll verify selectors, ports, and policies”
- Apply fix
- “Based on the findings, I’ll update the configuration”
Example:
“I see connection refused. First, I’ll check if the service exists with
kubectl get svc. Then verify endpoints are populated withkubectl get endpoints. If empty, I’ll check pod labels match the service selector…”
Pro Tips
- Always test from inside the cluster - External issues are different from internal
- Use FQDN for cross-namespace - Short names only work within namespace
- Check both TCP and UDP - Some services use UDP (DNS, game servers)
- Remember localhost in multi-container pods - Containers share network namespace
- NetworkPolicies are deny-by-default - Once you add one, all other traffic is blocked
This guide covers 90% of network issues you’ll see in production Kubernetes!