Kubernetes Real-World Troubleshooting Guide
Kubernetes Troubleshooting Guide - Real World Scenarios
This guide documents actual issues encountered in production and how to diagnose/fix them. These are common problems you’ll face running applications in Kubernetes.
Issue 1: PHP Redis Extension Build Failure (ARM64)
Symptoms:
ERROR: failed to solve: process "/bin/sh -c pecl install redis && docker-php-ext-enable redis" did not complete successfully: exit code: 1
Diagnosis Commands:
# Check your architecture
uname -m # Shows arm64 on M1/M2 Macs
# Check Docker build logs
docker build --no-cache -f docker/php-fpm/Dockerfile .
Root Cause:
PECL extensions sometimes fail to compile on ARM64 without proper build tools.
Solution:
Add build dependencies and use specific version:
# Add build tools
RUN apk add --no-cache \
autoconf \
g++ \
make \
linux-headers
# Install Redis with proper cleanup
RUN apk add --no-cache --virtual .build-deps $PHPIZE_DEPS \
&& pecl install redis-5.3.7 \
&& docker-php-ext-enable redis \
&& apk del .build-deps
Issue 2: MySQL Container Permission Denied
Symptoms:
mkdir: cannot create directory '/var/log/mysql': Permission denied
Diagnosis Commands:
# Check pod status
kubectl get pods -n healthcare-dev
# Check container logs
kubectl logs <pod-name> -n healthcare-dev -c mysql
# Check security context
kubectl get pod <pod-name> -n healthcare-dev -o yaml | grep -A10 securityContext
Root Cause:
MySQL container was forced to run as user 1001, but MySQL needs to run as the mysql user.
Solution:
- Remove custom securityContext from MySQL in values.yaml
- Let MySQL use its default user:
```yaml
Before (wrong):
mysql: securityContext: runAsUser: 1001
After (correct):
mysql: # No securityContext - use container defaults
---
## Issue 3: MySQL Startup Failure - Invalid Configuration
### Symptoms:
[ERROR] [MY-000067] [Server] unknown variable ‘query_cache_type=1’. [ERROR] [MY-010119] [Server] Aborting
### Diagnosis Commands:
```bash
# Get detailed logs from crashed container
kubectl logs <pod-name> -n healthcare-dev -c mysql --previous
# Check ConfigMap
kubectl get configmap <configmap-name> -n healthcare-dev -o yaml
Root Cause:
MySQL 8.0 removed query cache, but our config still had it.
Solution:
Remove deprecated options from ConfigMap:
# Remove these lines:
# query_cache_type = 1
# query_cache_size = 32M
Alternative Quick Fix:
Disable custom MySQL config temporarily:
volumeMounts:
- name: mysql-data
mountPath: /var/lib/mysql
# Comment out custom config
# - name: mysql-config
# mountPath: /etc/mysql/conf.d
Issue 4: “Primary Script Unknown” - PHP Files Not Found
Symptoms:
FastCGI sent in stderr: "Primary script unknown"
Diagnosis Commands:
# Check if files exist in container
kubectl exec -it <pod-name> -n healthcare-dev -c php-fpm -- ls -la /var/www/html/
kubectl exec -it <pod-name> -n healthcare-dev -c php-fpm -- ls -la /var/www/html/public/
# Check volume mounts
kubectl describe pod <pod-name> -n healthcare-dev | grep -A20 "Mounts:"
Root Cause:
Application files were copied during Docker build, but then overwritten by empty volume mount.
Solution:
Remove volume mount that overwrites application code:
# Before:
volumeMounts:
- name: app-content
mountPath: /var/www/html # This overwrites our COPY!
# After:
volumeMounts:
# app-content removed - files are in the image
Key Learning:
- Volume mounts OVERRIDE anything at that path from the image
- Use volumes for data that changes, not for static code
Issue 5: PHP Extensions Not Found (PDO MySQL, Redis)
Symptoms:
Database connection failed: could not find driver
PHP Fatal error: Class "Redis" not found
Diagnosis Commands:
# Check loaded PHP extensions
kubectl exec -it <pod-name> -n healthcare-dev -c php-fpm -- php -m
# Check specific extensions
kubectl exec -it <pod-name> -n healthcare-dev -c php-fpm -- php -m | grep -E "pdo|redis"
# Test PHP configuration
kubectl exec -it <pod-name> -n healthcare-dev -c php-fpm -- php -i | grep -E "pdo|redis"
Root Cause:
PHP namespace issue - classes were being looked for in wrong namespace.
Solution:
Use global namespace for PHP built-in classes:
// Before:
use Redis;
use PDO;
// After:
use \Redis; // Note the backslash
use \PDO; // This means "global namespace"
General Kubernetes Debugging Commands
1. Pod Not Starting
# Get pod status
kubectl get pods -n <namespace>
# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>
# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
2. Container Crashing
# Current logs
kubectl logs <pod-name> -n <namespace> -c <container-name>
# Previous crashed container logs
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous
# Follow logs in real-time
kubectl logs -f <pod-name> -n <namespace> -c <container-name>
3. Exec Into Container
# Bash shell
kubectl exec -it <pod-name> -n <namespace> -c <container-name> -- /bin/bash
# Sh shell (for Alpine)
kubectl exec -it <pod-name> -n <namespace> -c <container-name> -- /bin/sh
# Run single command
kubectl exec <pod-name> -n <namespace> -c <container-name> -- <command>
4. Resource Issues
# Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes
# Check resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Limits:"
5. Configuration Issues
# Check ConfigMaps
kubectl get configmaps -n <namespace>
kubectl describe configmap <name> -n <namespace>
# Check Secrets
kubectl get secrets -n <namespace>
kubectl get secret <name> -n <namespace> -o yaml
# Decode secret
kubectl get secret <name> -n <namespace> -o jsonpath='{.data.<key>}' | base64 -d
6. Networking Issues
# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>
# Check endpoints
kubectl get endpoints -n <namespace>
# Test connectivity from pod
kubectl exec -it <pod-name> -n <namespace> -- curl <service-name>:<port>
Debugging Workflow for Interviews
When presented with a failing pod, follow this sequence:
- Check Pod Status
kubectl get pods -n <namespace> # Look for: Pending, CrashLoopBackOff, Error, etc. - Get More Details
kubectl describe pod <pod-name> -n <namespace> # Look for: Events, Failed mounts, Image pulls, etc. - Check Logs
kubectl logs <pod-name> -n <namespace> # If multi-container pod: kubectl logs <pod-name> -n <namespace> -c <container-name> - Check Previous Logs (if crashing)
kubectl logs <pod-name> -n <namespace> --previous - Exec Into Container (if running)
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh # Then check files, test connections, etc. - Check Resources
kubectl top pods -n <namespace> # Check if hitting resource limits - Review Configuration
# ConfigMaps kubectl get cm -n <namespace> # Secrets kubectl get secrets -n <namespace> # Volume mounts kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A20 volumeMounts
Common Patterns and Solutions
Pattern 1: CrashLoopBackOff
- Check: Previous logs
- Common causes: Bad config, missing dependencies, permissions
Pattern 2: ImagePullBackOff
- Check: Image name, registry credentials
- Fix: Correct image tag, add imagePullSecrets
Pattern 3: Pending Pod
- Check: Events, node resources
- Common causes: No nodes available, PVC not bound
Pattern 4: “Permission Denied”
- Check: Security context, file ownership
- Fix: Adjust securityContext or file permissions
Pattern 5: “Connection Refused”
- Check: Service names, ports, network policies
- Fix: Verify service discovery, check firewall rules
Interview Tips
- Always check logs first - They usually tell you exactly what’s wrong
- Use –previous for crashed containers - Current logs might be empty
- Describe pod shows events - Great for mount/pull/scheduling issues
- Exec is powerful - Test connections, check files, run diagnostics
- Volume mounts override image content - Common gotcha
- Security contexts affect permissions - Each container can have different user
Quick Reference Card
# The Holy Trinity of Debugging
kubectl get pods -n <namespace>
kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace> [-c <container>] [--previous]
# Interactive Debugging
kubectl exec -it <pod> -n <namespace> [-c <container>] -- /bin/sh
# Configuration Check
kubectl get cm,secret,svc,ep -n <namespace>
# Resource Check
kubectl top pods -n <namespace>
Systematic Debugging Approach
Remember: In interviews, verbalize your debugging process. Say things like:
- “First, I’ll check the pod status…”
- “The logs show a permission error, so I’ll check the security context…”
- “This looks like a configuration issue, let me check the ConfigMap…”
This shows systematic thinking, which is what interviewers want to see!
The 5-Step Debug Method:
- Observe - What’s the actual symptom?
- Gather - Collect logs, events, configs
- Hypothesize - What could cause this?
- Test - Validate your hypothesis
- Fix - Apply the solution and verify
Example Application:
Symptom: Pod in CrashLoopBackOff
- Observe:
kubectl get podsshows CrashLoopBackOff - Gather:
kubectl logs pod --previousshows “Permission denied on /data” - Hypothesize: Security context or volume permissions issue
- Test:
kubectl describe podshows runAsUser: 1001, but volume owned by root - Fix: Update securityContext or initContainer to fix permissions
Pro Tips for Production
- Always use –previous for crashed pods - Current container might have just started
- Events are time-sensitive - They expire, so check early
- Multi-container pods need -c flag - Specify which container’s logs
- Resource limits cause OOMKilled - Check memory limits vs actual usage
- InitContainers fail differently - Check them separately if pod stuck initializing
- Volume mount issues prevent pod start - Check PVC status and mount paths
- Image pull errors are common - Verify image exists and credentials are correct
This guide covers the most common real-world Kubernetes issues you’ll encounter in production environments!