Kubernetes Troubleshooter & Incident Response Systematic approach to diagnosing and resolving Kubernetes issues in production environments. Core Troubleshooting Workflow Follow this systematic approach for any Kubernetes issue: 1. Gather Context - What is the observed symptom? - When did it start? - What changed recently (deployments, config, infrastructure)? - What is the scope (single pod, service, node, cluster)? - What is the business impact (severity level)? 2. Initial Triage Run cluster health check: This provides an overview of: - Node health status - Pending and failed pods across all…

\\t'\n```\n\n**Resolution:**\n- Fix YAML syntax errors\n- Replace tabs with spaces\n- Ensure proper indentation\n- Quote special characters in strings\n\n### Secret Values Not Working\n\n**Symptoms:**\n- Secrets not created or contain wrong values\n- Base64 encoding issues\n\n**Investigation:**\n```bash\n# Check secret in manifest\nhelm get manifest \u003crelease-name> -n \u003cnamespace> | grep -A 10 \"kind: Secret\"\n\n# Decode secret\nkubectl get secret \u003csecret-name> -n \u003cnamespace> -o json | \\\n jq '.data | map_values(@base64d)'\n```\n\n**Resolution:**\n```yaml\n# Use proper secret format in values.yaml\nsecrets:\n password: \"mySecretPassword\" # Helm will base64 encode\n\n# Or pre-encode if template expects it\nsecrets:\n password: \"bXlTZWNyZXRQYXNzd29yZA==\" # Already base64 encoded\n```\n\n---\n\n## Chart Dependencies\n\n### Dependency Update Fails\n\n**Symptoms:**\n```\nError: An error occurred while checking for chart dependencies\n```\n\n**Investigation:**\n```bash\n# Check Chart.yaml dependencies\ncat Chart.yaml\n\n# List current dependencies\nhelm dependency list \u003cchart-directory>\n\n# Check repository access\nhelm repo list\nhelm repo update\n```\n\n**Resolution:**\n```bash\n# Update dependencies\nhelm dependency update \u003cchart-directory>\n\n# Build dependencies (downloads to charts/)\nhelm dependency build \u003cchart-directory>\n\n# Add missing repositories\nhelm repo add \u003crepo-name> \u003crepo-url>\nhelm repo update\n```\n\n### Dependency Version Conflicts\n\n**Symptoms:**\n```\nError: found in Chart.yaml, but missing in charts/ directory\n```\n\n**Resolution:**\n```bash\n# Clean dependencies\nrm -rf \u003cchart-directory>/charts/*\nrm -f \u003cchart-directory>/Chart.lock\n\n# Rebuild\nhelm dependency update \u003cchart-directory>\n\n# Verify\nhelm dependency list \u003cchart-directory>\n```\n\n### Subchart Values Not Applied\n\n**Investigation:**\n```bash\n# Check subchart values in parent chart\ncat values.yaml | grep -A 
20 \u003csubchart-name>\n\n# Render to see what values subchart receives\nhelm template \u003crelease-name> \u003cchart> -f values.yaml | grep -A 50 \"# Source: \u003csubchart-name>\"\n```\n\n**Resolution:**\n```yaml\n# In parent chart's values.yaml, nest subchart values under subchart name:\npostgresql: # Subchart name\n auth:\n username: myuser\n password: mypass\n database: mydb\n primary:\n resources:\n requests:\n memory: \"256Mi\"\n cpu: \"250m\"\n```\n\n---\n\n## Hooks and Lifecycle\n\n### Pre/Post Hooks Failing\n\n**Symptoms:**\n- Installation/upgrade hangs waiting for hooks\n- Hook jobs fail\n- Release stuck in pending state\n\n**Investigation:**\n```bash\n# List hooks\nkubectl get jobs -n \u003cnamespace> -l \"helm.sh/hook\"\n\n# Check hook status\nkubectl describe job \u003chook-job-name> -n \u003cnamespace>\n\n# Get hook logs\nkubectl logs -n \u003cnamespace> -l \"helm.sh/hook=pre-install\"\nkubectl logs -n \u003cnamespace> -l \"helm.sh/hook=post-install\"\n```\n\n**Resolution:**\n```bash\n# Delete failed hooks\nkubectl delete job -n \u003cnamespace> -l \"helm.sh/hook\"\n\n# Retry without hooks\nhelm upgrade \u003crelease-name> \u003cchart> -n \u003cnamespace> --no-hooks\n\n# Or skip hooks during install\nhelm install \u003crelease-name> \u003cchart> -n \u003cnamespace> --no-hooks\n```\n\n### Hook Cleanup Issues\n\n**Symptoms:**\n- Hook resources remain after installation\n- Accumulating failed hook jobs\n\n**Investigation:**\n```bash\n# Check hook deletion policy\nhelm get manifest \u003crelease-name> -n \u003cnamespace> | grep -B 5 \"helm.sh/hook-delete-policy\"\n\n# List remaining hooks\nkubectl get all -n \u003cnamespace> -l \"helm.sh/hook\"\n```\n\n**Resolution:**\n```bash\n# Manual cleanup\nkubectl delete jobs,pods -n \u003cnamespace> -l \"helm.sh/hook\"\n\n# Update chart template to include proper hook-delete-policy:\n# metadata:\n# annotations:\n# \"helm.sh/hook\": pre-install\n# \"helm.sh/hook-delete-policy\": 
hook-succeeded,hook-failed\n```\n\n---\n\n## Repository Issues\n\n### Unable to Get Chart from Repository\n\n**Symptoms:**\n```\nError: failed to download \"\u003cchart-name>\"\n```\n\n**Investigation:**\n```bash\n# Check repository configuration\nhelm repo list\n\n# Update repositories\nhelm repo update\n\n# Search for chart\nhelm search repo \u003cchart-name> --versions\n\n# Test repository access\ncurl -I \u003crepo-url>/index.yaml\n```\n\n**Resolution:**\n```bash\n# Remove and re-add repository\nhelm repo remove \u003crepo-name>\nhelm repo add \u003crepo-name> \u003crepo-url>\nhelm repo update\n\n# For private repos, configure credentials\nhelm repo add \u003crepo-name> \u003crepo-url> \\\n --username=\u003cusername> \\\n --password=\u003cpassword>\n\n# Or use OCI registry\nhelm pull oci://registry.example.com/charts/\u003cchart-name> --version 1.0.0\n```\n\n### Chart Version Not Found\n\n**Symptoms:**\n```\nError: chart \"\u003cchart-name>\" version \"1.2.3\" not found\n```\n\n**Investigation:**\n```bash\n# List available versions\nhelm search repo \u003cchart-name> --versions\n\n# Check if specific version exists\nhelm show chart \u003crepo-name>/\u003cchart-name> --version 1.2.3\n```\n\n**Resolution:**\n```bash\n# Use available version\nhelm install \u003crelease-name> \u003crepo-name>/\u003cchart-name> --version \u003cavailable-version>\n\n# Or use latest\nhelm install \u003crelease-name> \u003crepo-name>/\u003cchart-name>\n```\n\n---\n\n## Debugging Tools and Commands\n\n### Essential Helm Commands\n\n```bash\n# Get release information\nhelm list -n \u003cnamespace> --all\nhelm status \u003crelease-name> -n \u003cnamespace>\nhelm history \u003crelease-name> -n \u003cnamespace>\n\n# Get release content\nhelm get values \u003crelease-name> -n \u003cnamespace>\nhelm get manifest \u003crelease-name> -n \u003cnamespace>\nhelm get hooks \u003crelease-name> -n \u003cnamespace>\nhelm get notes \u003crelease-name> -n \u003cnamespace>\n\n# Debugging\nhelm install 
\u003crelease-name> \u003cchart> --debug --dry-run -n \u003cnamespace>\nhelm template \u003crelease-name> \u003cchart> --debug -n \u003cnamespace>\n\n# Testing\nhelm test \u003crelease-name> -n \u003cnamespace>\nhelm lint \u003cchart-directory>\n```\n\n### Useful Plugins\n\n```bash\n# Install helm-diff plugin\nhelm plugin install https://github.com/databus23/helm-diff\n\n# Preview what an upgrade would change\nhelm diff upgrade \u003crelease-name> \u003cchart> -n \u003cnamespace>\n\n# Install helm-secrets plugin\nhelm plugin install https://github.com/jkroepke/helm-secrets\n\n# Use encrypted values\nhelm secrets install \u003crelease-name> \u003cchart> -f secrets.yaml -n \u003cnamespace>\n```\n\n### Helm Environment Issues\n\n**Check Helm configuration:**\n```bash\n# Helm version\nhelm version\n\n# Kubernetes context\nkubectl config current-context\n\n# Helm environment\nhelm env\n\n# Cache location\nhelm env | grep CACHE\n```\n\n---\n\n## Best Practices\n\n### Release Management\n- Use descriptive release names\n- Always specify the namespace explicitly\n- Use the `--atomic` flag for safer upgrades (rolls back on failure)\n- Keep release history manageable: pass `--history-max 10` to `helm upgrade` to cap stored revisions (`helm history --max` only limits how many revisions are displayed)\n\n### Values Management\n- Use multiple values files for different environments\n- Version control your values files\n- Use `helm template` to preview changes before applying\n- Document required values in the chart README\n\n### Chart Development\n- Always run `helm lint` before packaging\n- Test charts in multiple environments\n- Use semantic versioning for charts\n- Implement proper hooks with deletion policies\n\n### Troubleshooting Workflow\n1. Check release status: `helm status \u003crelease> -n \u003cnamespace>`\n2. Check history: `helm history \u003crelease> -n \u003cnamespace>`\n3. Get values: `helm get values \u003crelease> -n \u003cnamespace>`\n4. Check manifest: `helm get manifest \u003crelease> -n \u003cnamespace>`\n5. 
Check Kubernetes events: `kubectl get events -n \u003cnamespace>`\n6. Check pod logs: `kubectl logs \u003cpod> -n \u003cnamespace>`\n7. Check hooks: `kubectl get jobs -n \u003cnamespace> -l helm.sh/hook`\n\n---\n\n## Quick Reference\n\n### Common Flags\n\n```bash\n# Installation/Upgrade\n--atomic # Rollback on failure\n--wait # Wait for resources to be ready\n--timeout 10m # Set timeout (default 5m)\n--force # Force update by deleting and recreating resources\n--cleanup-on-fail # Delete resources on failed install\n\n# Debugging\n--debug # Enable verbose output\n--dry-run # Simulate operation\n--no-hooks # Skip hooks\n\n# Values\n-f values.yaml # Use values file\n--set key=value # Set value via command line\n--reuse-values # Reuse values from previous release\n```\n\n### Typical Rescue Commands\n\n```bash\n# Release stuck? Force delete and reinstall\nhelm uninstall \u003crelease> -n \u003cnamespace> --no-hooks\nkubectl delete secret -n \u003cnamespace> -l owner=helm,name=\u003crelease>\nhelm install \u003crelease> \u003cchart> -n \u003cnamespace> -f values.yaml\n\n# Upgrade failed? Rollback\nhelm rollback \u003crelease> 0 -n \u003cnamespace> # 0 = previous revision\n\n# Can't rollback? Force upgrade\nhelm upgrade \u003crelease> \u003cchart> -n \u003cnamespace> --force --recreate-pods\n\n# Complete cleanup\nhelm uninstall \u003crelease> -n \u003cnamespace>\nkubectl delete namespace \u003cnamespace> # If dedicated namespace\n```\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":16570,"content_sha256":"ff7f272643a36106c832321eff0f2665b2752f4417c4688823ebbdce857e0cb4"},{"filename":"references/incident_response.md","content":"# Kubernetes Incident Response Playbook\n\nThis playbook provides structured procedures for responding to Kubernetes incidents.\n\n## Incident Response Framework\n\n### 1. Detection Phase\n- Identify the incident (alerts, user reports, monitoring)\n- Determine severity level\n- Initiate incident response\n\n### 2. 
Triage Phase\n- Assess impact and scope\n- Gather initial diagnostic data\n- Determine if immediate action needed\n\n### 3. Investigation Phase\n- Collect comprehensive diagnostics\n- Identify root cause\n- Document findings\n\n### 4. Resolution Phase\n- Apply remediation\n- Verify fix\n- Monitor for recurrence\n\n### 5. Post-Incident Phase\n- Document incident\n- Conduct blameless post-mortem\n- Implement preventive measures\n\n---\n\n## Severity Levels\n\n### SEV-1: Critical\n- Complete service outage\n- Data loss or corruption\n- Security breach\n- Impact: All users affected\n- Response: Immediate, all-hands\n\n### SEV-2: High\n- Major functionality degraded\n- Significant performance impact\n- Impact: Large subset of users\n- Response: Within 15 minutes\n\n### SEV-3: Medium\n- Minor functionality impaired\n- Workaround available\n- Impact: Some users affected\n- Response: Within 1 hour\n\n### SEV-4: Low\n- Cosmetic issues\n- Negligible impact\n- Impact: Minimal\n- Response: During business hours\n\n---\n\n## Common Incident Scenarios\n\n### Scenario 1: Complete Cluster Outage\n\n**Symptoms:**\n- All services unreachable\n- kubectl commands timing out\n- API server not responding\n\n**Immediate Actions:**\n1. Verify the scope (single cluster or multi-cluster)\n2. Check API server status and logs\n3. Check control plane nodes\n4. Verify network connectivity to control plane\n5. Check etcd cluster health\n\n**Investigation Steps:**\n```bash\n# Check control plane pods\nkubectl get pods -n kube-system\n\n# Check etcd\nkubectl exec -it etcd-\u003cnode> -n kube-system -- etcdctl endpoint health\n\n# Check API server logs\njournalctl -u kube-apiserver -n 100\n\n# Check control plane node resources\nssh \u003ccontrol-plane-node> \"top\"\n```\n\n**Common Causes:**\n- etcd cluster failure\n- API server OOM/crash\n- Control plane network partition\n- Certificate expiration\n- Cloud provider outage\n\n**Resolution Paths:**\n1. 
etcd issue: Restore from backup or rebuild cluster\n2. API server issue: Restart API server pods/service\n3. Network: Fix routing, security groups, or DNS\n4. Certificates: Renew certificates (`kubeadm certs renew all`)\n\n---\n\n### Scenario 2: Service Degradation\n\n**Symptoms:**\n- Increased latency or error rates\n- Some requests failing\n- Intermittent issues\n\n**Immediate Actions:**\n1. Check service metrics and logs\n2. Verify pod health and count\n3. Check for recent deployments\n4. Review resource utilization\n\n**Investigation Steps:**\n```bash\n# Check service endpoints\nkubectl get endpoints \u003cservice> -n \u003cnamespace>\n\n# Check pod status\nkubectl get pods -l \u003cservice-selector> -n \u003cnamespace>\n\n# Review recent changes\nkubectl rollout history deployment/\u003cname> -n \u003cnamespace>\n\n# Check resource usage\nkubectl top pods -n \u003cnamespace>\n\n# Get recent events\nkubectl get events -n \u003cnamespace> --sort-by='.lastTimestamp'\n```\n\n**Common Causes:**\n- Insufficient replicas\n- Pod restarts/crashes\n- Resource contention\n- Bad deployment\n- External dependency failure\n\n**Resolution Paths:**\n1. Scale up replicas if needed\n2. Roll back the bad deployment\n3. Increase resources if constrained\n4. Fix configuration issues\n5. Implement a circuit breaker for external dependencies\n\n---\n\n### Scenario 3: Node Failure\n\n**Symptoms:**\n- Node reported as NotReady\n- Pods being evicted from node\n- High node resource utilization\n\n**Immediate Actions:**\n1. Identify affected node\n2. Check impact (which pods are running on the node)\n3. Determine if pods need immediate migration\n4. 
Assess if node is recoverable\n\n**Investigation Steps:**\n```bash\n# Get node status\nkubectl get nodes\n\n# Describe the problem node\nkubectl describe node \u003cnode-name>\n\n# Check pods on the node\nkubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=\u003cnode-name>\n\n# SSH to node and check\nssh \u003cnode> \"systemctl status kubelet\"\nssh \u003cnode> \"journalctl -u kubelet -n 100\"\nssh \u003cnode> \"docker ps\" # or containerd\nssh \u003cnode> \"df -h\"\nssh \u003cnode> \"free -m\"\n```\n\n**Common Causes:**\n- Kubelet failure\n- Disk full\n- Memory exhaustion\n- Network issues\n- Hardware failure\n\n**Resolution Paths:**\n1. Recoverable: Fix issue (clean disk, restart services)\n2. Not recoverable: Cordon, drain, and replace node\n3. For critical pods: Manually reschedule if necessary\n4. Update monitoring and alerting based on findings\n\n---\n\n### Scenario 4: Storage Issues\n\n**Symptoms:**\n- PVCs stuck in Pending\n- Pods can't start due to volume issues\n- Data access failures\n\n**Immediate Actions:**\n1. Identify affected PVCs/PVs\n2. Check storage backend health\n3. Verify provisioner status\n4. Assess data integrity risk\n\n**Investigation Steps:**\n```bash\n# Check PVC status\nkubectl get pvc --all-namespaces\n\n# Describe pending PVC\nkubectl describe pvc \u003cpvc-name> -n \u003cnamespace>\n\n# Check PV status\nkubectl get pv\n\n# Check storage class\nkubectl get storageclass\n\n# Check provisioner\nkubectl get pods -n \u003cstorage-namespace>\n\n# Check volume attachments\nkubectl get volumeattachments\n```\n\n**Common Causes:**\n- Storage backend failure/full\n- Provisioner issues\n- Network to storage backend\n- Volume attachment limits reached\n- Corrupted volume\n\n**Resolution Paths:**\n1. Fix storage backend issues\n2. Restart provisioner if needed\n3. Manually provision PV if dynamic provisioning failed\n4. Delete and recreate if volume corrupted\n5. 
Restore from backup if data lost\n\n---\n\n### Scenario 5: Security Incident\n\n**Symptoms:**\n- Unauthorized access detected\n- Suspicious pod behavior\n- Security alerts triggered\n- Unusual network traffic\n\n**Immediate Actions:**\n1. Assess severity and scope\n2. Isolate affected resources\n3. Preserve evidence\n4. Engage security team\n\n**Investigation Steps:**\n```bash\n# Check recent RBAC changes\nkubectl get rolebindings,clusterrolebindings --all-namespaces -o json\n\n# Audit pod security contexts\nkubectl get pods --all-namespaces -o json | jq '.items[].spec.securityContext'\n\n# Check for privileged pods\nkubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.containers[].securityContext.privileged==true)'\n\n# Review service accounts\nkubectl get serviceaccounts --all-namespaces\n\n# Get audit logs\ncat /var/log/kubernetes/audit/audit.log | grep \u003csuspicious-activity>\n```\n\n**Common Causes:**\n- Compromised credentials\n- Vulnerable container image\n- Misconfigured RBAC\n- Exposed secrets\n- Supply chain attack\n\n**Resolution Paths:**\n1. Isolate: Network policies, cordon nodes\n2. Investigate: Audit logs, pod logs, network flows\n3. Remediate: Rotate credentials, patch vulnerabilities\n4. Restore: From known-good state if needed\n5. 
Prevent: Enhanced security policies, monitoring\n\n---\n\n## Diagnostic Commands Cheat Sheet\n\n### Quick Health Check\n```bash\n# Overall cluster health\nkubectl cluster-info\nkubectl get nodes\nkubectl get pods --all-namespaces | grep -v Running\n\n# Component status (older clusters)\nkubectl get componentstatuses\n\n# Recent events\nkubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20\n```\n\n### Pod Diagnostics\n```bash\n# Pod details\nkubectl describe pod \u003cpod> -n \u003cnamespace>\nkubectl get pod \u003cpod> -n \u003cnamespace> -o yaml\n\n# Logs\nkubectl logs \u003cpod> -n \u003cnamespace>\nkubectl logs \u003cpod> -n \u003cnamespace> --previous\nkubectl logs \u003cpod> -c \u003ccontainer> -n \u003cnamespace>\n\n# Interactive debugging\nkubectl exec -it \u003cpod> -n \u003cnamespace> -- /bin/sh\nkubectl debug \u003cpod> -it --image=busybox -n \u003cnamespace>\n```\n\n### Node Diagnostics\n```bash\n# Node details\nkubectl describe node \u003cnode>\nkubectl get node \u003cnode> -o yaml\n\n# Resource usage\nkubectl top nodes\nkubectl top pods --all-namespaces\n\n# Node conditions\nkubectl get nodes -o json | jq '.items[].status.conditions'\n```\n\n### Service & Network Diagnostics\n```bash\n# Service details\nkubectl describe svc \u003cservice> -n \u003cnamespace>\nkubectl get endpoints \u003cservice> -n \u003cnamespace>\n\n# Network policies\nkubectl get networkpolicies --all-namespaces\n\n# DNS testing\nkubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot\n# Then: nslookup \u003cservice>.\u003cnamespace>.svc.cluster.local\n```\n\n### Storage Diagnostics\n```bash\n# PVC and PV status\nkubectl get pvc,pv --all-namespaces\n\n# Storage class\nkubectl get storageclass\nkubectl describe storageclass \u003cstorage-class>\n\n# Volume attachments\nkubectl get volumeattachments\n```\n\n---\n\n## Communication During Incidents\n\n### Internal Communication\n- Use dedicated incident channel\n- Regular status updates (every 30 min)\n- Clear 
roles (incident commander, scribe, experts)\n- Document all actions taken\n\n### External Communication\n- Status page updates\n- Customer notifications\n- Clear expected resolution time\n- Updates on progress\n\n### Post-Incident Communication\n- Incident report\n- Root cause analysis\n- Remediation steps taken\n- Prevention measures\n\n---\n\n## Post-Incident Review Template\n\n### Incident Summary\n- Date and time\n- Duration\n- Severity\n- Services affected\n- User impact\n\n### Timeline\n- Detection time\n- Response time\n- Resolution time\n- Key events during incident\n\n### Root Cause\n- What happened\n- Why it happened\n- Contributing factors\n\n### Resolution\n- What fixed the issue\n- Who fixed it\n- How long it took\n\n### Lessons Learned\n- What went well\n- What could be improved\n- Action items with owners\n\n### Prevention\n- Technical changes\n- Process improvements\n- Monitoring enhancements\n- Documentation updates\n\n---\n\n## Best Practices\n\n### Prevention\n- Regular cluster audits\n- Proactive monitoring and alerting\n- Capacity planning\n- Regular disaster recovery drills\n- Automated backups\n- Security scanning and policies\n\n### Preparedness\n- Document runbooks\n- Practice incident response\n- Keep contact lists updated\n- Maintain up-to-date diagrams\n- Pre-provision debugging tools\n\n### Response\n- Follow structured approach\n- Document everything\n- Communicate clearly\n- Don't panic\n- Think before acting\n- Preserve evidence\n\n### Recovery\n- Verify fix thoroughly\n- Monitor for recurrence\n- Update documentation\n- Conduct post-mortem\n- Implement preventive measures\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":10058,"content_sha256":"d4e24e32446952fa3d8918a8efc2efc5046f26b9fbd6e6b2a168bebf752ca402"},{"filename":"references/performance_troubleshooting.md","content":"# Kubernetes Performance Troubleshooting\n\nSystematic approach to diagnosing and resolving Kubernetes performance issues.\n\n## 
Table of Contents\n\n1. [High Latency Issues](#high-latency-issues)\n2. [CPU Performance](#cpu-performance)\n3. [Memory Performance](#memory-performance)\n4. [Network Performance](#network-performance)\n5. [Storage I/O Performance](#storage-io-performance)\n6. [Application-Level Metrics](#application-level-metrics)\n7. [Cluster-Wide Performance](#cluster-wide-performance)\n\n---\n\n## High Latency Issues\n\n### Symptoms\n- Slow API response times\n- Increased request latency\n- Timeouts\n- Degraded user experience\n\n### Investigation Workflow\n\n**1. Identify the layer with latency:**\n\n```bash\n# Check pod resource usage (requires metrics-server)\nkubectl top pods -n \u003cnamespace>\n\n# Check ingress controller metrics\nkubectl logs -n ingress-nginx \u003cingress-controller-pod> | grep \"request_time\"\n\n# Check application logs for slow requests\nkubectl logs \u003cpod-name> -n \u003cnamespace> | grep -i \"slow\\|timeout\\|latency\"\n```\n\n**2. Profile application performance:**\n\n```bash\n# Get pod metrics\nkubectl top pod \u003cpod-name> -n \u003cnamespace>\n\n# Inspect CPU requests and limits (a tight limit can cause throttling)\nkubectl get pod \u003cpod-name> -n \u003cnamespace> -o json | \\\n jq '.spec.containers[].resources'\n\n# Exec into pod and check application-specific metrics\nkubectl exec -it \u003cpod-name> -n \u003cnamespace> -- /bin/sh\n# Then: curl localhost:8080/metrics (if Prometheus metrics available)\n```\n\n**3. 
Check dependencies:**\n\n```bash\n# Test connectivity to downstream services\nkubectl exec -it \u003cpod-name> -n \u003cnamespace> -- \\\n curl -w \"@curl-format.txt\" -o /dev/null -s http://backend-service\n\n# curl-format.txt content:\n# time_namelookup: %{time_namelookup}\\n\n# time_connect: %{time_connect}\\n\n# time_appconnect: %{time_appconnect}\\n\n# time_pretransfer: %{time_pretransfer}\\n\n# time_redirect: %{time_redirect}\\n\n# time_starttransfer: %{time_starttransfer}\\n\n# time_total: %{time_total}\\n\n```\n\n### Common Causes and Solutions\n\n**CPU Throttling:**\n```yaml\n# Increase CPU limits or remove limits for bursty workloads\nresources:\n requests:\n cpu: \"500m\" # What pod needs typically\n limits:\n cpu: \"2000m\" # Burst capacity (or remove for unlimited)\n```\n\n**Insufficient Replicas:**\n```bash\n# Scale up deployment\nkubectl scale deployment \u003cdeployment-name> -n \u003cnamespace> --replicas=5\n\n# Or enable HPA\nkubectl autoscale deployment \u003cdeployment-name> -n \u003cnamespace> \\\n --cpu-percent=70 \\\n --min=2 \\\n --max=10\n```\n\n**Slow Dependencies:**\n```yaml\n# Implement circuit breakers and timeouts in application\n# Or use service mesh policies (Istio example):\napiVersion: networking.istio.io/v1beta1\nkind: DestinationRule\nmetadata:\n name: backend-circuit-breaker\nspec:\n host: backend-service\n trafficPolicy:\n connectionPool:\n tcp:\n maxConnections: 100\n http:\n http1MaxPendingRequests: 50\n http2MaxRequests: 100\n outlierDetection:\n consecutive5xxErrors: 5 # replaces the deprecated consecutiveErrors field\n interval: 30s\n baseEjectionTime: 30s\n```\n\n---\n\n## CPU Performance\n\n### Symptoms\n- High CPU usage\n- Throttling\n- Slow processing\n- Queue buildup\n\n### Investigation Commands\n\n```bash\n# Check CPU usage\nkubectl top nodes\nkubectl top pods -n \u003cnamespace>\n\n# Inspect CPU requests and limits (a tight limit can cause throttling)\nkubectl get pod \u003cpod-name> -n \u003cnamespace> -o json | \\\n jq '.spec.containers[].resources'\n\n# Get detailed CPU metrics (requires metrics-server)\nkubectl get --raw 
\"/apis/metrics.k8s.io/v1beta1/namespaces/\u003cnamespace>/pods/\u003cpod-name>\" | jq\n\n# Check container-level CPU from node (SSH to node)\nssh \u003cnode> \"docker stats --no-stream\"\n```\n\n### Advanced CPU Profiling\n\n**Enable CPU profiling in application:**\n\n```bash\n# For Go applications with pprof\nkubectl port-forward \u003cpod-name> 6060:6060 -n \u003cnamespace>\n\n# Capture CPU profile\ncurl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof\n\n# Analyze with pprof\ngo tool pprof -http=:8080 cpu.prof\n```\n\n**For Java applications:**\n\n```bash\n# Use async-profiler\nkubectl exec -it \u003cpod-name> -n \u003cnamespace> -- \\\n /profiler.sh -d 30 -f /tmp/flamegraph.html 1\n\n# Copy flamegraph\nkubectl cp \u003cnamespace>/\u003cpod-name>:/tmp/flamegraph.html ./flamegraph.html\n```\n\n### Solutions\n\n**Vertical Scaling:**\n```yaml\nresources:\n requests:\n cpu: \"1000m\" # Increased from 500m\n limits:\n cpu: \"2000m\" # Increased from 1000m\n```\n\n**Horizontal Scaling:**\n```yaml\napiVersion: autoscaling/v2\nkind: HorizontalPodAutoscaler\nmetadata:\n name: app-hpa\nspec:\n scaleTargetRef:\n apiVersion: apps/v1\n kind: Deployment\n name: app\n minReplicas: 3\n maxReplicas: 20\n metrics:\n - type: Resource\n resource:\n name: cpu\n target:\n type: Utilization\n averageUtilization: 70\n```\n\n**Remove CPU Limits for Bursty Workloads:**\n```yaml\n# Allow bursting to available CPU\nresources:\n requests:\n cpu: \"500m\"\n # No limits - can use all available CPU\n```\n\n---\n\n## Memory Performance\n\n### Symptoms\n- OOMKilled pods\n- Memory leaks\n- Slow garbage collection\n- Swap usage (if enabled)\n\n### Investigation Commands\n\n```bash\n# Check memory usage\nkubectl top nodes\nkubectl top pods -n \u003cnamespace>\n\n# Check memory limits and requests\nkubectl describe pod \u003cpod-name> -n \u003cnamespace> | grep -A 5 \"Limits\\|Requests\"\n\n# Check OOM kills\nkubectl get pods -n \u003cnamespace> -o json | \\\n jq '.items[] | 
select(.status.containerStatuses[]?.lastState.terminated.reason == \"OOMKilled\") | .metadata.name'\n\n# Detailed memory breakdown (requires metrics-server)\nkubectl get --raw \"/apis/metrics.k8s.io/v1beta1/namespaces/\u003cnamespace>/pods/\u003cpod-name>\" | \\\n jq '.containers[] | {name, usage: .usage.memory}'\n```\n\n### Memory Profiling\n\n**Heap dump for Java:**\n```bash\n# Capture heap dump\nkubectl exec \u003cpod-name> -n \u003cnamespace> -- \\\n jmap -dump:format=b,file=/tmp/heapdump.hprof 1\n\n# Copy heap dump\nkubectl cp \u003cnamespace>/\u003cpod-name>:/tmp/heapdump.hprof ./heapdump.hprof\n\n# Analyze with Eclipse MAT or VisualVM\n```\n\n**Memory profiling for Go:**\n```bash\n# Capture heap profile\nkubectl port-forward \u003cpod-name> 6060:6060 -n \u003cnamespace>\ncurl http://localhost:6060/debug/pprof/heap > heap.prof\n\n# Analyze\ngo tool pprof -http=:8080 heap.prof\n```\n\n### Solutions\n\n**Increase Memory Limits:**\n```yaml\nresources:\n requests:\n memory: \"512Mi\"\n limits:\n memory: \"2Gi\" # Increased from 1Gi\n```\n\n**Optimize Application:**\n- Fix memory leaks\n- Implement connection pooling\n- Optimize caching strategies\n- Tune garbage collection\n\n**Use Memory-Optimized Node Pools:**\n```yaml\n# Node affinity for memory-intensive workloads\naffinity:\n nodeAffinity:\n requiredDuringSchedulingIgnoredDuringExecution:\n nodeSelectorTerms:\n - matchExpressions:\n - key: workload-type\n operator: In\n values:\n - memory-optimized\n```\n\n---\n\n## Network Performance\n\n### Symptoms\n- High network latency\n- Packet loss\n- Connection timeouts\n- Bandwidth saturation\n\n### Investigation Commands\n\n```bash\n# Check pod network statistics\nkubectl exec \u003cpod-name> -n \u003cnamespace> -- netstat -s\n\n# Test network performance between pods\n# Deploy netperf\nkubectl run netperf-client --image=networkstatic/netperf --rm -it -- /bin/bash\n\n# From client, run:\nnetperf -H \u003ctarget-pod-ip> -t TCP_STREAM\nnetperf -H 
\u003ctarget-pod-ip> -t TCP_RR # Request-response latency\n\n# Check DNS resolution time (run through a shell so `time` is available)\nkubectl exec \u003cpod-name> -n \u003cnamespace> -- \\\n sh -c 'time nslookup service-name.namespace.svc.cluster.local'\n\n# Check service mesh overhead (if using Istio)\nkubectl exec \u003cpod-name> -n \u003cnamespace> -c istio-proxy -- \\\n curl -s localhost:15000/stats | grep \"http.inbound\\|http.outbound\"\n```\n\n### Check Network Policies\n\n```bash\n# List network policies\nkubectl get networkpolicies -n \u003cnamespace>\n\n# Check if policy is blocking traffic\nkubectl describe networkpolicy \u003cpolicy-name> -n \u003cnamespace>\n\n# Temporarily remove policies to test (in non-production)\nkubectl delete networkpolicy \u003cpolicy-name> -n \u003cnamespace>\n```\n\n### Solutions\n\n**DNS Optimization:**\n```bash\n# Use CoreDNS caching\n# Increase CoreDNS replicas\nkubectl scale deployment coredns -n kube-system --replicas=5\n\n# Or use NodeLocal DNSCache\n# https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/\n```\n\n**Optimize Service Mesh:**\n```yaml\n# Pod annotations: reduce Istio sidecar resources if over-provisioned\nsidecar.istio.io/proxyCPU: \"100m\"\nsidecar.istio.io/proxyMemory: \"128Mi\"\n\n# Or disable for internal, trusted services\nsidecar.istio.io/inject: \"false\"\n```\n\n**Use HostNetwork for Network-Intensive Pods:**\n```yaml\n# Use with caution - bypasses pod networking\nspec:\n hostNetwork: true\n dnsPolicy: ClusterFirstWithHostNet\n```\n\n**Enable Bandwidth Limits (QoS):**\n```yaml\nmetadata:\n annotations:\n kubernetes.io/ingress-bandwidth: \"10M\"\n kubernetes.io/egress-bandwidth: \"10M\"\n```\n\n---\n\n## Storage I/O Performance\n\n### Symptoms\n- Slow read/write operations\n- High I/O wait\n- Application timeouts during disk operations\n- Database performance issues\n\n### Investigation Commands\n\n```bash\n# Check I/O metrics on node\nssh \u003cnode> \"iostat -x 1 10\"\n\n# Check disk usage\nkubectl exec \u003cpod-name> -n \u003cnamespace> -- df -h\n\n# 
Check I/O wait from pod\nkubectl exec \u003cpod-name> -n \u003cnamespace> -- top\n\n# Test storage performance\nkubectl exec \u003cpod-name> -n \u003cnamespace> -- \\\n dd if=/dev/zero of=/data/test bs=1M count=1024 conv=fdatasync\n\n# Check PV performance class\nkubectl get pv \u003cpv-name> -o yaml | grep storageClassName\nkubectl describe storageclass \u003cstorage-class-name>\n```\n\n### Storage Benchmarking\n\n**Deploy fio for benchmarking:**\n```yaml\napiVersion: v1\nkind: Pod\nmetadata:\n name: fio-benchmark\nspec:\n containers:\n - name: fio\n image: ljishen/fio\n command: [\"/bin/sh\", \"-c\"]\n args:\n - |\n fio --name=seqread --rw=read --bs=1M --size=1G --runtime=60 --filename=/data/test\n fio --name=seqwrite --rw=write --bs=1M --size=1G --runtime=60 --filename=/data/test\n fio --name=randread --rw=randread --bs=4k --size=1G --runtime=60 --filename=/data/test\n fio --name=randwrite --rw=randwrite --bs=4k --size=1G --runtime=60 --filename=/data/test\n volumeMounts:\n - name: data\n mountPath: /data\n volumes:\n - name: data\n persistentVolumeClaim:\n claimName: test-pvc\n```\n\n### Solutions\n\n**Use Higher Performance Storage Class:**\n```yaml\napiVersion: v1\nkind: PersistentVolumeClaim\nmetadata:\n name: high-performance-pvc\nspec:\n accessModes:\n - ReadWriteOnce\n storageClassName: gp3 # or io2, premium-rwo (GKE), etc.\n resources:\n requests:\n storage: 100Gi\n```\n\n**Provision IOPS (AWS EBS io2):**\n```yaml\napiVersion: storage.k8s.io/v1\nkind: StorageClass\nmetadata:\n name: io2-high-iops\nprovisioner: ebs.csi.aws.com\nparameters:\n type: io2\n iops: \"10000\"\n fsType: ext4\nvolumeBindingMode: WaitForFirstConsumer\n```\n\n**Use Local NVMe for Ultra-Low Latency:**\n```yaml\napiVersion: storage.k8s.io/v1\nkind: StorageClass\nmetadata:\n name: local-nvme\nprovisioner: kubernetes.io/no-provisioner\nvolumeBindingMode: WaitForFirstConsumer\n---\napiVersion: v1\nkind: PersistentVolume\nmetadata:\n name: local-pv\nspec:\n capacity:\n storage: 100Gi\n 
accessModes:\n - ReadWriteOnce\n persistentVolumeReclaimPolicy: Retain\n storageClassName: local-nvme\n local:\n path: /mnt/disks/nvme0n1\n nodeAffinity:\n required:\n nodeSelectorTerms:\n - matchExpressions:\n - key: kubernetes.io/hostname\n operator: In\n values:\n - node-with-nvme\n```\n\n---\n\n## Application-Level Metrics\n\n### Expose Prometheus Metrics\n\n**Add metrics endpoint to application:**\n```yaml\napiVersion: v1\nkind: Service\nmetadata:\n name: app-metrics\n annotations:\n prometheus.io/scrape: \"true\"\n prometheus.io/port: \"8080\"\n prometheus.io/path: \"/metrics\"\nspec:\n selector:\n app: myapp\n ports:\n - name: metrics\n port: 8080\n targetPort: 8080\n```\n\n### Key Metrics to Monitor\n\n**Application metrics:**\n- Request rate\n- Request latency (p50, p95, p99)\n- Error rate\n- Active connections\n- Queue depth\n- Cache hit rate\n\n**Example Prometheus queries:**\n```promql\n# P95 latency\nhistogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))\n\n# Error rate\nsum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))\n\n# Request rate\nsum(rate(http_requests_total[5m]))\n```\n\n### Distributed Tracing\n\n**Implement OpenTelemetry:**\n```yaml\n# Deploy Jaeger\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: jaeger\nspec:\n template:\n spec:\n containers:\n - name: jaeger\n image: jaegertracing/all-in-one:latest\n ports:\n - containerPort: 16686 # UI\n - containerPort: 14268 # Collector\n```\n\n**Instrument application:**\n- Add OpenTelemetry SDK to application\n- Configure trace export to Jaeger\n- Analyze end-to-end request traces to identify bottlenecks\n\n---\n\n## Cluster-Wide Performance\n\n### Cluster Resource Utilization\n\n```bash\n# Overall cluster capacity\nkubectl top nodes\n\n# Total resources\nkubectl describe nodes | grep -A 5 \"Allocated resources\"\n\n# Resource requests vs limits\nkubectl get pods --all-namespaces -o json | \\\n jq -r '.items[] | 
\"\\(.metadata.namespace)/\\(.metadata.name) \\(.spec.containers[].resources)\"'\n```\n\n### Control Plane Performance\n\n```bash\n# Check API server latency\nkubectl get --raw /metrics | grep apiserver_request_duration_seconds\n\n# Check etcd performance\nkubectl exec -it -n kube-system etcd-\u003cnode> -- \\\n etcdctl --endpoints=https://127.0.0.1:2379 \\\n --cacert=/etc/kubernetes/pki/etcd/ca.crt \\\n --cert=/etc/kubernetes/pki/etcd/server.crt \\\n --key=/etc/kubernetes/pki/etcd/server.key \\\n check perf\n\n# Controller manager metrics\nkubectl get --raw /metrics | grep workqueue_depth\n```\n\n### Scheduler Performance\n\n```bash\n# Check scheduler latency\nkubectl get --raw /metrics | grep scheduler_scheduling_duration_seconds\n\n# Check pending pods\nkubectl get pods --all-namespaces --field-selector status.phase=Pending\n\n# Scheduler logs\nkubectl logs -n kube-system kube-scheduler-\u003cnode>\n```\n\n### Solutions for Cluster-Wide Issues\n\n**Scale Control Plane:**\n- Add more control plane nodes\n- Increase API server replicas\n- Tune etcd (increase memory, use SSD)\n\n**Optimize Scheduling:**\n- Use pod priority and preemption\n- Implement pod topology spread constraints\n- Use node affinity/anti-affinity appropriately\n\n**Resource Management:**\n- Set appropriate resource requests and limits\n- Use LimitRanges and ResourceQuotas\n- Implement VerticalPodAutoscaler for right-sizing\n\n---\n\n## Performance Optimization Checklist\n\n### Application Level\n- [ ] Implement connection pooling\n- [ ] Enable response caching\n- [ ] Optimize database queries\n- [ ] Use async/non-blocking I/O\n- [ ] Implement circuit breakers\n- [ ] Profile and optimize hot paths\n\n### Kubernetes Level\n- [ ] Set appropriate resource requests/limits\n- [ ] Use HPA for auto-scaling\n- [ ] Implement readiness/liveness probes correctly\n- [ ] Use anti-affinity for high-availability\n- [ ] Optimize container image size\n- [ ] Use multi-stage builds\n\n### Infrastructure Level\n- [ 
] Use appropriate instance/node types\n- [ ] Enable cluster autoscaling\n- [ ] Use high-performance storage classes\n- [ ] Optimize network topology\n- [ ] Implement monitoring and alerting\n- [ ] Regular performance testing\n\n---\n\n## Monitoring Tools\n\n**Essential tools:**\n- **Prometheus + Grafana**: Metrics and dashboards\n- **Jaeger/Zipkin**: Distributed tracing\n- **kube-state-metrics**: Kubernetes object metrics\n- **node-exporter**: Node-level metrics\n- **cAdvisor**: Container metrics\n- **kubectl-flamegraph**: CPU profiling\n\n**Commercial options:**\n- Datadog\n- New Relic\n- Dynatrace\n- Elastic APM\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":15454,"content_sha256":"d661ca42ed4391317aa2430ff0cb35c8a12bd656d7e098f3f49f0baa44c4041f"},{"filename":"scripts/check_namespace.py","content":"#!/usr/bin/env python3\n\"\"\"\nKubernetes Namespace Health Check\nPerforms comprehensive health diagnostics for a specific namespace\n\"\"\"\nimport argparse\nimport json\nimport subprocess\nimport sys\nfrom typing import Dict, List, Any\nfrom datetime import datetime\n\n\ndef run_kubectl(args: List[str], namespace: str = None) -> Dict[str, Any]:\n \"\"\"Run kubectl command and return parsed JSON\"\"\"\n cmd = ['kubectl'] + args\n if namespace and '-n' not in args and '--namespace' not in args:\n cmd.extend(['-n', namespace])\n\n try:\n result = subprocess.run(\n cmd,\n capture_output=True,\n text=True,\n check=True\n )\n if result.stdout:\n return json.loads(result.stdout)\n return {}\n except subprocess.CalledProcessError as e:\n return {\"error\": e.stderr}\n except json.JSONDecodeError:\n return {\"error\": \"Failed to parse kubectl output\", \"output\": result.stdout}\n\n\ndef check_pods(namespace: str) -> Dict[str, Any]:\n \"\"\"Check pod health in namespace\"\"\"\n pods = run_kubectl(['get', 'pods', '-o', 'json'], namespace)\n\n if 'error' in pods:\n return pods\n\n results = {\n \"total\": 0,\n \"running\": 0,\n \"pending\": 0,\n 
\"failed\": 0,\n        \"succeeded\": 0,\n        \"crashlooping\": 0,\n        \"image_pull_errors\": 0,\n        \"issues\": [],\n        \"healthy_pods\": [],\n        \"unhealthy_pods\": []\n    }\n\n    for pod in pods.get('items', []):\n        name = pod['metadata']['name']\n        phase = pod.get('status', {}).get('phase', 'Unknown')\n        results[\"total\"] += 1\n\n        # Check container statuses\n        container_statuses = pod.get('status', {}).get('containerStatuses', [])\n        restart_count = sum(c.get('restartCount', 0) for c in container_statuses)\n\n        # Categorize pod status\n        if phase == 'Running':\n            results[\"running\"] += 1\n            all_ready = all(c.get('ready', False) for c in container_statuses)\n            if all_ready and restart_count \u003c 5:\n                results[\"healthy_pods\"].append(name)\n            else:\n                if restart_count >= 5:\n                    results[\"crashlooping\"] += 1\n                    results[\"issues\"].append(f\"Pod {name}: High restart count ({restart_count})\")\n                if not all_ready:\n                    results[\"issues\"].append(f\"Pod {name}: Not all containers ready\")\n                # Record the pod once, even when both conditions apply\n                results[\"unhealthy_pods\"].append(name)\n\n        elif phase == 'Pending':\n            results[\"pending\"] += 1\n            results[\"issues\"].append(f\"Pod {name}: Stuck in Pending state\")\n            results[\"unhealthy_pods\"].append(name)\n\n        elif phase == 'Failed':\n            results[\"failed\"] += 1\n            results[\"issues\"].append(f\"Pod {name}: Failed\")\n            results[\"unhealthy_pods\"].append(name)\n\n        elif phase == 'Succeeded':\n            results[\"succeeded\"] += 1\n\n        # Check for image pull failures (ErrImagePull, ImagePullBackOff)\n        for container_status in container_statuses:\n            waiting = container_status.get('state', {}).get('waiting', {})\n            reason = waiting.get('reason', '')\n            if 'ImagePull' in reason:\n                results[\"image_pull_errors\"] += 1\n                if name not in results[\"unhealthy_pods\"]:\n                    results[\"unhealthy_pods\"].append(name)\n                results[\"issues\"].append(f\"Pod {name}: {reason}\")\n\n    return results\n\n\ndef check_services(namespace: str) -> Dict[str, Any]:\n    \"\"\"Check services and their 
endpoints\"\"\"\n    services = run_kubectl(['get', 'services', '-o', 'json'], namespace)\n\n    if 'error' in services:\n        return services\n\n    results = {\n        \"total\": 0,\n        \"with_endpoints\": 0,\n        \"without_endpoints\": 0,\n        \"load_balancers\": 0,\n        \"load_balancers_pending\": 0,\n        \"issues\": []\n    }\n\n    for svc in services.get('items', []):\n        name = svc['metadata']['name']\n        svc_type = svc['spec'].get('type', 'ClusterIP')\n        results[\"total\"] += 1\n\n        # Check endpoints (ExternalName services have none by design)\n        if svc_type != 'ExternalName':\n            endpoints = run_kubectl(['get', 'endpoints', name, '-o', 'json'], namespace)\n            if 'error' not in endpoints:\n                subsets = endpoints.get('subsets') or []\n                if any(s.get('addresses', []) for s in subsets):\n                    results[\"with_endpoints\"] += 1\n                else:\n                    results[\"without_endpoints\"] += 1\n                    results[\"issues\"].append(f\"Service {name}: No endpoints (no pods matching selector)\")\n\n        # Check LoadBalancer status\n        if svc_type == 'LoadBalancer':\n            results[\"load_balancers\"] += 1\n            lb_ingress = svc.get('status', {}).get('loadBalancer', {}).get('ingress', [])\n            if not lb_ingress:\n                results[\"load_balancers_pending\"] += 1\n                results[\"issues\"].append(f\"Service {name}: LoadBalancer stuck in Pending\")\n\n    return results\n\n\ndef check_deployments(namespace: str) -> Dict[str, Any]:\n    \"\"\"Check deployment health\"\"\"\n    deployments = run_kubectl(['get', 'deployments', '-o', 'json'], namespace)\n\n    if 'error' in deployments:\n        return deployments\n\n    results = {\n        \"total\": 0,\n        \"available\": 0,\n        \"unavailable\": 0,\n        \"progressing\": 0,\n        \"issues\": []\n    }\n\n    for deploy in deployments.get('items', []):\n        name = deploy['metadata']['name']\n        results[\"total\"] += 1\n\n        status = deploy.get('status', {})\n        # Compare against the desired replica count from the spec\n        replicas = deploy.get('spec', {}).get('replicas', status.get('replicas', 0))\n        ready_replicas = status.get('readyReplicas', 0)\n        available_replicas = status.get('availableReplicas', 0)\n\n        # Deployments intentionally scaled to zero count as available\n        if available_replicas >= replicas:\n            results[\"available\"] += 1\n        elif available_replicas == 0:\n            results[\"unavailable\"] += 1\n            
results[\"issues\"].append(f\"Deployment {name}: No replicas available ({ready_replicas}/{replicas})\")\n else:\n results[\"progressing\"] += 1\n results[\"issues\"].append(f\"Deployment {name}: Partially available ({available_replicas}/{replicas})\")\n\n return results\n\n\ndef check_pvcs(namespace: str) -> Dict[str, Any]:\n \"\"\"Check PersistentVolumeClaims\"\"\"\n pvcs = run_kubectl(['get', 'pvc', '-o', 'json'], namespace)\n\n if 'error' in pvcs:\n return pvcs\n\n results = {\n \"total\": 0,\n \"bound\": 0,\n \"pending\": 0,\n \"lost\": 0,\n \"issues\": []\n }\n\n for pvc in pvcs.get('items', []):\n name = pvc['metadata']['name']\n phase = pvc.get('status', {}).get('phase', 'Unknown')\n results[\"total\"] += 1\n\n if phase == 'Bound':\n results[\"bound\"] += 1\n elif phase == 'Pending':\n results[\"pending\"] += 1\n results[\"issues\"].append(f\"PVC {name}: Stuck in Pending state\")\n elif phase == 'Lost':\n results[\"lost\"] += 1\n results[\"issues\"].append(f\"PVC {name}: Volume lost\")\n\n return results\n\n\ndef check_resource_quotas(namespace: str) -> Dict[str, Any]:\n \"\"\"Check resource quotas and usage\"\"\"\n quotas = run_kubectl(['get', 'resourcequota', '-o', 'json'], namespace)\n\n if 'error' in quotas:\n return {\"total\": 0, \"issues\": []}\n\n results = {\n \"total\": 0,\n \"near_limit\": [],\n \"exceeded\": [],\n \"issues\": []\n }\n\n for quota in quotas.get('items', []):\n name = quota['metadata']['name']\n results[\"total\"] += 1\n\n status = quota.get('status', {})\n hard = status.get('hard', {})\n used = status.get('used', {})\n\n for resource, limit in hard.items():\n usage = used.get(resource, '0')\n\n # Parse values (handle different formats: CPU, memory, counts)\n try:\n if resource.endswith('memory'):\n # Convert to bytes for comparison\n limit_val = parse_memory(limit)\n usage_val = parse_memory(usage)\n elif resource.endswith('cpu'):\n # Convert to millicores\n limit_val = parse_cpu(limit)\n usage_val = parse_cpu(usage)\n else:\n # 
Plain numbers\n limit_val = int(limit)\n usage_val = int(usage)\n\n if limit_val > 0:\n usage_percent = (usage_val / limit_val) * 100\n\n if usage_percent >= 100:\n results[\"exceeded\"].append(resource)\n results[\"issues\"].append(f\"Quota {name}: {resource} exceeded ({usage}/{limit})\")\n elif usage_percent >= 80:\n results[\"near_limit\"].append(resource)\n results[\"issues\"].append(f\"Quota {name}: {resource} near limit ({usage}/{limit}, {usage_percent:.0f}%)\")\n\n except (ValueError, AttributeError):\n continue\n\n return results\n\n\ndef parse_memory(value: str) -> int:\n \"\"\"Parse memory string to bytes\"\"\"\n units = {'Ki': 1024, 'Mi': 1024**2, 'Gi': 1024**3, 'Ti': 1024**4}\n for unit, multiplier in units.items():\n if value.endswith(unit):\n return int(value[:-2]) * multiplier\n return int(value)\n\n\ndef parse_cpu(value: str) -> int:\n \"\"\"Parse CPU string to millicores\"\"\"\n if value.endswith('m'):\n return int(value[:-1])\n return int(float(value) * 1000)\n\n\ndef get_recent_events(namespace: str, limit: int = 10) -> List[Dict[str, Any]]:\n \"\"\"Get recent events in namespace\"\"\"\n events = run_kubectl(['get', 'events', '--sort-by=.lastTimestamp', '-o', 'json'], namespace)\n\n if 'error' in events:\n return []\n\n recent_events = []\n for event in events.get('items', [])[-limit:]:\n recent_events.append({\n \"type\": event.get('type', 'Unknown'),\n \"reason\": event.get('reason', ''),\n \"message\": event.get('message', ''),\n \"object\": f\"{event.get('involvedObject', {}).get('kind', '')}/{event.get('involvedObject', {}).get('name', '')}\",\n \"count\": event.get('count', 1),\n \"last_timestamp\": event.get('lastTimestamp', '')\n })\n\n return recent_events\n\n\ndef generate_recommendations(results: Dict[str, Any]) -> List[str]:\n \"\"\"Generate actionable recommendations based on findings\"\"\"\n recommendations = []\n\n # Pod recommendations\n if results['pods']['pending'] > 0:\n recommendations.append(\"⚠️ Check pending pods with: 
kubectl describe pod \u003cpod-name> -n \u003cnamespace>\")\n recommendations.append(\"⚠️ Verify node resources: kubectl describe nodes\")\n\n if results['pods']['crashlooping'] > 0:\n recommendations.append(\"⚠️ Investigate crashlooping pods: kubectl logs \u003cpod-name> -n \u003cnamespace> --previous\")\n\n if results['pods']['image_pull_errors'] > 0:\n recommendations.append(\"⚠️ Fix image pull errors: verify image name, check imagePullSecrets\")\n\n # Service recommendations\n if results['services']['without_endpoints'] > 0:\n recommendations.append(\"⚠️ Services without endpoints: check pod selectors match pod labels\")\n\n if results['services']['load_balancers_pending'] > 0:\n recommendations.append(\"⚠️ LoadBalancer stuck: check cloud provider controller logs\")\n\n # Deployment recommendations\n if results['deployments']['unavailable'] > 0:\n recommendations.append(\"⚠️ Unavailable deployments: check pod errors and resource availability\")\n\n # PVC recommendations\n if results['pvcs']['pending'] > 0:\n recommendations.append(\"⚠️ Pending PVCs: verify StorageClass exists and provisioner is working\")\n\n # Quota recommendations\n if results['quotas']['exceeded']:\n recommendations.append(f\"🚨 Resource quotas exceeded: {', '.join(results['quotas']['exceeded'])}\")\n recommendations.append(\"🚨 Action required: increase quota or reduce resource requests\")\n\n if results['quotas']['near_limit']:\n recommendations.append(f\"⚠️ Near quota limits: {', '.join(results['quotas']['near_limit'])}\")\n\n if not recommendations:\n recommendations.append(\"✅ No critical issues detected\")\n\n return recommendations\n\n\ndef main():\n parser = argparse.ArgumentParser(\n description=\"Comprehensive health check for a Kubernetes namespace\",\n formatter_class=argparse.RawDescriptionHelpFormatter,\n epilog=\"\"\"\nExamples:\n # Check namespace with human-readable output\n %(prog)s my-namespace\n\n # Output as JSON\n %(prog)s my-namespace --json\n\n # Include more events\n 
%(prog)s my-namespace --events 20\n \"\"\"\n )\n\n parser.add_argument(\n \"namespace\",\n help=\"Namespace to check\"\n )\n\n parser.add_argument(\n \"--json\",\n action=\"store_true\",\n help=\"Output results as JSON\"\n )\n\n parser.add_argument(\n \"--events\",\n type=int,\n default=10,\n help=\"Number of recent events to include (default: 10)\"\n )\n\n args = parser.parse_args()\n\n # Perform all checks\n results = {\n \"namespace\": args.namespace,\n \"timestamp\": datetime.utcnow().isoformat() + \"Z\",\n \"pods\": check_pods(args.namespace),\n \"services\": check_services(args.namespace),\n \"deployments\": check_deployments(args.namespace),\n \"pvcs\": check_pvcs(args.namespace),\n \"quotas\": check_resource_quotas(args.namespace),\n \"recent_events\": get_recent_events(args.namespace, args.events)\n }\n\n # Generate recommendations\n results[\"recommendations\"] = generate_recommendations(results)\n\n # Determine overall health\n total_issues = (\n len(results[\"pods\"].get(\"issues\", [])) +\n len(results[\"services\"].get(\"issues\", [])) +\n len(results[\"deployments\"].get(\"issues\", [])) +\n len(results[\"pvcs\"].get(\"issues\", [])) +\n len(results[\"quotas\"].get(\"issues\", []))\n )\n\n results[\"health_status\"] = \"healthy\" if total_issues == 0 else \"degraded\" if total_issues \u003c 5 else \"critical\"\n\n if args.json:\n print(json.dumps(results, indent=2))\n else:\n # Human-readable output\n print(f\"🔍 Namespace Health Check: {args.namespace}\")\n print(f\"⏰ Timestamp: {results['timestamp']}\")\n print(f\"📊 Overall Status: {results['health_status'].upper()}\\n\")\n\n # Pods\n print(\"📦 Pods:\")\n print(f\" Total: {results['pods']['total']}\")\n print(f\" Running: {results['pods']['running']}\")\n print(f\" Pending: {results['pods']['pending']}\")\n print(f\" Failed: {results['pods']['failed']}\")\n if results['pods']['crashlooping'] > 0:\n print(f\" ⚠️ CrashLooping: {results['pods']['crashlooping']}\")\n if 
results['pods']['image_pull_errors'] > 0:\n print(f\" ⚠️ ImagePull Errors: {results['pods']['image_pull_errors']}\")\n print()\n\n # Services\n print(\"🌐 Services:\")\n print(f\" Total: {results['services']['total']}\")\n print(f\" With Endpoints: {results['services']['with_endpoints']}\")\n if results['services']['without_endpoints'] > 0:\n print(f\" ⚠️ Without Endpoints: {results['services']['without_endpoints']}\")\n if results['services']['load_balancers_pending'] > 0:\n print(f\" ⚠️ LB Pending: {results['services']['load_balancers_pending']}\")\n print()\n\n # Deployments\n if results['deployments']['total'] > 0:\n print(\"🚀 Deployments:\")\n print(f\" Total: {results['deployments']['total']}\")\n print(f\" Available: {results['deployments']['available']}\")\n if results['deployments']['unavailable'] > 0:\n print(f\" ⚠️ Unavailable: {results['deployments']['unavailable']}\")\n print()\n\n # PVCs\n if results['pvcs']['total'] > 0:\n print(\"💾 PersistentVolumeClaims:\")\n print(f\" Total: {results['pvcs']['total']}\")\n print(f\" Bound: {results['pvcs']['bound']}\")\n if results['pvcs']['pending'] > 0:\n print(f\" ⚠️ Pending: {results['pvcs']['pending']}\")\n print()\n\n # Quotas\n if results['quotas']['total'] > 0:\n print(\"📏 Resource Quotas:\")\n print(f\" Total: {results['quotas']['total']}\")\n if results['quotas']['exceeded']:\n print(f\" 🚨 Exceeded: {', '.join(results['quotas']['exceeded'])}\")\n if results['quotas']['near_limit']:\n print(f\" ⚠️ Near Limit: {', '.join(results['quotas']['near_limit'])}\")\n print()\n\n # Issues\n if total_issues > 0:\n print(f\"⚠️ Issues ({total_issues}):\")\n all_issues = (\n results[\"pods\"].get(\"issues\", []) +\n results[\"services\"].get(\"issues\", []) +\n results[\"deployments\"].get(\"issues\", []) +\n results[\"pvcs\"].get(\"issues\", []) +\n results[\"quotas\"].get(\"issues\", [])\n )\n for issue in all_issues[:10]: # Show first 10\n print(f\" - {issue}\")\n if len(all_issues) > 10:\n print(f\" ... 
and {len(all_issues) - 10} more (use --json for full list)\")\n print()\n\n # Recommendations\n print(\"💡 Recommendations:\")\n for rec in results[\"recommendations\"]:\n print(f\" {rec}\")\n\n sys.exit(0 if results[\"health_status\"] in [\"healthy\", \"degraded\"] else 1)\n\n\nif __name__ == \"__main__\":\n main()\n","content_type":"text/x-python; charset=utf-8","language":"python","size":17643,"content_sha256":"56fffe4def4e12b121f9dcf4e9b81533e7b937b23d8647e0a69d710472d9a2dd"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"Kubernetes Troubleshooter & Incident Response","type":"text"}]},{"type":"paragraph","content":[{"text":"Systematic approach to diagnosing and resolving Kubernetes issues in production environments.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Core Troubleshooting Workflow","type":"text"}]},{"type":"paragraph","content":[{"text":"Follow this systematic approach for any Kubernetes issue:","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"1. Gather Context","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"What is the observed symptom?","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"When did it start?","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"What changed recently (deployments, config, infrastructure)?","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"What is the scope (single pod, service, node, cluster)?","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"What is the business impact (severity level)?","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"2. 
Initial Triage","type":"text"}]},{"type":"paragraph","content":[{"text":"Run cluster health check:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Check node status and health\nkubectl get nodes\n\n# Find non-running pods across all namespaces\nkubectl get pods -A --field-selector status.phase!=Running\n\n# Check node resource usage\nkubectl top nodes","type":"text"}]},{"type":"paragraph","content":[{"text":"This provides an overview of:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Node health status","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Pending and failed pods across all namespaces","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Node resource utilization","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"3. Deep Dive Investigation","type":"text"}]},{"type":"paragraph","content":[{"text":"Based on triage results, focus investigation:","type":"text"}]},{"type":"paragraph","content":[{"text":"For Namespace-Level Issues:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python3 scripts/check_namespace.py \u003cnamespace>","type":"text"}]},{"type":"paragraph","content":[{"text":"This provides comprehensive namespace health:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Pod status (running, pending, failed, crashlooping)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Service health and endpoints","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Deployment availability","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"PVC 
status","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Resource quota usage","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Recent events","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Actionable recommendations","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"For Pod Issues:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Get full pod details (status, events, conditions, resource config)\nkubectl describe pod \u003cpod-name> -n \u003cnamespace>\n\n# Check current and previous container logs\nkubectl logs \u003cpod-name> -n \u003cnamespace>\nkubectl logs \u003cpod-name> -n \u003cnamespace> --previous\n\n# Get events specific to the pod\nkubectl get events -n \u003cnamespace> --field-selector involvedObject.name=\u003cpod-name>","type":"text"}]},{"type":"paragraph","content":[{"text":"This reveals:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Pod phase and readiness","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Container statuses and states","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Restart counts and exit codes","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Recent events and scheduling decisions","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Resource requests and limits","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"For additional investigations:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Check all namespace events: ","type":"text"},{"text":"kubectl get events -n 
\u003cnamespace> --sort-by='.lastTimestamp'","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"4. Identify Root Cause","type":"text"}]},{"type":"paragraph","content":[{"text":"Consult references/common_issues.md for detailed information on:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"ImagePullBackOff / ErrImagePull","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"CrashLoopBackOff","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Pending Pods","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"OOMKilled","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Node issues (NotReady, DiskPressure)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Networking failures","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Storage/PVC issues","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Resource quotas and throttling","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"RBAC permission errors","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Each issue includes:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Symptoms","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Common causes","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Diagnostic commands","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Remediation steps","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Prevention 
strategies","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"5. Apply Remediation","type":"text"}]},{"type":"paragraph","content":[{"text":"Follow remediation steps from common_issues.md based on root cause identified.","type":"text"}]},{"type":"paragraph","content":[{"text":"Always:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Test fixes in non-production first if possible","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Document actions taken","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Monitor for effectiveness","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Have rollback plan ready","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"6. Verify & Monitor","type":"text"}]},{"type":"paragraph","content":[{"text":"After applying fix:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Verify issue is resolved","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Monitor for recurrence (15-30 minutes minimum)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Check related systems","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Update documentation","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Incident Response","type":"text"}]},{"type":"paragraph","content":[{"text":"For production incidents, follow structured response in references/incident_response.md:","type":"text"}]},{"type":"paragraph","content":[{"text":"Severity Assessment:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"SEV-1 
(Critical): Complete outage, data loss, security breach","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"SEV-2 (High): Major degradation, significant user impact","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"SEV-3 (Medium): Minor impairment, workaround available","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"SEV-4 (Low): Cosmetic, minimal impact","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Incident Phases:","type":"text","marks":[{"type":"strong"}]}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Detection","type":"text","marks":[{"type":"strong"}]},{"text":" - Identify and assess","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Triage","type":"text","marks":[{"type":"strong"}]},{"text":" - Determine scope and impact","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Investigation","type":"text","marks":[{"type":"strong"}]},{"text":" - Find root cause","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Resolution","type":"text","marks":[{"type":"strong"}]},{"text":" - Apply fix","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Post-Incident","type":"text","marks":[{"type":"strong"}]},{"text":" - Document and improve","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Common Incident Scenarios:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Complete cluster outage","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Service 
degradation","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Node failure","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Storage issues","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Security incidents","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"See references/incident_response.md for detailed playbooks.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Quick Reference Commands","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Cluster Overview","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"kubectl cluster-info\nkubectl get nodes\nkubectl get pods --all-namespaces | grep -v Running\nkubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Pod Diagnostics","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"kubectl describe pod \u003cpod> -n \u003cnamespace>\nkubectl logs \u003cpod> -n \u003cnamespace>\nkubectl logs \u003cpod> -n \u003cnamespace> --previous\nkubectl exec -it \u003cpod> -n \u003cnamespace> -- /bin/sh\nkubectl get pod \u003cpod> -n \u003cnamespace> -o yaml","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Node Diagnostics","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"kubectl describe node \u003cnode>\nkubectl top nodes\nkubectl top pods --all-namespaces\nssh \u003cnode> \"systemctl status kubelet\"\nssh \u003cnode> \"journalctl -u kubelet -n 100\"","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Service & Network","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"kubectl describe svc \u003cservice> -n 
\u003cnamespace>\nkubectl get endpoints \u003cservice> -n \u003cnamespace>\nkubectl get networkpolicies --all-namespaces","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Storage","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"kubectl get pvc,pv --all-namespaces\nkubectl describe pvc \u003cpvc> -n \u003cnamespace>\nkubectl get storageclass","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Resource & Configuration","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"kubectl describe resourcequota -n \u003cnamespace>\nkubectl describe limitrange -n \u003cnamespace>\nkubectl get rolebindings,clusterrolebindings -n \u003cnamespace>","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Diagnostic Scripts","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"check_namespace.py","type":"text"}]},{"type":"paragraph","content":[{"text":"Namespace-level health check and diagnostics:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Pod health (running, pending, failed, crashlooping, image pull errors)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Service health and endpoints","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Deployment availability status","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"PersistentVolumeClaim status","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Resource quota usage and limits","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Recent namespace events","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Health status 
assessment","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Actionable recommendations","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Usage:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Human-readable output\npython3 scripts/check_namespace.py \u003cnamespace>\n\n# JSON output for automation\npython3 scripts/check_namespace.py \u003cnamespace> --json\n\n# Include more events\npython3 scripts/check_namespace.py \u003cnamespace> --events 20","type":"text"}]},{"type":"paragraph","content":[{"text":"Best used when troubleshooting issues in a specific namespace or assessing overall namespace health.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Cluster-Level Diagnostics (kubectl)","type":"text"}]},{"type":"paragraph","content":[{"text":"For cluster-wide health checks, use kubectl directly:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Node health and status\nkubectl get nodes\nkubectl top nodes\n\n# Find non-running pods across all namespaces\nkubectl get pods -A --field-selector status.phase!=Running\n\n# System pod health\nkubectl get pods -n kube-system","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Pod-Level Diagnostics (kubectl)","type":"text"}]},{"type":"paragraph","content":[{"text":"For detailed pod investigation, use kubectl directly:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Full pod details (status, events, conditions, resource config)\nkubectl describe pod \u003cpod-name> -n \u003cnamespace>\n\n# Current and previous container logs\nkubectl logs \u003cpod-name> -n \u003cnamespace>\nkubectl logs \u003cpod-name> -n \u003cnamespace> --previous\n\n# Events specific to the pod\nkubectl get events -n \u003cnamespace> --field-selector 
involvedObject.name=\u003cpod-name>","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Reference Documentation","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"references/common_issues.md","type":"text"}]},{"type":"paragraph","content":[{"text":"Comprehensive guide to common Kubernetes issues with:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Detailed symptom descriptions","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Root cause analysis","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Step-by-step diagnostic procedures","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Remediation instructions","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Prevention strategies","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Covers:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Pod issues (ImagePullBackOff, CrashLoopBackOff, Pending, OOMKilled)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Node issues (NotReady, DiskPressure)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Networking issues (pod-to-pod communication, service access)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Storage issues (PVC pending, volume mount failures)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Resource issues (quota exceeded, CPU throttling)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Security issues (vulnerabilities, RBAC)","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Read this 
when you identify a specific issue type but need detailed remediation steps.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"references/incident_response.md","type":"text"}]},{"type":"paragraph","content":[{"text":"Structured incident response framework including:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Incident response phases (Detection → Triage → Investigation → Resolution → Post-Incident)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Severity level definitions","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Detailed playbooks for common incident scenarios","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Communication guidelines","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Post-incident review template","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Best practices for prevention, preparedness, response, and recovery","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Read this when responding to production incidents or planning incident response procedures.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"references/performance_troubleshooting.md","type":"text"}]},{"type":"paragraph","content":[{"text":"Comprehensive performance diagnosis and optimization guide covering:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"High Latency Issues","type":"text","marks":[{"type":"strong"}]},{"text":" - API response time, request latency troubleshooting","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"CPU Performance","type":"text","marks":[{"type":"strong"}]},{"text":" - Throttling detection, 
profiling, optimization","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Memory Performance","type":"text","marks":[{"type":"strong"}]},{"text":" - OOM issues, leak detection, heap profiling","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Network Performance","type":"text","marks":[{"type":"strong"}]},{"text":" - Latency, packet loss, DNS resolution","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Storage I/O Performance","type":"text","marks":[{"type":"strong"}]},{"text":" - Disk performance testing, optimization","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Application-Level Metrics","type":"text","marks":[{"type":"strong"}]},{"text":" - Prometheus integration, distributed tracing","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Cluster-Wide Performance","type":"text","marks":[{"type":"strong"}]},{"text":" - Control plane, scheduler, resource utilization","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Read this when:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Investigating slow application response times","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Diagnosing CPU or memory performance issues","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Troubleshooting network latency or connectivity","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Optimizing storage I/O performance","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Setting up performance 
monitoring","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"references/helm_troubleshooting.md","type":"text"}]},{"type":"paragraph","content":[{"text":"Complete guide to Helm troubleshooting including:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Release Issues","type":"text","marks":[{"type":"strong"}]},{"text":" - Stuck releases, missing resources, state problems","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Installation Failures","type":"text","marks":[{"type":"strong"}]},{"text":" - Chart conflicts, validation errors, template rendering","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Upgrade and Rollback","type":"text","marks":[{"type":"strong"}]},{"text":" - Failed upgrades, immutable field errors, rollback procedures","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Values and Configuration","type":"text","marks":[{"type":"strong"}]},{"text":" - Values not applied, parsing errors, secret handling","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Chart Dependencies","type":"text","marks":[{"type":"strong"}]},{"text":" - Dependency updates, version conflicts, subchart values","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Hooks and Lifecycle","type":"text","marks":[{"type":"strong"}]},{"text":" - Hook failures, cleanup issues","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Repository Issues","type":"text","marks":[{"type":"strong"}]},{"text":" - Chart access problems, version mismatches","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Read this when:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Working with 
Helm-deployed applications","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Troubleshooting chart installations or upgrades","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Debugging Helm release states","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Managing chart dependencies","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"k8s-troubleshooter","author":"@skillopedia","source":{"stars":165,"repo_name":"devops-claude-skills","origin_url":"https://github.com/ahmedasmar/devops-claude-skills/blob/HEAD/k8s-troubleshooter/skills/SKILL.md","repo_owner":"ahmedasmar","body_sha256":"ed64c108a47f0f1c17ff77f8ba7d776862f21e422f3934d120fc9c06ee2664cf","cluster_key":"9598ccd16a235b48fd4d69111ae61e02ee7156ae56c09fe5edab99ca231e9c02","clean_bundle":{"format":"clean-skill-bundle-v1","source":"ahmedasmar/devops-claude-skills/k8s-troubleshooter/skills/SKILL.md","attachments":[{"id":"d436fa72-2f3b-53ce-af28-9a2835a6b779","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/d436fa72-2f3b-53ce-af28-9a2835a6b779/attachment.md","path":"references/common_issues.md","size":17302,"sha256":"e50967993f02670b8a73553068f629738dac6c57f120d1151ea8e65044da7e4d","contentType":"text/markdown; charset=utf-8"},{"id":"b327e450-4b8f-568b-a496-807241608b43","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/b327e450-4b8f-568b-a496-807241608b43/attachment.md","path":"references/helm_troubleshooting.md","size":16570,"sha256":"ff7f272643a36106c832321eff0f2665b2752f4417c4688823ebbdce857e0cb4","contentType":"text/markdown; 
charset=utf-8"},{"id":"68284545-b335-53f2-8aa2-39d457bbab0b","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/68284545-b335-53f2-8aa2-39d457bbab0b/attachment.md","path":"references/incident_response.md","size":10058,"sha256":"d4e24e32446952fa3d8918a8efc2efc5046f26b9fbd6e6b2a168bebf752ca402","contentType":"text/markdown; charset=utf-8"},{"id":"22aad688-0284-5552-af5b-182cfff2e400","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/22aad688-0284-5552-af5b-182cfff2e400/attachment.md","path":"references/performance_troubleshooting.md","size":15454,"sha256":"d661ca42ed4391317aa2430ff0cb35c8a12bd656d7e098f3f49f0baa44c4041f","contentType":"text/markdown; charset=utf-8"},{"id":"7cf68b54-d4a6-5779-83dd-8638461e7bd7","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/7cf68b54-d4a6-5779-83dd-8638461e7bd7/attachment.py","path":"scripts/check_namespace.py","size":17643,"sha256":"56fffe4def4e12b121f9dcf4e9b81533e7b937b23d8647e0a69d710472d9a2dd","contentType":"text/x-python; charset=utf-8"}],"bundle_sha256":"26c00b072b44034e0025d64961360a7e96bca3b7d5744a5d2ded3b21c1d915eb","attachment_count":5,"text_attachments":5,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":0,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":"k8s-troubleshooter/skills/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"devops-infrastructure","category_label":"DevOps"},"exact_dupes_collapsed_into_this":0},"version":"v1","category":"devops-infrastructure","import_tag":"clean-skills-v1","description":"Systematic Kubernetes troubleshooting and incident response. Use this skill whenever the user mentions Kubernetes, K8s, kubectl, pods, containers, or clusters. 
Triggers include diagnosing CrashLoopBackOff, ImagePullBackOff, OOMKilled, or Pending pods, responding to production incidents, troubleshooting node NotReady or DiskPressure, debugging service connectivity or networking, investigating PVC or storage failures, analyzing performance degradation, checking cluster health, troubleshooting Helm releases, and conducting post-incident reviews."}},"renderedAt":1782981856704}
