gpu-kubernetes-operations

GPU Kubernetes Operations

Run resilient and cost-efficient GPU clusters for production AI workloads.

When to Use This Skill

- Setting up GPU node pools in Kubernetes for AI inference or training
- Configuring NVIDIA device plugin and GPU operator
- Implementing MIG partitioning to share GPUs across workloads
- Building GPU-aware autoscaling policies
- Monitoring GPU health with DCGM and Prometheus
- Troubleshooting GPU scheduling, driver, or OOM issues

Prerequisites

- Kubernetes 1.28+ cluster with GPU-capable nodes
- NVIDIA GPUs (A10, L4, A100, H100, or similar)
- NVIDIA drivers installed on…