LLM Inference Scaling

Scale LLM inference horizontally on Kubernetes with GPU-aware autoscaling, request queuing, and cost-efficient spot instance strategies.

When to Use This Skill

Use this skill when:

- LLM API traffic is unpredictable and you need to scale up/down automatically
- Managing a fleet of vLLM or TGI inference pods on Kubernetes
- Reducing inference costs with spot/preemptible GPU instances
- Implementing queue-based autoscaling for batch inference jobs
- Building a multi-model serving platform that shares GPU resources

Prerequisites

- Kubernetes cluster with GPU nodes (NVIDIA o…