vllm-server — Skillopedia

vLLM Server Management

Deploy production-grade LLM inference servers with vLLM, a high-throughput open-source LLM serving engine built on PagedAttention and continuous batching.

When to Use This Skill

Use this skill when:

- Serving open-source LLMs (Llama, Mistral, Qwen, Gemma) at scale
- Building an OpenAI-compatible API endpoint for self-hosted models
- Optimizing LLM throughput and latency for production traffic
- Running multi-GPU inference with tensor or pipeline parallelism
- Deploying quantized models to reduce GPU memory requirements

Prerequisites

- NVIDIA GPU(s) with CUDA 12.1+ (A100/H100 rec…