awq-quantization — Skillopedia

AWQ (Activation-aware Weight Quantization) is a 4-bit weight quantization method that preserves salient weights based on activation patterns, achieving roughly a 3x speedup over FP16 with minimal accuracy loss.

When to use AWQ

Use AWQ when you:
- Need 4-bit quantization with <5% accuracy loss
- Are deploying instruction-tuned or chat models (AWQ generalizes better)
- Want a 2.5-3x inference speedup over FP16
- Use vLLM for production serving
- Have Ampere or newer GPUs (A100, H100, RTX 40xx) for Marlin kernel support

Use GPTQ instead when you:
- Need maximum ecosystem compatibility (more tools support GPTQ)
- Are working with the ExLlamaV2 backend specifical…
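The core idea behind AWQ, protecting salient weights by looking at activation statistics, can be illustrated with a small pure-Python sketch. It assumes a simplified form of the AWQ scale search: each input channel j gets a scale s_j = mean|x_j| ** alpha computed from calibration inputs, the weight columns are multiplied by s before group-wise asymmetric 4-bit round-to-nearest quantization, the inputs are divided by s, and alpha is grid-searched over [0, 1] to minimize the layer's output error. All function names here (`quantize_group`, `awq_search_scales`, and so on) are hypothetical and are not the API of AutoAWQ or any other library. Real implementations work on tensors, fold the scales into the preceding operation so inference pays no extra cost, and also search weight clipping ranges; this sketch omits all of that.

```python
import random


def quantize_group(vals, n_bits=4):
    """Asymmetric round-to-nearest fake quantization of one group.

    Returns the dequantized values so the quantization error can be measured.
    """
    qmax = 2 ** n_bits - 1
    lo, hi = min(vals), max(vals)
    scale = max(hi - lo, 1e-8) / qmax
    zero = round(-lo / scale)
    return [(min(max(round(v / scale) + zero, 0), qmax) - zero) * scale for v in vals]


def quantize_weight(W, group_size):
    """Fake-quantize each output row of W in groups of `group_size` input channels."""
    out = []
    for row in W:
        qrow = []
        for start in range(0, len(row), group_size):
            qrow.extend(quantize_group(row[start:start + group_size]))
        out.append(qrow)
    return out


def linear(W, x):
    """y = W x for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def output_error(W_q, W, X, s):
    """Mean squared output error of Q(W diag(s)) (x / s) against the FP reference W x."""
    total = 0.0
    for x in X:
        ref = linear(W, x)
        approx = linear(W_q, [xi / sj for xi, sj in zip(x, s)])
        total += sum((a - b) ** 2 for a, b in zip(approx, ref))
    return total / len(X)


def awq_search_scales(W, X, group_size=128, n_grid=20):
    """Grid-search alpha in s_j = mean|x_j| ** alpha; return (error, alpha, scales)."""
    in_features = len(W[0])
    # Activation-aware signal: average magnitude of each input channel on calibration data.
    act_mag = [sum(abs(x[j]) for x in X) / len(X) for j in range(in_features)]
    best = None
    for i in range(n_grid + 1):
        alpha = i / n_grid  # alpha = 0 reproduces plain round-to-nearest (all scales 1)
        s = [max(m, 1e-8) ** alpha for m in act_mag]
        norm = (max(s) * min(s)) ** 0.5  # keep scales centred around 1
        s = [sj / norm for sj in s]
        # Scale salient input channels up before quantizing; inputs are divided by s.
        W_scaled = [[w * sj for w, sj in zip(row, s)] for row in W]
        err = output_error(quantize_weight(W_scaled, group_size), W, X, s)
        if best is None or err < best[0]:
            best = (err, alpha, s)
    return best


if __name__ == "__main__":
    random.seed(0)
    W = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
    # Calibration inputs where channel 2 carries large activations (a "salient" channel).
    X = [[random.gauss(0, 1) * (10.0 if j == 2 else 1.0) for j in range(8)] for _ in range(16)]
    err, alpha, s = awq_search_scales(W, X, group_size=4)
    rtn = output_error(quantize_weight(W, 4), W, X, [1.0] * 8)
    print(f"RTN error {rtn:.4f}, AWQ-style error {err:.4f} at alpha={alpha:.2f}")
```

Because alpha = 0 makes every scale exactly 1, the search can never do worse than plain round-to-nearest on the calibration data. When one input channel has much larger activations, a nonzero alpha often lowers the output error, since enlarging that channel's weights shrinks their relative quantization error.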