Model Pruning: Compressing LLMs

When to Use This Skill

Use Model Pruning when you need to:

- Reduce model size by 40-60% with <1% accuracy loss
- Accelerate inference using hardware-friendly sparsity (2-4× speedup)
- Deploy on constrained hardware (mobile, edge devices)
- Compress without retraining using one-shot methods
- Enable efficient serving with reduced memory footprint

Key Techniques: Wanda (weights × activations), SparseGPT (second-order), structured pruning, N:M sparsity

Papers: Wanda ICLR 2024 (arXiv 2306.11695), SparseGPT (arXiv 2301.00774)

Installation

Quick Start

Wanda Prunin…
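To make the "weights × activations" idea concrete, here is a minimal NumPy sketch of the Wanda importance score for a single linear layer, plus an N:M variant. It is an illustration under stated assumptions, not the reference implementation: the function names (`wanda_scores`, `prune_unstructured`, `prune_n_m`) are hypothetical, the calibration activations are random stand-ins, and a real run would collect activations from the model on calibration text.

```python
import numpy as np


def wanda_scores(weight, activations):
    """Wanda importance score: |W_ij| * ||X_j||_2.

    weight:      (out_features, in_features) layer weights
    activations: (num_tokens, in_features) calibration inputs to the layer
    """
    # L2 norm of each input feature across all calibration tokens.
    feature_norms = np.linalg.norm(activations, axis=0)  # (in_features,)
    # Broadcasting scales every column j of |W| by the norm of feature j.
    return np.abs(weight) * feature_norms


def prune_unstructured(weight, activations, sparsity=0.5):
    """Zero the lowest-scoring fraction of weights within each output row."""
    scores = wanda_scores(weight, activations)
    num_prune = int(weight.shape[1] * sparsity)
    # Column indices of the num_prune smallest scores in every row.
    prune_idx = np.argsort(scores, axis=1)[:, :num_prune]
    pruned = weight.copy()
    np.put_along_axis(pruned, prune_idx, 0.0, axis=1)
    return pruned


def prune_n_m(weight, activations, n=2, m=4):
    """N:M sparsity: keep the n highest-scoring weights in each group of m
    consecutive input weights (2:4 is the common hardware-friendly case)."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "in_features must be divisible by m"
    scores = wanda_scores(weight, activations).reshape(out_features, -1, m)
    # Indices of the (m - n) lowest scores inside every group.
    prune_idx = np.argsort(scores, axis=2)[:, :, : m - n]
    pruned = weight.copy().reshape(out_features, -1, m)
    np.put_along_axis(pruned, prune_idx, 0.0, axis=2)
    return pruned.reshape(out_features, in_features)


# Toy data standing in for a real layer and real calibration activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(32, 16))

W_half = prune_unstructured(W, X, sparsity=0.5)
W_24 = prune_n_m(W, X, n=2, m=4)

# A small weight on a high-activation feature can outrank a larger weight
# on a low-activation feature: scores are [1*10, 2*1] = [10, 2].
tiny = prune_unstructured(np.array([[1.0, 2.0]]), np.array([[10.0, 1.0]]), 0.5)
```

Scores are compared within each output row rather than across the whole matrix, which follows the per-output grouping described in the Wanda paper. Surviving weights are left unchanged, so no retraining or weight update is involved in this sketch.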