turboquant-pytorch — Skillopedia

TurboQuant PyTorch

Skill by ara.so, from the Daily 2026 Skills collection.

From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for compressing LLM KV caches. Achieves 5x compression at 3-bit with 99.5% attention fidelity via two-stage vector quantization.

What It Does

TurboQuant compresses LLM key-value caches to 2–4 bits per coordinate:

- Stage 1: Random orthogonal rotation + Lloyd-Max scalar quantization (MSE-optimal)
- Stage 2: QJL residual correction, a 1-bit sign projection that makes inner product estimates unbiased

Result: attention scores remain accurate even when individu…
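
The two stages listed above can be illustrated with a minimal sketch. This is not the repository's API: every function name here (`random_rotation`, `lloyd_max_codebook`, `stage1_quantize`, `stage1_dequantize`, `qjl_encode`, `qjl_inner_product`) is hypothetical. The sketch assumes that keys are normalized before rotation, that the Lloyd-Max codebook is fitted on standard normal samples, and that the residual correction uses the standard Gaussian sign-projection identity E[(s·q) sign(s·r)] = sqrt(2/π) (q·r)/‖r‖. The actual implementation may differ in all of these details.

```python
import math
import torch

def random_rotation(d, gen):
    # QR of a Gaussian matrix yields a random orthogonal matrix.
    q, r = torch.linalg.qr(torch.randn(d, d, generator=gen))
    return q * torch.sign(torch.diagonal(r))  # fix column signs

def lloyd_max_codebook(bits, gen, iters=30, n=20_000):
    # 1-D Lloyd iterations on standard normal samples (assumed target distribution).
    x = torch.randn(n, generator=gen)
    levels = torch.quantile(x, (torch.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        idx = (x[:, None] - levels[None, :]).abs().argmin(dim=1)
        for j in range(levels.numel()):
            cell = x[idx == j]
            if cell.numel() > 0:
                levels[j] = cell.mean()  # centroid update
    return levels

def stage1_quantize(k, rot, levels):
    # Rotate unit-norm keys; after scaling by sqrt(d) coordinates are roughly N(0, 1).
    d = k.shape[-1]
    norm = k.norm(dim=-1, keepdim=True)
    z = (k / norm) @ rot.T * math.sqrt(d)
    idx = (z[..., None] - levels).abs().argmin(dim=-1)  # nearest codebook level
    return idx, norm

def stage1_dequantize(idx, norm, rot, levels):
    d = idx.shape[-1]
    return (levels[idx] / math.sqrt(d)) @ rot * norm  # undo scaling and rotation

def qjl_encode(residual, proj):
    # Keep only the sign of each random projection, plus the residual norm.
    return torch.sign(residual @ proj.T), residual.norm(dim=-1, keepdim=True)

def qjl_inner_product(q, signs, res_norm, proj):
    # Estimate of <q, residual> from 1-bit signs, using the identity in the lead-in.
    m = proj.shape[0]
    return math.sqrt(math.pi / 2) / m * res_norm.squeeze(-1) * (signs @ (proj @ q))

if __name__ == "__main__":
    gen = torch.Generator().manual_seed(0)
    d, n_keys, bits = 64, 256, 3
    rot = random_rotation(d, gen)
    levels = lloyd_max_codebook(bits, gen)
    proj = torch.randn(d, d, generator=gen)  # m = d sign bits per key (an assumption)

    keys = torch.randn(n_keys, d, generator=gen)
    query = torch.randn(d, generator=gen)

    idx, norm = stage1_quantize(keys, rot, levels)
    k_hat = stage1_dequantize(idx, norm, rot, levels)
    signs, res_norm = qjl_encode(keys - k_hat, proj)

    exact = keys @ query
    stage1_only = k_hat @ query
    corrected = stage1_only + qjl_inner_product(query, signs, res_norm, proj)
    print("mean signed error, stage 1 only:", (stage1_only - exact).mean().item())
    print("mean signed error, with QJL:   ", (corrected - exact).mean().item())
```

Note the trade-off this sketch exposes: the sign-projection term removes the systematic bias of the Stage 1 estimate, but with a small number of projection rows it adds variance, so a single corrected score is not necessarily closer to the exact value than the uncorrected one.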