# RAG Caching

## What to Cache Where

| Layer | Key | Value | TTL |
|---|---|---|---|
| Embedding cache | | vector | infinite (deterministic) |
| Retrieval cache | | list[chunk id] | minutes to hours |
| Semantic query cache | normalized query embedding | final answer | minutes |
| Rerank cache | | reordered ids | short |
| Prompt cache (provider) | prefix tokens | decoder KV | provider-managed |
| Web/upstream cache | URL | bytes | per Cache-Control |

Rule: cache deterministic steps with stable inputs (embedding, exact retrieval) eagerly; cache probabilistic outputs (generation) only with semantic…
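
As a rough illustration of the deterministic layers in the table, the sketch below pairs an embedding cache that never expires with a retrieval cache that uses a TTL. The key scheme (`content_key` hashing the model name and text), the `TTLCache` class, the `cached_embed` helper, and the 15-minute TTL are all assumptions made for this example; the table does not specify them, and a production setup would more likely use an external store such as Redis.

```python
import hashlib
import time
from typing import Callable, Dict, List, Optional, Tuple


def content_key(*parts: str) -> str:
    """Hash the parts into a stable cache key (hypothetical key scheme)."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    return h.hexdigest()


class TTLCache:
    """In-memory cache; ttl_seconds=None means entries never expire."""

    def __init__(
        self,
        ttl_seconds: Optional[float],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str) -> Optional[object]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.ttl is not None and self.clock() - stored_at > self.ttl:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def put(self, key: str, value: object) -> None:
        self._store[key] = (self.clock(), value)


# Deterministic step: same model + same text gives the same vector, so no TTL.
embedding_cache = TTLCache(ttl_seconds=None)
# Retrieval results go stale when the index changes, so they get a TTL
# (15 minutes here is an arbitrary illustrative value).
retrieval_cache = TTLCache(ttl_seconds=15 * 60)


def cached_embed(
    model: str, text: str, embed_fn: Callable[[str], List[float]]
) -> List[float]:
    """Return the cached vector, calling embed_fn only on a miss."""
    key = content_key(model, text)
    vector = embedding_cache.get(key)
    if vector is None:
        vector = embed_fn(text)
        embedding_cache.put(key, vector)
    return vector
```

Including the model name in the key is a deliberate choice in this sketch: switching embedding models then produces misses instead of returning vectors from the old model.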