# Batch Inference for RAG

Batch APIs cut cost 50% and unlock much higher throughput than the real-time path. Use them wherever a 24-hour SLA is acceptable: initial ingest, re-embed after model swap, bulk metadata extraction, nightly eval runs, large-scale query replay.

## When to Batch vs Stream

| Workload | Batch? |
|---|---|
| Initial corpus embedding (millions of chunks) | Yes |
| Re-embed after model swap | Yes |
| Nightly eval run on golden set | Yes |
| Extracting entities/summaries/keywords during ingest | Yes |
| User-facing chat query | No |
| Real-time retrieval embedding (single query)…