TEI, Triton, and NIM for Embedding/Reranker Serving

Embeddings and rerankers are the hottest-path CPU/GPU work in RAG. Serving them in-process or behind a plain Flask/FastAPI app leaves 3–10x throughput on the table. Three production paths:

1. TEI (Text Embeddings Inference): Hugging Face's purpose-built server for sentence-transformers/BGE/E5 and cross-encoder rerankers. Rust core, dynamic batching, ONNX + FP16 + CUDA kernels. Easiest win.
2. NVIDIA Triton: general model server. Use when you need to co-serve embedding + reranker + small LLM (or ASR/CV models) behind one gateway, with e…