tensorrt-llm — Skillopedia

# TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

## When to use TensorRT-LLM

Use TensorRT-LLM when:

- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes

Use vLLM instead when:

- Need simpler setup and Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or non-NVIDIA hardware

Use llama.cpp instead wh…