# llama.cpp

Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

## When to use llama.cpp

Use llama.cpp when:
- Running on CPU-only machines
- Deploying on Apple Silicon (M1/M2/M3/M4)
- Using AMD or Intel GPUs (no CUDA)
- Edge deployment (Raspberry Pi, embedded systems)
- Need simple deployment without Docker/Python

Use TensorRT-LLM instead when:
- Have NVIDIA GPUs (A100/H100)
- Need maximum throughput (100K+ tok/s)
- Running in datacenter with CUDA

Use vLLM instead when:
- Have NVIDIA GPUs
- Need Python-first API
- Want PagedAttention

## Quick start

Install…