ml-inference-optimization

ML Inference Optimization

Overview

Use this skill for reducing inference latency, increasing throughput, shrinking memory, lowering cost, and deploying optimized models safely. Optimization must be benchmark-driven: define workload, input shapes, concurrency, SLOs, hardware, runtime, numerical tolerance, and quality metrics before changing the model.

Optimization Workflow

1. Establish a baseline with realistic data, preprocessing, postprocessing, network overhead, warmup, and concurrency.
2. Profile bottlenecks: CPU preprocessing, model compute, memory bandwidth, GPU utilization, serializatio…
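
The baseline step of the workflow (realistic inputs, warmup, fixed concurrency) could be sketched roughly as below. This is a minimal illustration using only the Python standard library, not a definitive harness. The names `benchmark`, `percentile`, and `fake_predict` are hypothetical; `fake_predict` stands in for the real end-to-end call (preprocessing, model, postprocessing, and any network hop), and the thread pool assumes the real runtime releases the GIL while it computes.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, with q in [0, 100]."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def benchmark(predict, inputs, warmup=10, concurrency=4):
    """Time predict(x) for every input at a fixed level of concurrency."""
    # Warmup calls are not measured: they let caches, lazy initialization,
    # and any JIT compilation settle before timing starts.
    for x in inputs[:warmup]:
        predict(x)

    def timed_call(x):
        start = time.perf_counter()
        predict(x)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, inputs))
    wall = time.perf_counter() - wall_start

    return {
        "requests": len(latencies),
        "p50_ms": percentile(latencies, 50) * 1e3,
        "p95_ms": percentile(latencies, 95) * 1e3,
        "p99_ms": percentile(latencies, 99) * 1e3,
        "throughput_rps": len(latencies) / wall,
    }


def fake_predict(x):
    # Hypothetical stand-in for the real inference path; replace with the
    # actual call under test.
    time.sleep(0.002)
    return x * 2


if __name__ == "__main__":
    print(benchmark(fake_predict, list(range(200)), warmup=10, concurrency=4))
```

Reporting tail percentiles alongside throughput keeps the baseline comparable against latency SLOs after each change. The inputs, warmup count, and concurrency level should mirror the production workload rather than the defaults shown here.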