coreweave-observability

# CoreWeave Observability

## Overview

CoreWeave runs GPU-intensive workloads on Kubernetes where hardware failures, memory exhaustion, and underutilization directly impact cost and reliability. Observability must cover DCGM GPU metrics, Kubernetes pod health, inference latency, and job completion rates. Proactive monitoring prevents wasted spend on idle GPUs and catches OOM conditions before they cascade.

## Key Metrics

| Metric | Type | Target | Alert Threshold |
|--------|------|--------|-----------------|
| GPU utilization | Gauge | 60% | < 20% for 30m |
| GPU memory usage | Gauge | < 85% | 95% fo…