Vast.ai Incident Runbook

Overview

Rapid incident response procedures for Vast.ai GPU instance failures. Covers triage, mitigation, recovery, and postmortem for common incident types: spot preemption, instance crashes, GPU failures, and billing issues.

Prerequisites

- Vast.ai CLI access
- SSH access to instances (if still running)
- Checkpoint storage accessible (S3/GCS)

Instructions

Triage: Assess Impact (< 2 minutes)

Incident Type 1: Spot Preemption

Symptoms: Instance status changes from to or without user action.

Incident Type 2: Training Job Crash

Symptoms: Instance running but training…