Benchmark Sandbox: Remote Eval via Vercel Sandboxes

Run benchmark scenarios inside Vercel Sandboxes, which are ephemeral Firecracker microVMs with node24. Each sandbox gets a fresh install of Claude Code, the Vercel CLI, and agent-browser, has the local vercel-plugin uploaded, and runs a 3-phase eval pipeline:

- Phase 1 (BUILD): Claude Code builds the app.
- Phase 2 (VERIFY): A follow-up Claude Code session walks through the user stories, fixing issues until all pass (20 min timeout).
- Phase 3 (DEPLOY): A third Claude Code session links to vercel-labs, runs the deploy, and fixes build errors (up to 3 retries).

D…