Architecture · Mar 2026 · 10 min read

How we provision agent VMs in under 30 seconds

A deep dive into our pre-warming pool, snapshot-based provisioning, block-level lazy loading, and the networking layer that makes sub-30-second deploys possible at scale.

Speed is everything when deploying AI agents. If deploying a new iteration of your agent takes 10 minutes due to heavy Docker image builds and pull times, your development feedback loop dies. We set a seemingly impossible engineering requirement early on: deploying a brand new agent VM, complete with gigabytes of model weights and custom tools, must happen in under 30 seconds.

The Cold Boot Problem

Traditional VMs take minutes to spin up, initialize the kernel, and boot user-space. Even lightweight containers can be agonizingly slow when they are forced to pull multi-gigabyte LLM model weights (like Llama 3 70B or Hermes) from blob storage before they can serve a single request.
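To see why the pull alone is a dealbreaker, a quick back-of-envelope calculation helps. The figures below are illustrative assumptions, not measured numbers:

```python
def pull_time_seconds(weights_gb: float, throughput_gbps: float) -> float:
    """Seconds to download weights_gb gigabytes at throughput_gbps gigabits/s."""
    return weights_gb * 8 / throughput_gbps

# Llama 3 70B at fp16 is roughly 140 GB of weights.
# Even at a sustained 10 Gbit/s from blob storage:
print(f"{pull_time_seconds(140, 10):.0f}s")  # 112s of pure transfer
```

And that is before the kernel boots, the runtime starts, or a single weight is mapped into GPU memory.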

Here is how we bypass the traditional cold boot path:

  • Pre-warmed pools: We maintain a large pool of Firecracker microVMs across our edge regions, held in a paused state.
  • Memory snapshotting: The weights for our tier-1 models are already loaded into memory snapshots, with the KV-cache pre-allocated.
  • Virtio-fs lazy loading: For custom user tools, Python dependencies, and smaller arbitrary model weights, we use a block-level lazy-loading system via virtio-fs. The VM boots instantly, and the filesystem fetches chunks over the network only when the Linux kernel requests them.
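The pool mechanics above can be sketched in a few lines. This is a minimal illustration, not our actual API; names like `PooledVM` and `WarmPool` are hypothetical:

```python
import queue

class PooledVM:
    """A paused microVM restored from a shared memory snapshot."""
    def __init__(self, vm_id: str):
        self.vm_id = vm_id
        self.paused = True

    def resume(self) -> None:
        # In the real system this would issue a Firecracker resume
        # over the VM's API socket; here it just flips a flag.
        self.paused = False

class WarmPool:
    """Hands out pre-warmed VMs; a background task keeps the pool topped up."""
    def __init__(self, target_size: int):
        self._vms: queue.Queue[PooledVM] = queue.Queue()
        self.target_size = target_size

    def refill(self) -> None:
        while self._vms.qsize() < self.target_size:
            # Snapshot restore happens off the request path, here.
            self._vms.put(PooledVM(f"vm-{self._vms.qsize()}"))

    def claim(self) -> PooledVM:
        # O(1) pop on the hot path: no kernel boot, no image pull.
        vm = self._vms.get_nowait()
        vm.resume()
        return vm

pool = WarmPool(target_size=4)
pool.refill()
vm = pool.claim()
print(vm.vm_id, vm.paused)  # vm-0 False
```

The key design point is that all the expensive work (snapshot restore, weight loading) happens during `refill()`, off the request path; `claim()` only pops and resumes.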

Achieving 28s P99 Deployments

By combining these techniques, a `deploy()` call doesn't build a machine from scratch. It doesn't even boot a kernel. It resumes a paused microVM from a generic memory state, stitches in your specific API keys via a secure metadata service, injects your tool definitions via the virtual filesystem, and attaches an active ENI (Elastic Network Interface) to our global overlay network.
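The four steps above can be sketched as a pipeline. Everything here is a stand-in: `MicroVM`, `claim_prewarmed_vm`, and the field names are hypothetical placeholders for our internal services:

```python
from dataclasses import dataclass, field

@dataclass
class MicroVM:
    vm_id: str
    metadata: dict = field(default_factory=dict)
    files: dict = field(default_factory=dict)
    eni_attached: bool = False

def claim_prewarmed_vm() -> MicroVM:
    # Stand-in for popping a paused VM off the warm pool and resuming it.
    return MicroVM(vm_id="vm-7f3a")

def deploy(agent_id: str, api_keys: dict, tool_defs: dict) -> MicroVM:
    vm = claim_prewarmed_vm()           # 1. resume from generic snapshot, no boot
    vm.metadata["api_keys"] = api_keys  # 2. secrets via metadata, never baked into the snapshot
    vm.files.update(tool_defs)          # 3. tool definitions via the virtual filesystem
    vm.eni_attached = True              # 4. join the overlay network via an ENI
    return vm

vm = deploy("agent-1", {"llm": "sk-demo"}, {"/tools/search.py": "def search(): ..."})
print(vm.vm_id, vm.eni_attached)  # vm-7f3a True
```

Note that the snapshot is generic: anything tenant-specific (keys, tools) arrives after resume, which is what lets one snapshot serve every customer.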

The result? 28 seconds at the 99th percentile, from the HTTP API request to an active, stream-ready WebSocket connection, anywhere in the world.