Key takeaways
- Scale-to-zero saves idle cost only when cold-start and model-loading latency fit the workload.
- Large weights and compiled engines challenge ordinary function packaging and ephemeral storage.
- Burst elasticity still requires bounded downstream model capacity and backpressure.
- Durable agent workflows should not rely on one long invocation.
- MicroVM isolation improves boundaries but does not replace application authorization or artifact provenance.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Event/request, function or container image, model reference, ephemeral limits, scaling policy, identity, and timeout.
Owns
Invocation isolation, startup path, concurrency scaling, ephemeral lifecycle, and integration with durable state.
Emits
Result or workflow event, invocation telemetry, cold/warm state, storage operations, and retry status.
Does not own
Infinite accelerator supply, durable progress by default, or safe retries of side effects.
Failure modes
Cold-start timeout, model download storm, ephemeral storage exhaustion, duplicate invocation, downstream overload, and lost state.
Evidence and metrics
Cold/warm start, image/model load, invocation duration, concurrency, throttles, retries, cost, and downstream Goodput.
Cold-start lifecycle
Cold start can include scheduling, microVM/container creation, image pull, runtime init, model download, verification, compilation, loading, allocation, and warmup.
Implementation
Instrument every phase and state which caches/snapshots are warm.
Operational implications
A single startup number hides where optimization is possible.
Measure
Phase latency, cold frequency, failure, first-request delta, and ready time.
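The per-phase instrumentation described above can be sketched as a small timing helper. This is a minimal illustration, not a platform API: `StartupTrace`, `cold_start`, and the phase names are hypothetical, and the sketch assumes phases run sequentially so that ready time is the sum of phase durations.

```python
import time
from contextlib import contextmanager


class StartupTrace:
    """Records duration per named startup phase (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.phases = {}       # phase name -> seconds
        self.cache_state = {}  # cache or snapshot name -> "warm" / "cold"

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Accumulate so a phase that runs twice is not overwritten.
            elapsed = self._clock() - start
            self.phases[name] = self.phases.get(name, 0.0) + elapsed

    def mark_cache(self, name, warm):
        """State explicitly which caches/snapshots were warm for this start."""
        self.cache_state[name] = "warm" if warm else "cold"

    def ready_time(self):
        # Assumes sequential phases; overlapping phases would need a wall clock.
        return sum(self.phases.values())

    def slowest_phase(self):
        return max(self.phases, key=self.phases.get)


def cold_start(trace, steps):
    """Run (name, callable) steps in order, timing each one."""
    for name, step in steps:
        with trace.phase(name):
            step()
    return trace
```

Emitting `trace.phases` and `trace.cache_state` with each invocation's telemetry shows which phase dominates, rather than reporting one opaque startup number.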
Packaging and storage
Function bundles are poorly suited to very large weights and compiled engines.
Implementation
Use immutable registries/object storage, shared or node-local caches, hashes, disk checks, and staggered loading.
Operational implications
Synchronized scale-out can overwhelm storage and network.
Measure
Image/model bytes, cache hit, download time, hash failure, and ephemeral disk.
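As a rough sketch of the caching, hashing, disk-check, and staggering steps above, the following uses only the Python standard library. The function names, the content-addressed cache layout, and the `fetch(dest_path)` callable are assumptions made for illustration; a real deployment would substitute its own registry or object-storage client.

```python
import hashlib
import random
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so large weights are never fully held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_room(path, needed_bytes):
    """Disk check before download, to avoid exhausting ephemeral storage."""
    return shutil.disk_usage(path).free >= needed_bytes


def load_delay(max_jitter_s, rng=random):
    """Random delay so instances in one burst do not hit storage at once."""
    return rng.uniform(0.0, max_jitter_s)


def ensure_artifact(cache_dir, expected_sha256, fetch):
    """Return (path, "hit" | "miss"); fetch only on a miss or a bad hash.

    `fetch(dest_path)` is a hypothetical callable that downloads the artifact.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / expected_sha256  # content-addressed file name
    if target.exists() and sha256_of(target) == expected_sha256:
        return target, "hit"
    partial = target.with_suffix(".partial")
    fetch(partial)
    if sha256_of(partial) != expected_sha256:
        partial.unlink()
        raise ValueError("artifact hash mismatch")
    partial.replace(target)  # rename within the same directory
    return target, "miss"
```

A caller would typically sleep for `load_delay(...)` seconds and check `has_room(...)` before calling `ensure_artifact(...)`, then record the hit or miss as a cache metric.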
Warm pools and snapshots
Provisioned concurrency, warm containers, model snapshots, or resident backing services reduce startup.
Implementation
Define minimum warm capacity and snapshot compatibility; include idle cost and invalidation.
Operational implications
Warm pools trade cost for latency and can preserve stale runtime/model state.
Measure
Warm hit, idle cost, snapshot restore, version mismatch, and scale time.
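The cost-for-latency trade can be made concrete with a little arithmetic. The sketch below is illustrative only: `WarmPoolPlan` and its fields are hypothetical names, and the flat per-hour price is an assumption, not any provider's pricing model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarmPoolPlan:
    warm_instances: int            # minimum instances kept resident
    instance_cost_per_hour: float  # assumed flat price per instance-hour
    snapshot_runtime: str          # runtime/model version baked into the snapshot


def idle_cost_per_day(plan, busy_fraction):
    """Cost of warm capacity that sits unused for a busy fraction in [0, 1]."""
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    hourly = plan.warm_instances * plan.instance_cost_per_hour
    return hourly * 24 * (1.0 - busy_fraction)


def snapshot_usable(plan, deployed_runtime):
    """A snapshot built for another version should be invalidated, not restored."""
    return plan.snapshot_runtime == deployed_runtime


def expected_start_latency(warm_hit_rate, warm_s, cold_s):
    """Mean startup latency given the fraction of requests that hit warm capacity."""
    return warm_hit_rate * warm_s + (1.0 - warm_hit_rate) * cold_s
```

For example, two warm instances at 1.5 per hour that are busy a quarter of the time cost 54.0 per day in idle capacity. That figure can be weighed against the drop in expected start latency as the warm hit rate rises.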
Burst and backpressure
A serverless frontend can create concurrency faster than a scarce model backend can serve.
Implementation
Limit fan-out, queue with deadlines, propagate overload, and coordinate with model-service admission.
Operational implications
Without these controls, elasticity turns into hidden queueing, retries, and cost.
Measure
Invocation concurrency, downstream queue, throttle, retry, Goodput, and cost.
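One way to bound fan-out and propagate overload is a semaphore with a wait deadline in front of the model backend. This is a minimal `asyncio` sketch under stated assumptions: `BoundedBackend`, `Overloaded`, and the `call` client are hypothetical, and the limit applies per process, so it would still need coordinating with the model service's own admission control.

```python
import asyncio


class Overloaded(Exception):
    """Raised to the caller instead of queueing without bound."""


class BoundedBackend:
    """Caps in-flight calls to a scarce backend (hypothetical wrapper).

    Construct it inside a running event loop so the semaphore binds to that loop.
    """

    def __init__(self, call, max_in_flight, max_wait_s):
        self._call = call  # hypothetical async client for the model service
        self._slots = asyncio.Semaphore(max_in_flight)
        self._max_wait_s = max_wait_s

    async def submit(self, request):
        try:
            # Queue for a slot only until the deadline, then shed load.
            await asyncio.wait_for(self._slots.acquire(), timeout=self._max_wait_s)
        except asyncio.TimeoutError:
            raise Overloaded("no backend slot before deadline") from None
        try:
            return await self._call(request)
        finally:
            self._slots.release()


async def demo():
    async def slow_model(request):
        await asyncio.sleep(0.2)  # stands in for a slow inference call
        return request * 2

    backend = BoundedBackend(slow_model, max_in_flight=1, max_wait_s=0.05)
    # The second request cannot get a slot in time and is rejected, not queued.
    return await asyncio.gather(
        backend.submit(1), backend.submit(2), return_exceptions=True
    )
```

Callers that receive `Overloaded` can return a retryable error upstream with a backoff hint instead of retrying immediately, which keeps the overload visible rather than hidden in a queue.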
Durability and idempotency
Function platforms can deliver the same event more than once and can terminate an invocation before it finishes, for example at its timeout.
Implementation
Store state externally, use idempotency/deduplication, checkpoint long tasks, and query authoritative result after ambiguous timeout.
Operational implications
Irreversible writes must never assume an event is delivered at most once; retries make duplicate delivery a normal case.
Measure
Duplicate prevented, ambiguous outcome, resume, retry, compensation, and timeout.
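The idempotency and checkpointing ideas above can be sketched as follows. The in-memory dictionaries stand in for an external durable store, which is an assumption made to keep the example self-contained. A real store must survive restarts and make the check-and-write step atomic, for instance with a conditional write. `IdempotentExecutor` and `run_steps` are hypothetical names.

```python
class IdempotentExecutor:
    """Runs a side effect at most once per idempotency key (in-memory stand-in)."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored result
        self.duplicates_prevented = 0

    def run(self, key, side_effect):
        if key in self._results:
            # Duplicate delivery: return the recorded result, skip the write.
            self.duplicates_prevented += 1
            return self._results[key]
        result = side_effect()
        self._results[key] = result
        return result

    def lookup(self, key):
        """Query the authoritative result after an ambiguous timeout."""
        return self._results.get(key)


def run_steps(steps, checkpoint):
    """Resume a long task from the last completed step recorded in `checkpoint`.

    `checkpoint` is a dict standing in for externally stored workflow state.
    """
    for index in range(checkpoint.get("next", 0), len(steps)):
        steps[index]()
        checkpoint["next"] = index + 1  # persist progress after each step
```

If an invocation is terminated partway through, a later invocation calling `run_steps` with the same checkpoint skips completed steps. Each step should itself be idempotent, because a step can finish just before its checkpoint write is lost.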
Serverless fit
The pattern works well for small event models, preprocessing, asynchronous enrichment, periodic evaluation, and orchestration around a resident model service.
Implementation
Choose persistent serving for large models, sustained load, KV-cache state that must stay resident, or latency-sensitive streaming.
Operational implications
Decide from workload requirements and measured end-to-end cold-start evidence, not from fashion.
Measure
Duty cycle, cold budget, model residency, Goodput, cost, and task success.
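A back-of-the-envelope comparison of duty cycle and cold budget might look like the sketch below. It is deliberately simplified: the function names are hypothetical, it assumes cold-start time is billable, and it assumes the same per-hour price for both options, which real pricing rarely satisfies.

```python
def duty_cycle(busy_seconds, window_seconds):
    """Fraction of the window in which the model is actually serving."""
    return busy_seconds / window_seconds


def cost_scale_to_zero(busy_hours, cold_starts, cold_start_s, price_per_hour):
    # Pay for busy time plus time spent cold-starting (assumed billable).
    return (busy_hours + cold_starts * cold_start_s / 3600.0) * price_per_hour


def cost_resident(window_hours, price_per_hour):
    # A resident service is paid for the whole window, busy or idle.
    return window_hours * price_per_hour


def prefer_scale_to_zero(busy_hours, window_hours, cold_starts,
                         cold_start_s, cold_budget_s, price_per_hour):
    """True only if the cold start fits the latency budget and the cost is lower."""
    if cold_start_s > cold_budget_s:
        return False
    elastic = cost_scale_to_zero(busy_hours, cold_starts, cold_start_s, price_per_hour)
    return elastic < cost_resident(window_hours, price_per_hour)
```

With 72 busy hours in a 720-hour month, 360 cold starts of 10 seconds each, and a price of 2.0 per hour, scale-to-zero costs 146.0 against 1440.0 for a resident service. The same workload fails the check outright if the cold-start budget is 5 seconds.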
Reference tables
| Workload | Fit | Reason |
|---|---|---|
| Small event classifier | Strong | Short execution and modest artifact |
| Large LLM endpoint | Weak unless specialized service | Weights, warmup, KV state, sustained capacity |
| Document enrichment | Moderate-strong | Asynchronous and bursty |
| Long-running agent | Workflow plus functions | Requires durable state |
| Periodic evaluation | Strong | Scheduled bounded jobs |
| Hybrid routing gateway | Moderate | Routing can be elastic; backend is bounded |
Decision checklist
- What is the full cold-start path and percentile budget?
- Does the artifact fit within package-size and ephemeral-storage limits?
- Where will weights be cached and verified?
- Which downstream limits cap burst fan-out?
- Are invocations and side effects idempotent?
- Where is durable workflow state stored?
- Does scale-to-zero save enough to justify startup latency?
Common mistakes
- Treating scheduling delay alone as the cold start.
- Downloading the model independently in every burst instance.
- Allowing serverless concurrency to overwhelm one GPU service.
- Keeping long-running agent state only in process memory.
- Retrying writes without idempotency.
- Choosing scale-to-zero for a continuously busy resident model.
Sources and further reading
- Knative Serving
- Firecracker microVM
- Temporal documentation
Last reviewed: 2026-06-21 UTC
