Cloud and data-center deployments optimize for shared accelerators, elasticity, multi-model serving, distributed inference, and centralized operations.
Key takeaways
- Model-serving clusters are a core deployment pattern.
- Capacity and quota must be explicit.
- Fallback and rollback behavior should be tested.
Patterns
- Model-serving clusters
- Disaggregated prefill/decode
- Multi-tier cache
- Managed serving platforms
- Confidential compute where required
Placement decision
| Factor | Required action |
|---|---|
| Capacity and quota | Record the constraint, assumption, and accepted trade-off in the Runtime Decision Record. |
| Fabric topology | Record the constraint, assumption, and accepted trade-off in the Runtime Decision Record. |
| Multi-tenancy | Record the constraint, assumption, and accepted trade-off in the Runtime Decision Record. |
| Autoscaling lag | Record the constraint, assumption, and accepted trade-off in the Runtime Decision Record. |
| Cost attribution | Record the constraint, assumption, and accepted trade-off in the Runtime Decision Record. |
| Provider and region failure | Record the constraint, assumption, and accepted trade-off in the Runtime Decision Record. |
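
One way to keep these entries consistent is a small structured record per factor. The sketch below is illustrative only: the `PlacementDecision` class, its field names, and the `missing_factors` helper are assumptions made for this example, not a format defined by this guide.

```python
from dataclasses import dataclass, asdict

# Factors taken from the placement-decision table above.
REQUIRED_FACTORS = (
    "Capacity and quota",
    "Fabric topology",
    "Multi-tenancy",
    "Autoscaling lag",
    "Cost attribution",
    "Provider and region failure",
)


@dataclass(frozen=True)
class PlacementDecision:
    """One Runtime Decision Record entry (hypothetical shape)."""
    factor: str
    constraint: str
    assumption: str
    accepted_tradeoff: str

    def empty_fields(self) -> list:
        """Return the names of fields left blank, so gaps are visible in review."""
        return [name for name, value in asdict(self).items() if not value.strip()]


def missing_factors(records) -> list:
    """List the table factors that have no record yet, in table order."""
    covered = {record.factor for record in records}
    return [factor for factor in REQUIRED_FACTORS if factor not in covered]
```

A review step could refuse to approve a deployment while `missing_factors` returns anything or while any record reports `empty_fields`.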
Failure and fallback
Define behavior for network loss, provider failure, device pressure, cache loss, invalid artifacts, and unavailable tools. A fallback must preserve data policy and output contracts; it should not silently broaden authority.
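
The rule that a fallback must not broaden authority can be checked mechanically before a fallback route is enabled. The following is a minimal sketch under stated assumptions: the `Target` type, its fields (`allowed_regions`, `tool_scopes`, `output_schema`), and the subset comparison are hypothetical and stand in for whatever policy model a real deployment uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A serving target described by what it may touch (hypothetical fields)."""
    name: str
    allowed_regions: frozenset   # where request data may be processed
    tool_scopes: frozenset       # tool permissions the target holds
    output_schema: str           # identifier of the output contract


def fallback_violations(primary: Target, fallback: Target) -> list:
    """Return reasons the fallback is unsafe; an empty list means it may be used."""
    problems = []
    # Data policy: the fallback may not process data anywhere the primary may not.
    if not fallback.allowed_regions <= primary.allowed_regions:
        problems.append("data policy: fallback adds processing regions")
    # Authority: the fallback may not hold scopes the primary lacks.
    if not fallback.tool_scopes <= primary.tool_scopes:
        problems.append("authority: fallback adds tool scopes")
    # Output contract: callers must receive the same shape of result.
    if fallback.output_schema != primary.output_schema:
        problems.append("output contract: schema differs")
    return problems
```

Running this check at configuration time, rather than at failover time, means an unsafe fallback is caught before an outage forces its use.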
Implementation checklist
- Document the control, execution, data, and evidence locations.
- Pin artifact, runtime, and policy versions.
- Test cold start, steady state, overload, failure, and rollback.
- Disclose data movement and any fallback to a hosted service to users where relevant.
- Record cost and capacity assumptions.
Control and tenancy boundaries
Shared infrastructure needs separate identities for the human user, application, workload, model deployment, tool adapter, and operator. Tenant context should be bound at admission and propagated through queues, cache keys, tool calls, traces, and billing records. Do not infer tenancy from prompt text or a client-controlled label. Cache reuse and batching require explicit isolation rules so optimization does not become a data-leak path.
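
As an illustration of binding tenancy at admission and carrying it into cache keys, the sketch below takes the tenant only from verified credentials and mixes it into every key. The names `TenantContext`, `bind_tenant`, and `cache_key`, and the claim names `tenant_id` and `workload_id`, are assumptions for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str
    workload_id: str


def bind_tenant(verified_claims: dict, request_body: dict) -> TenantContext:
    """Build tenant context at admission from authenticated claims only.

    request_body is accepted but deliberately never read for tenancy:
    anything in it is client-controlled.
    """
    return TenantContext(
        tenant_id=verified_claims["tenant_id"],
        workload_id=verified_claims["workload_id"],
    )


def cache_key(ctx: TenantContext, model_version: str, prompt_prefix: bytes) -> str:
    """Derive a cache key that cannot collide across tenants or model versions."""
    digest = hashlib.sha256()
    for part in (ctx.tenant_id.encode(), model_version.encode(), prompt_prefix):
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()
```

With this scheme two tenants sending the same prompt prefix never share a cache entry. Sharing across tenants would then have to be an explicit, reviewed exception rather than a side effect of key design.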
The control plane should version model repositories, routing policy, deployment configuration, quotas, and rollout state. Execution workers should report the applied versions and remain replaceable. Durable request state, approval state, and evidence should not depend on one pod or accelerator process surviving.
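
A small comparison of desired and reported versions shows how worker reports can be used. This is a sketch only: the component names and version strings are made up, and `version_drift` is a hypothetical helper, not part of any specific serving platform.

```python
def version_drift(desired: dict, applied: dict) -> dict:
    """Compare control-plane versions with what a worker reports.

    Returns {component: (desired_version, applied_version_or_None)} for each
    component where the worker is behind, ahead, or silent.
    """
    drift = {}
    for component, want in desired.items():
        have = applied.get(component)  # None means the worker did not report it
        if have != want:
            drift[component] = (want, have)
    return drift


# Illustrative values only.
desired = {"model": "m-1", "routing_policy": "r-7", "deployment_config": "d-3"}
reported = {"model": "m-1", "routing_policy": "r-6"}
print(version_drift(desired, reported))
```

A worker that shows drift can be drained and replaced rather than repaired in place, which is consistent with keeping workers replaceable.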
Overload and multi-tenant failure
- Use admission control before queues become unbounded.
- Separate traffic classes with explicit latency and cost objectives.
- Protect decode work from long prefill or batch interference where the workload requires it.
- Define cache-eviction and model-residency behavior under pressure.
- Test region, provider, network, registry, queue, cache, and evidence-service outages.
- Expose rejected, shed, rerouted, timed-out, and degraded work as distinct outcomes.
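
The first and last items above can be sketched together: a bounded queue per traffic class that returns a distinct outcome for each kind of refusal. The class and method names are hypothetical. The split used here, where `REJECTED` means the request is not admissible at all and `SHED` means it was dropped because of load, is an assumption of this example and not a definition from this guide.

```python
from collections import deque
from enum import Enum


class Outcome(Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"    # not admissible (here: unknown traffic class)
    SHED = "shed"            # admissible, but dropped because the queue is full
    REROUTED = "rerouted"    # remaining outcomes are listed for completeness;
    TIMED_OUT = "timed_out"  # this sketch never produces them
    DEGRADED = "degraded"


class AdmissionController:
    """Bounded per-class queues, so overload is refused rather than buffered."""

    def __init__(self, max_depth_per_class: dict):
        self._limits = dict(max_depth_per_class)
        self._queues = {name: deque() for name in self._limits}

    def admit(self, traffic_class: str, request_id: str) -> Outcome:
        if traffic_class not in self._limits:
            return Outcome.REJECTED
        queue = self._queues[traffic_class]
        if len(queue) >= self._limits[traffic_class]:
            return Outcome.SHED
        queue.append(request_id)
        return Outcome.ACCEPTED

    def depth(self, traffic_class: str) -> int:
        return len(self._queues[traffic_class])
```

Because each traffic class has its own limit, a burst of batch work fills only the batch queue and interactive requests continue to be admitted.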
