Key takeaways
- Centralized accelerators support large models and shared capacity but require admission, isolation, and cost attribution.
- Managed services reduce platform ownership but add provider limits, version cadence, and lock-in.
- Private clusters increase control while moving driver, scheduler, network, security, and upgrade responsibility in-house.
- Autoscaling must include model download, engine build, cache warmup, and scarce accelerator availability.
- Multi-region designs require model consistency, affinity, capacity, residency, and explicit failover.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Model artifacts, cluster capacity, runtime images, requests, placement policy, tenant quotas, residency rules, and scaling signals.
Owns
Cluster placement, network/service boundary, shared capacity, tenancy isolation, image/runtime lifecycle, and regional recovery.
Emits
Hosted endpoints, routes, utilization/cost telemetry, lifecycle events, and resilience evidence.
Does not own
Application authorization or proof that a managed service meets every governance requirement.
Failure modes
Accelerator shortage, queue overload, driver/runtime skew, region outage, network bottleneck, noisy neighbor, and runaway cost.
Evidence and metrics
Goodput, utilization, queue depth, model residency, scale-up time, network transfer, cost, tenant share, and failover recovery time.
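The text does not define goodput, so the sketch below assumes one common reading: requests that both succeed and finish within a latency SLO, counted per second of measurement window. `RequestRecord`, `goodput_per_second`, and the sample values are illustrative names and numbers, not taken from any particular serving stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    ok: bool           # request completed without error
    latency_s: float   # end-to-end latency in seconds


def goodput_per_second(records, slo_latency_s, window_s):
    """Requests per second that succeeded AND met the latency SLO (assumed definition)."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    good = sum(1 for r in records if r.ok and r.latency_s <= slo_latency_s)
    return good / window_s


# Made-up sample: four requests observed over a 10 second window.
records = [
    RequestRecord(ok=True, latency_s=0.8),
    RequestRecord(ok=True, latency_s=2.5),   # succeeded but too slow, not counted
    RequestRecord(ok=False, latency_s=0.3),  # fast but failed, not counted
    RequestRecord(ok=True, latency_s=1.0),
]
goodput = goodput_per_second(records, slo_latency_s=1.0, window_s=10.0)
```

Raw throughput for the same window would count all four requests; this reading of goodput counts only the two that were useful to the caller.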
Self-managed infrastructure
Bare metal or Kubernetes provides control over accelerators, networking, storage, drivers, security, and runtime selection.
Implementation
Maintain device plugins/operators, immutable images, node pools, registry/cache, network policy, and tested upgrade rings.
Operational implications
Control is valuable for private or specialized workloads but requires deep platform capability.
Measure
Node readiness, driver/runtime skew, utilization, queue, rollout, and incident load.
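One way to make driver/runtime skew measurable is to compare every node against the most common version pair in its pool. The sketch below assumes node inventory is already available as plain dictionaries; the field names and version strings are placeholders, and how the inventory is collected is out of scope.

```python
from collections import Counter


def find_version_skew(nodes):
    """Return (baseline, skewed_node_names).

    baseline is the most common (driver, runtime) pair in the pool;
    skewed_node_names lists the nodes running any other pair.
    """
    if not nodes:
        return None, []
    pairs = Counter((n["driver"], n["runtime"]) for n in nodes)
    baseline = pairs.most_common(1)[0][0]
    skewed = [n["name"] for n in nodes if (n["driver"], n["runtime"]) != baseline]
    return baseline, skewed


# Placeholder inventory; the version strings are invented for illustration.
pool = [
    {"name": "node-a", "driver": "drv-1.2", "runtime": "rt-3.0"},
    {"name": "node-b", "driver": "drv-1.2", "runtime": "rt-3.0"},
    {"name": "node-c", "driver": "drv-1.3", "runtime": "rt-3.0"},
]
baseline, skewed = find_version_skew(pool)
```

During a staged rollout some skew is expected, so a check like this is probably more useful as a trend per upgrade ring than as a hard alert.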
Managed endpoints and platforms
Providers abstract nodes, scaling, runtime, or model APIs.
Implementation
Record provider, region, release/model version, quota, privacy boundary, observability, pricing date, and fallback.
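The fields listed above could be captured as a small immutable record so that each managed dependency is documented the same way. This is a sketch only: the class name, field names, the 90-day staleness threshold, and all example values are assumptions, not a provider schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ManagedEndpointRecord:
    provider: str
    region: str
    model_version: str
    quota_requests_per_minute: int
    privacy_boundary: str
    observability: str
    pricing_as_of: date   # the date the recorded pricing was checked
    fallback: str


def pricing_is_stale(record, today, max_age_days=90):
    """True when the recorded pricing date is older than max_age_days (assumed policy)."""
    return (today - record.pricing_as_of).days > max_age_days


# Hypothetical entry; none of these values describe a real provider.
record = ManagedEndpointRecord(
    provider="example-provider",
    region="example-region-1",
    model_version="example-model-2025-01",
    quota_requests_per_minute=600,
    privacy_boundary="no prompt retention (per contract)",
    observability="provider metrics plus client-side traces",
    pricing_as_of=date(2025, 1, 15),
    fallback="self-hosted replica pool",
)
```

Recording the pricing date explicitly makes it possible to flag records that nobody has rechecked since the provider last changed its terms.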
Operational implications
The provider owns more infrastructure; the application still owns data policy, quality, authorization, and downstream effects.
Measure
Provider latency, throttling and error rates, quota usage, cost, route, and release changes.
Capacity and warm pools
Accelerator provisioning and model readiness can take minutes.
Implementation
Forecast baseline capacity, maintain warm pools for burst SLOs, and scale on queue depth and active-token signals rather than CPU utilization.
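A minimal sketch of the sizing logic, assuming load is expressed in tokens and each replica sustains a known token budget. The function names, the per-replica budget, the stage durations, and the 60 second timeout are all illustrative assumptions.

```python
import math


def desired_replicas(active_tokens, queued_tokens, tokens_per_replica, warm_min):
    """Replicas needed for active plus queued token load, never below the warm pool size."""
    if tokens_per_replica <= 0:
        raise ValueError("tokens_per_replica must be positive")
    needed = math.ceil((active_tokens + queued_tokens) / tokens_per_replica)
    return max(needed, warm_min)


def cold_start_seconds(provision_s, download_s, engine_build_s, warmup_s):
    """Time from the scale-up decision to a replica that can actually serve traffic."""
    return provision_s + download_s + engine_build_s + warmup_s


# Made-up numbers: 21,000 tokens of demand against 4,000 tokens per replica.
replicas = desired_replicas(12_000, 9_000, tokens_per_replica=4_000, warm_min=2)

# Made-up stage durations in seconds.
cold = cold_start_seconds(provision_s=180, download_s=120, engine_build_s=60, warmup_s=30)

# Assumed 60 s request timeout: a cold start longer than that cannot rescue
# requests that are already queued, so burst capacity has to come from a warm pool.
needs_warm_pool = cold > 60
```

Summing the stages separately, rather than tracking one "scale-up time" number, shows which stage to shorten first, for example by caching model artifacts near the nodes.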
Operational implications
Reactive scaling can arrive after requests time out; scale-down can discard valuable cache and residency.
Measure
Time to capacity, warm idle cost, queue during scale, cache lost, and scale oscillation.
Multi-tenancy
Tenants share APIs, servers, models, accelerators, caches, and telemetry.
Implementation
Enforce identity, model access, quotas, cache scope, temporary storage, admin RBAC, network policy, and trace access.
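Two of these controls, cache scope and quotas, can be sketched in a few lines. The example assumes cache entries are looked up by a string key and that quotas are expressed as a maximum number of in-flight requests per tenant; `scoped_cache_key` and `TenantQuota` are hypothetical names, and the class is not thread-safe as written.

```python
import hashlib


def scoped_cache_key(tenant_id, model_id, prompt_prefix):
    """Derive a cache key that includes the tenant, so identical prompts never share entries."""
    h = hashlib.sha256()
    for part in (tenant_id, model_id, prompt_prefix):
        data = part.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguous concatenation
        h.update(data)
    return h.hexdigest()


class TenantQuota:
    """In-flight request limit per tenant (single-threaded sketch)."""

    def __init__(self, max_in_flight):
        self._max = dict(max_in_flight)
        self._in_flight = {}

    def try_admit(self, tenant_id):
        limit = self._max.get(tenant_id, 0)  # unknown tenants get no capacity
        used = self._in_flight.get(tenant_id, 0)
        if used >= limit:
            return False
        self._in_flight[tenant_id] = used + 1
        return True

    def release(self, tenant_id):
        used = self._in_flight.get(tenant_id, 0)
        if used > 0:
            self._in_flight[tenant_id] = used - 1


quota = TenantQuota({"tenant-a": 1})
first = quota.try_admit("tenant-a")    # admitted
second = quota.try_admit("tenant-a")   # denied until the first request is released
```

Scoping the key by tenant gives up cross-tenant cache hits in exchange for isolation; whether that trade is acceptable depends on the data involved.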
Operational implications
Container boundaries alone do not isolate application data or cache.
Measure
Per-tenant goodput and queue depth, quota denials, cross-tenant access alerts, cache share, and cost attribution.
Networking and storage
Models, shards, requests, KV state, telemetry, and control traffic use different paths and performance requirements.
Implementation
Separate control/data planes, optimize model distribution, use private links where required, and measure transfer tails.
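To illustrate "measure transfer tails", the sketch below estimates an ideal model load time from link bandwidth and computes a nearest-rank percentile over observed load times. The model size, link speed, and samples are invented, and nearest-rank is just one of several percentile definitions.

```python
def model_load_seconds(model_bytes, link_bits_per_second):
    """Lower bound on load time: ignores protocol overhead, storage limits, and contention."""
    if link_bits_per_second <= 0:
        raise ValueError("link_bits_per_second must be positive")
    return model_bytes * 8 / link_bits_per_second


def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) using integer math
    return ordered[rank - 1]


# Assumed 140 GB of weights over a 10 Gbit/s link.
ideal_s = model_load_seconds(140e9, 10e9)

# Invented load times in seconds; one slow transfer dominates the tail.
load_times = [42.0, 45.0, 44.0, 43.0, 41.0, 46.0, 44.0, 43.0, 45.0, 130.0]
p50 = percentile_nearest_rank(load_times, 50)
p99 = percentile_nearest_rank(load_times, 99)
```

In this sample the median looks healthy while the 99th percentile is roughly three times slower, which is the kind of gap an average would hide.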
Operational implications
Network and storage bottlenecks can leave accelerators idle while they wait for data, so low utilization may signal a transfer problem rather than spare capacity.
Measure
Model load bandwidth, request/response bytes, collective/KV transfer, storage errors, and egress.
Resilience and failover
Availability spans replica, node, zone, region, repository, and control-plane failures.
Implementation
Keep immutable artifacts/config, readiness-aware routing, tested rollback, capacity reserve, and residency-aware failover.
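Residency-aware failover can be sketched as a filter over candidate regions before any routing decision is made. The `Region` fields, region names, and the "most spare capacity wins" tie-break are assumptions for illustration; a real router would also weigh latency and cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    jurisdiction: str      # residency zone the region belongs to
    healthy: bool
    spare_replicas: int    # reserve capacity available for failover
    models: frozenset      # model identifiers already deployable there


def pick_failover(regions, failed, allowed_jurisdictions, model_id, replicas_needed):
    """Return the name of a compliant region with enough reserve, or None."""
    candidates = [
        r for r in regions
        if r.name != failed
        and r.healthy
        and r.jurisdiction in allowed_jurisdictions
        and model_id in r.models
        and r.spare_replicas >= replicas_needed
    ]
    if not candidates:
        return None  # explicit "no safe target" beats silently violating residency
    return max(candidates, key=lambda r: r.spare_replicas).name


# Hypothetical topology.
regions = [
    Region("region-a", "EU", True, 0, frozenset({"model-x"})),
    Region("region-b", "EU", True, 4, frozenset({"model-x"})),
    Region("region-c", "US", True, 10, frozenset({"model-x"})),  # wrong jurisdiction
    Region("region-d", "EU", True, 6, frozenset({"model-y"})),   # model not available
]
target = pick_failover(regions, "region-a", {"EU"}, "model-x", replicas_needed=3)
```

Returning `None` when nothing qualifies forces the caller to decide between shedding load and escalating, instead of defaulting to the largest region.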
Operational implications
Cross-region failover changes latency, data residency, cost, and available models.
Measure
Recovery objective, failover route, capacity after failure, data-boundary compliance, and rollback.
Compatibility and upgrades
Drivers, firmware, libraries, runtime, container, model artifact, and hardware must be tested together.
Implementation
Use canary node pools, compatibility tests, workload benchmarks, and known-good rollback.
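A compatibility matrix can be as simple as a set of full combinations that passed testing together, with upgrades gated on membership. The `Stack` fields are a subset of the components named above, and every version string below is a placeholder.

```python
from typing import NamedTuple


class Stack(NamedTuple):
    driver: str
    runtime: str
    container: str
    model_artifact: str


def changed_components(current, proposed):
    """Names of the fields that differ between two stacks."""
    return [f for f in Stack._fields if getattr(current, f) != getattr(proposed, f)]


def upgrade_allowed(proposed, tested_stacks):
    """Allow an upgrade only if the whole combination was tested together."""
    return proposed in tested_stacks


# Placeholder versions; the set stands in for results from canary pools and benchmarks.
tested = {
    Stack("drv-1.2", "rt-3.0", "img-10", "model-v3"),
    Stack("drv-1.3", "rt-3.1", "img-11", "model-v3"),
}
current = Stack("drv-1.2", "rt-3.0", "img-10", "model-v3")
driver_only = current._replace(driver="drv-1.3")  # driver bumped on its own
full_bump = Stack("drv-1.3", "rt-3.1", "img-11", "model-v3")
```

Here `driver_only` is rejected even though each of its components appears somewhere in the tested set, because that exact combination was never exercised.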
Operational implications
Independent upgrades can change numerics, memory, kernels, or model loading.
Measure
Upgrade pass, regression, version skew, rollback time, and support window.
Cost and utilization
Central clusters trade high fixed/idle cost against efficient sharing.
Implementation
Measure cost per successful request, warm reserve, network/storage, tool/provider calls, observability, and failure capacity.
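A worked example of cost per successful request, assuming all cost items are expressed in the same currency for the same period. The category names follow the list above; the amounts, request count, and tenant allocations are invented.

```python
def cost_per_success(cost_items, successful_requests):
    """Total cost for the period divided by requests that actually succeeded."""
    if successful_requests <= 0:
        raise ValueError("need at least one successful request")
    return sum(cost_items.values()) / successful_requests


def unallocated_cost(cost_items, tenant_allocations):
    """Cost that no tenant has been charged for."""
    return sum(cost_items.values()) - sum(tenant_allocations.values())


# Invented daily figures.
costs = {
    "accelerators": 900.0,
    "warm_reserve": 200.0,
    "network_storage": 60.0,
    "observability": 40.0,
}
per_success = cost_per_success(costs, successful_requests=400_000)
leftover = unallocated_cost(costs, {"tenant-a": 700.0, "tenant-b": 350.0})
```

Dividing by successful requests rather than all requests means that failures and retries raise the unit cost, which keeps reliability problems visible in the cost figure.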
Operational implications
High utilization can degrade tail latency and put SLOs at risk; low utilization can be intentional resilience headroom.
Measure
Cost per successful request, utilization, goodput per device, idle reserve, egress, and unallocated cost.
Reference tables
| Choice | Control | Operational burden | Primary dependency |
|---|---|---|---|
| Self-managed bare metal | Highest | Highest | Hardware, drivers, scheduler, network |
| Self-managed Kubernetes | High | High | Cluster platform and GPU operators |
| Managed Kubernetes | Moderate-high | Moderate | Provider node supply/platform versions |
| Managed inference endpoint | Lower | Lower | Provider runtime, quotas, APIs, pricing |
| Hybrid private/managed | Selective | Highest integration burden | Routing, identity, policy, parity |
Decision checklist
- What provider or private-control boundary is required?
- How long does scale-up take from zero usable capacity?
- Which tenant, cache, and network isolation controls apply?
- What data-residency and cross-region rules apply?
- How is accelerator scarcity handled?
- What compatibility matrix governs upgrades?
- How are cost and capacity attributed?
Common mistakes
- Autoscaling on CPU while accelerator queues grow.
- Assuming managed service removes governance responsibility.
- Failing over across regions without checking residency/capacity.
- Co-locating tenants without cache/telemetry isolation.
- Scaling down warm workers solely on utilization.
- Upgrading drivers and runtime independently.
Sources and further reading
- Kubernetes device plugins
- KServe architecture
- Triton Inference Server
- Ray Serve production guide
Last reviewed: 2026-06-21 UTC
