Runtime SLOs and Goodput

Runtime service objectives define useful work under latency, quality, availability, and safety constraints. Goodput counts work that satisfies the objective rather than all work attempted.

Key takeaways

Use separate objectives for queue delay, first token, output token cadence, completion, and task outcome.
Admission control protects accepted work.
Agentic tasks need deadline, side-effect, recovery, and evidence objectives beyond model latency.

Definitions

SLI: Measured indicator such as time to first token (TTFT), completion rate, or evidence gap.
SLO: Target for an SLI over a window and traffic class.
SLA: External commitment with consequences.
Goodput: Work completed while meeting the defined SLO and quality constraints.

Service objectives

For each workload, define objectives for queue delay, TTFT, time per output token (TPOT) or streaming cadence, full completion, timeout, availability, model quality, tool success, and safe failure. Percentiles expose tail behavior that averages hide.
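As a minimal sketch of why percentiles matter, the following Python compares the mean with nearest-rank percentiles over a set of TTFT samples. The sample values and the `percentile` helper are illustrative assumptions, not measurements or a prescribed method.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds: 95 fast requests and 5 slow ones.
ttft_ms = [200] * 95 + [4000] * 5

mean_ms = statistics.mean(ttft_ms)
p50_ms = percentile(ttft_ms, 50)
p99_ms = percentile(ttft_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

With these assumed samples the mean is 390 ms and the median is 200 ms, while p99 is 4000 ms: one request in twenty waits four seconds, which neither the mean nor the median reveals.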

Goodput

Raw tokens per second can rise while users experience slower or invalid results. SLO-constrained goodput counts only requests that meet latency, quality, and completion criteria. For agents, count completed workflows without unauthorized or duplicate effects.
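The difference between raw throughput and SLO-constrained goodput can be sketched as follows. The record fields, the thresholds in `meets_slo`, and the sample data are hypothetical; a real system would take them from its own SLO definitions.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    ttft_s: float        # time to first token, seconds
    total_s: float       # time to full completion, seconds
    completed: bool      # request finished without timeout or abort
    output_valid: bool   # output passed the quality check
    output_tokens: int


def meets_slo(r, max_ttft_s=1.0, max_total_s=10.0):
    """True only when latency, completion, and quality criteria all hold (assumed thresholds)."""
    return (
        r.completed
        and r.output_valid
        and r.ttft_s <= max_ttft_s
        and r.total_s <= max_total_s
    )


def goodput(records, window_s):
    """Compare raw token throughput with the rate of requests that satisfy the SLO."""
    good = [r for r in records if meets_slo(r)]
    return {
        "raw_tokens_per_s": sum(r.output_tokens for r in records) / window_s,
        "good_requests_per_s": len(good) / window_s,
        "good_fraction": len(good) / len(records) if records else 0.0,
    }


records = [
    RequestRecord(0.4, 5.0, True, True, 300),   # meets the SLO
    RequestRecord(2.5, 9.0, True, True, 400),   # first token too slow
    RequestRecord(0.5, 6.0, True, False, 500),  # invalid output
    RequestRecord(0.3, 4.0, True, True, 200),   # meets the SLO
]
result = goodput(records, window_s=10.0)
print(result)
```

In this sample the system emits 140 tokens per second, yet only half of the requests count toward goodput, because one missed the TTFT threshold and one produced invalid output.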

Queueing and overload

Use bounded queues, deadline-aware admission, priority isolation, backpressure, and load shedding. Rejecting work early preserves capacity for requests that can still meet their objective; accepting requests that cannot meet theirs consumes resources without adding goodput.
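One way to combine a bounded queue with deadline-aware admission is sketched below. The `AdmissionController` class, its fixed per-request service estimate, and the returned status strings are assumptions for illustration; production systems usually estimate service time from prompt length, batch state, and recent measurements.

```python
from collections import deque


class AdmissionController:
    """Admit a request only if the queue has room and the estimated finish time meets its deadline."""

    def __init__(self, max_queue_len, est_service_s, workers=1):
        self.max_queue_len = max_queue_len
        self.est_service_s = est_service_s  # assumed constant service time per request
        self.workers = workers
        self.queue = deque()

    def try_admit(self, request_id, deadline_s):
        """deadline_s is the time budget, in seconds from now, for the request to finish."""
        if len(self.queue) >= self.max_queue_len:
            return "rejected_queue_full"  # bounded queue: shed instead of growing
        est_wait_s = len(self.queue) * self.est_service_s / self.workers
        est_finish_s = est_wait_s + self.est_service_s
        if est_finish_s > deadline_s:
            return "rejected_deadline"  # would miss its objective; reject early
        self.queue.append(request_id)
        return "admitted"


ctrl = AdmissionController(max_queue_len=3, est_service_s=2.0)
decisions = [
    ctrl.try_admit("a", deadline_s=5.0),    # empty queue, finishes in about 2 s
    ctrl.try_admit("b", deadline_s=3.0),    # would finish in about 4 s, too late
    ctrl.try_admit("c", deadline_s=10.0),
    ctrl.try_admit("d", deadline_s=10.0),
    ctrl.try_admit("e", deadline_s=100.0),  # queue is full
]
print(decisions)
```

Request "b" is rejected immediately rather than being queued and timing out later, so its caller can retry elsewhere and the requests already admitted keep their estimated finish times.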

Traffic classes

Interactive chat, batch summarization, embeddings, coding agents, and high-impact approval workflows require different objectives and capacity reservations. Do not allow background prefill or evaluation work to starve latency-critical decode or safety operations.
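A capacity reservation can be sketched as a check that keeps unused reserved slots out of reach of other classes. The slot counts, class names, and the `can_start` helper are hypothetical; real schedulers typically reserve GPU memory, batch slots, or token budgets rather than simple counters.

```python
TOTAL_SLOTS = 10
# Hypothetical reservations: slots that other traffic classes may not consume.
RESERVED_SLOTS = {"interactive": 6, "safety": 2}


def can_start(traffic_class, in_use):
    """Return True if traffic_class may take a slot without consuming another class's reservation.

    in_use maps each traffic class to the number of slots it currently holds.
    """
    free = TOTAL_SLOTS - sum(in_use.values())
    # Slots that must stay available for other classes with unused reservations.
    held_back = sum(
        max(0, reserved - in_use.get(cls, 0))
        for cls, reserved in RESERVED_SLOTS.items()
        if cls != traffic_class
    )
    return free - held_back > 0


batch_blocked = can_start("batch", {"batch": 2})        # 8 free, all 8 reserved for others
interactive_ok = can_start("interactive", {"batch": 2})  # may use its own reservation
batch_ok = can_start("batch", {"batch": 1})              # 9 free, 8 reserved, 1 usable
print(batch_blocked, interactive_ok, batch_ok)
```

Under these assumptions batch work can use at most two slots while interactive and safety traffic are idle, so a burst of background work cannot delay the first latency-critical request that arrives.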

Error budgets

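An error budget is the share of requests or time that may miss the SLO within a window, that is, one minus the target. It balances reliability against the pace of change. Count failures from every cause against it: overload, model errors, tools, policy, approvals, and evidence. Burn-rate alerts should trigger a concrete response such as rollback, scaling, route restriction, or a change freeze.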
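Burn rate can be computed as the observed failure rate divided by the error budget. The sketch below counts failures from every cause and, as an assumption borrowed from common SRE practice rather than from this document, alerts only when both a short and a long window exceed a threshold. The windows, counts, and threshold value are hypothetical.

```python
def burn_rate(total_requests, failures_by_cause, slo_target):
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget would be exactly used up by the end of the
    SLO window if this failure rate continued; higher values exhaust it sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 0.0
    error_rate = sum(failures_by_cause.values()) / total_requests
    return error_rate / (1.0 - slo_target)


def should_alert(short_window_rate, long_window_rate, threshold):
    """Require both windows to exceed the threshold so brief spikes do not page."""
    return short_window_rate >= threshold and long_window_rate >= threshold


# Hypothetical counts for a 99% SLO: last hour and last six hours.
short_rate = burn_rate(
    1000,
    {"overload": 12, "model_error": 5, "tool": 6, "policy": 3, "approval": 2, "evidence": 2},
    slo_target=0.99,
)
long_rate = burn_rate(6000, {"overload": 90, "tool": 60}, slo_target=0.99)
page = should_alert(short_rate, long_rate, threshold=2.0)
print(round(short_rate, 2), round(long_rate, 2), page)
```

Here the budget is burning at roughly three times the sustainable rate over the short window and 2.5 times over the long window, so the alert fires and should lead to one of the responses named above.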

Task-level SLOs

Time to successful or safely terminated outcome
First-attempt and recovery-adjusted success
Approval wait and expiry
Unauthorized or duplicate side-effect rate
Evidence completeness and trace correlation
Cost per validated outcome
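Several of the task-level indicators above can be derived from per-workflow records, as in this sketch. The `WorkflowRecord` fields, the definition of a validated outcome, and the sample data are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkflowRecord:
    attempts: int            # total attempts, including retries and fallbacks
    succeeded: bool          # final outcome accepted
    bad_side_effects: int    # unauthorized or duplicate side effects observed
    evidence_complete: bool  # required evidence was persisted
    cost_cents: int          # cost of all attempts, including failed ones


def task_slis(records):
    """Compute task-level indicators over a non-empty list of workflow records."""
    n = len(records)
    validated = [
        r for r in records
        if r.succeeded and r.evidence_complete and r.bad_side_effects == 0
    ]
    total_cost = sum(r.cost_cents for r in records)  # failed workflows still cost money
    return {
        "first_attempt_success": sum(r.succeeded and r.attempts == 1 for r in records) / n,
        "recovery_adjusted_success": sum(r.succeeded for r in records) / n,
        "bad_side_effect_rate": sum(r.bad_side_effects > 0 for r in records) / n,
        "cost_per_validated_cents": total_cost / len(validated) if validated else None,
    }


workflows = [
    WorkflowRecord(1, True, 0, True, 10),
    WorkflowRecord(3, True, 0, True, 30),   # succeeded after recovery
    WorkflowRecord(2, False, 0, True, 20),  # failed, still counted in cost
    WorkflowRecord(1, True, 1, True, 10),   # succeeded but caused a duplicate effect
]
slis = task_slis(workflows)
print(slis)
```

The total cost of all four workflows is divided by the two validated outcomes, so failed and invalid work raises the cost per validated outcome instead of disappearing from the accounting.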

Measurement contract

An SLO is meaningful only when its population, start and stop points, exclusions, aggregation window, and failure treatment are defined. For streaming inference, distinguish queue delay, time to first token, time per output token, and completion. For agentic work, add deadline attainment, valid side-effect completion, approval wait, recovery, evidence persistence, and final task acceptance.

Correlate these signals through one request or workflow identifier. Do not remove retries or failed attempts from cost and capacity accounting merely because the final attempt succeeded. Goodput should count only work that satisfies the declared latency, quality, policy, and evidence conditions.
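The following sketch groups attempt-level events by a shared workflow identifier so that latency runs from the first attempt's start to the final attempt's end and token usage includes failed attempts. The event fields and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical attempt events; wf-1 failed once and succeeded on retry.
attempts = [
    {"workflow_id": "wf-1", "start_s": 0.0, "end_s": 4.0, "ok": False, "tokens": 500},
    {"workflow_id": "wf-1", "start_s": 4.5, "end_s": 7.0, "ok": True, "tokens": 450},
    {"workflow_id": "wf-2", "start_s": 1.0, "end_s": 3.0, "ok": True, "tokens": 300},
]


def summarize(events):
    """Aggregate attempts per workflow without resetting the clock on retry."""
    by_workflow = defaultdict(list)
    for event in events:
        by_workflow[event["workflow_id"]].append(event)

    summary = {}
    for workflow_id, items in by_workflow.items():
        ordered = sorted(items, key=lambda e: e["start_s"])
        summary[workflow_id] = {
            # Clock starts at the first attempt, not at the attempt that succeeded.
            "latency_s": max(e["end_s"] for e in ordered) - ordered[0]["start_s"],
            "attempts": len(ordered),
            # Tokens from failed attempts still count toward cost and capacity.
            "tokens": sum(e["tokens"] for e in ordered),
            "succeeded": ordered[-1]["ok"],
        }
    return summary


summary = summarize(attempts)
print(summary)
```

Measured this way, wf-1 takes 7.0 seconds and 950 tokens; reporting only its successful retry would show 2.5 seconds and 450 tokens and understate both latency and cost.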

SLO anti-patterns

Reporting average latency while tail requests violate user deadlines.
Counting generated tokens as success when the task output is invalid.
Measuring only accepted traffic and hiding rejected or shed work.
Combining unlike workloads into one percentile.
Resetting the clock after a retry or route fallback.
Ignoring approval, tool, evidence, and recovery time in task completion.
