Runtime service objectives define useful work under latency, quality, availability, and safety constraints. Goodput counts work that satisfies the objective rather than all work attempted.
Key takeaways
- Use separate objectives for queue, first token, output token, completion, and task outcome.
- Admission control protects already-accepted work by refusing requests that cannot meet their objective.
- Agentic tasks need deadline, side-effect, recovery, and evidence objectives beyond model latency.
Definitions
- SLI (service level indicator): a measured indicator such as time to first token (TTFT), completion rate, or evidence gap.
- SLO (service level objective): a target for an SLI over a window and traffic class.
- SLA (service level agreement): an external commitment with consequences.
- Goodput: work completed while meeting the defined SLO and quality constraints.
Service objectives
For each workload, define objectives for queue delay, TTFT, time per output token (TPOT) or streaming cadence, full completion, timeout, availability, model quality, tool success, and safe failure. Percentiles expose tail behavior that averages hide.
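To see why an average can hide the tail, the sketch below compares the mean with nearest-rank percentiles. The sample values and the helper name `percentile` are illustrative assumptions, not measurements from any real system.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Smallest rank such that at least p% of samples are at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds: 95 fast requests, 5 slow ones.
ttft_ms = [200] * 95 + [4000] * 5

summary = {
    "mean": mean(ttft_ms),
    "p50": percentile(ttft_ms, 50),
    "p99": percentile(ttft_ms, 99),
}
print(summary)
```

In this made-up sample the mean sits well below the slow requests, while the 99th percentile lands on them, which is the behavior a tail-latency objective is meant to catch.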
Goodput
Raw tokens per second can rise while users experience slower or invalid results. SLO-constrained goodput counts only requests that meet latency, quality, and completion criteria. For agents, count completed workflows without unauthorized or duplicate effects.
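The distinction between raw throughput and goodput can be sketched as follows. The record fields, the 2-second latency objective, and the 10-second window are assumptions chosen for illustration; a real system would define them in its measurement contract.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float     # end-to-end latency of the request
    output_tokens: int   # tokens generated, whether useful or not
    completed: bool      # the response finished without timeout or error
    valid: bool          # the output passed the quality check


def throughput_and_goodput(records, window_s, latency_slo_s):
    """Return raw token throughput next to SLO-constrained goodput."""
    total_tokens = sum(r.output_tokens for r in records)
    good = [
        r for r in records
        if r.completed and r.valid and r.latency_s <= latency_slo_s
    ]
    return {
        "tokens_per_s": total_tokens / window_s,
        "goodput_rps": len(good) / window_s,
        "good_fraction": len(good) / len(records) if records else 0.0,
    }


# Hypothetical window: only the first request meets every criterion.
records = [
    RequestRecord(latency_s=1.0, output_tokens=100, completed=True, valid=True),
    RequestRecord(latency_s=2.5, output_tokens=400, completed=True, valid=True),
    RequestRecord(latency_s=1.5, output_tokens=300, completed=True, valid=False),
    RequestRecord(latency_s=0.8, output_tokens=200, completed=False, valid=True),
]
result = throughput_and_goodput(records, window_s=10.0, latency_slo_s=2.0)
print(result)
```

All four requests contribute tokens to the throughput figure, but only one counts toward goodput.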
Queueing and overload
Use bounded queues, deadline-aware admission, priority isolation, backpressure, and load shedding. Rejecting work early frees capacity for requests that can still meet their objective, whereas accepting a request that will miss its deadline consumes resources without producing goodput.
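A minimal sketch of a bounded queue with deadline-aware admission follows. It assumes a single FIFO server and a per-request service-time estimate (`est_service_s`), both of which are simplifications; the class and field names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    deadline_s: float     # absolute deadline on the same clock as now_s
    est_service_s: float  # assumed estimate of how long service will take


class AdmissionQueue:
    """Bounded FIFO queue that rejects work it cannot finish in time."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.pending = deque()

    def estimated_wait_s(self):
        # Single-server assumption: wait is the sum of queued service times.
        return sum(r.est_service_s for r in self.pending)

    def try_admit(self, req, now_s):
        if len(self.pending) >= self.max_depth:
            return "rejected_queue_full"
        finish_s = now_s + self.estimated_wait_s() + req.est_service_s
        if finish_s > req.deadline_s:
            return "rejected_deadline"
        self.pending.append(req)
        return "admitted"


queue = AdmissionQueue(max_depth=2)
decisions = [
    queue.try_admit(Request("a", deadline_s=5.0, est_service_s=2.0), now_s=0.0),
    queue.try_admit(Request("b", deadline_s=3.0, est_service_s=2.0), now_s=0.0),
    queue.try_admit(Request("c", deadline_s=10.0, est_service_s=2.0), now_s=0.0),
    queue.try_admit(Request("d", deadline_s=100.0, est_service_s=1.0), now_s=0.0),
]
print(decisions)
```

Request `b` is refused because the work already queued would push it past its deadline, and `d` is refused because the queue is at its bound. Both rejections are reported rather than silently dropped, so shed work stays visible in measurement.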
Traffic classes
Interactive chat, batch summarization, embeddings, coding agents, and high-impact approval workflows require different objectives and capacity reservations. Do not allow background prefill or evaluation work to starve latency-critical decode or safety operations.
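One way to keep background work from starving latency-critical traffic is to reserve capacity per class and let everything else compete for a shared pool. The sketch below assumes capacity is counted in abstract "slots"; the `SlotAllocator` class and the class names are hypothetical.

```python
class SlotAllocator:
    """Per-class reserved slots plus a shared pool open to every class."""

    def __init__(self, total_slots, reserved):
        if sum(reserved.values()) > total_slots:
            raise ValueError("reservations exceed total capacity")
        self.reserved = dict(reserved)
        self.shared_free = total_slots - sum(reserved.values())
        self.in_use = {name: 0 for name in reserved}

    def acquire(self, traffic_class):
        # Use the class's own reservation first, then the shared pool.
        if self.in_use[traffic_class] < self.reserved[traffic_class]:
            self.in_use[traffic_class] += 1
            return True
        if self.shared_free > 0:
            self.shared_free -= 1
            self.in_use[traffic_class] += 1
            return True
        return False

    def release(self, traffic_class):
        if self.in_use[traffic_class] == 0:
            raise ValueError("nothing to release")
        # Slots beyond the reservation were borrowed from the shared pool.
        if self.in_use[traffic_class] > self.reserved[traffic_class]:
            self.shared_free += 1
        self.in_use[traffic_class] -= 1


# Hypothetical capacity: 4 slots, 2 reserved for interactive traffic.
alloc = SlotAllocator(total_slots=4, reserved={"interactive": 2, "batch": 0})
batch_results = [alloc.acquire("batch") for _ in range(3)]
interactive_results = [alloc.acquire("interactive") for _ in range(2)]
print(batch_results, interactive_results)
```

Batch work fills the shared pool and is then refused, while interactive requests still find their reserved slots free.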
Error budgets
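An error budget is the fraction of events allowed to violate an SLO within its window (one minus the target); how fast it is spent balances reliability against the pace of change. Count failures caused by overload, model errors, tools, policy, approvals, and evidence gaps against the budget. Burn-rate alerts should trigger a concrete response such as rollback, scaling, route restriction, or a change freeze.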
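Burn rate is commonly computed as the observed bad-event fraction divided by the budgeted fraction, so a value of 1.0 spends the budget exactly over the SLO window. The sketch below assumes that definition; the two-window check and the thresholds in `choose_action` are hypothetical values for illustration, not recommended settings.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed bad fraction divided by the allowed bad fraction."""
    if total_events == 0:
        return 0.0
    budget_fraction = 1.0 - slo_target
    return (bad_events / total_events) / budget_fraction


def choose_action(fast_window_burn, slow_window_burn):
    """Map burn rates to a response. Thresholds are illustrative assumptions."""
    if fast_window_burn >= 10.0 and slow_window_burn >= 10.0:
        return "rollback"
    if slow_window_burn >= 2.0:
        return "change_freeze"
    return "none"


# Hypothetical counts for a 99% SLO: 50 bad events out of 1000 in each window.
fast = burn_rate(bad_events=50, total_events=1000, slo_target=0.99)
slow = burn_rate(bad_events=50, total_events=1000, slo_target=0.99)
action = choose_action(fast, slow)
print(round(fast, 2), round(slow, 2), action)
```

Requiring both a short and a long window to exceed the threshold before the strongest action is one way to avoid reacting to a brief spike; the right windows and thresholds depend on the service.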
Task-level SLOs
- Time to successful or safely terminated outcome
- First-attempt and recovery-adjusted success
- Approval wait and expiry
- Unauthorized or duplicate side-effect rate
- Evidence completeness and trace correlation
- Cost per validated outcome
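Several of the task-level indicators above can be derived from one per-task record. The sketch below is a minimal illustration; the `TaskRecord` fields and the sample values are assumptions, and a real pipeline would also carry approval and evidence fields.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    attempts: int           # attempts used, including the final one
    succeeded: bool         # final outcome was validated
    cost_usd: float         # total cost across all attempts
    duplicate_effects: int  # duplicate side effects detected for this task


def task_metrics(tasks):
    """Compute task-level SLIs over a non-empty list of task records."""
    n = len(tasks)
    if n == 0:
        raise ValueError("no tasks")
    first_ok = sum(1 for t in tasks if t.succeeded and t.attempts == 1)
    any_ok = sum(1 for t in tasks if t.succeeded)
    with_duplicates = sum(1 for t in tasks if t.duplicate_effects > 0)
    # Failed tasks still cost money, so they stay in the numerator.
    total_cost = sum(t.cost_usd for t in tasks)
    return {
        "first_attempt_success": first_ok / n,
        "recovery_adjusted_success": any_ok / n,
        "duplicate_effect_rate": with_duplicates / n,
        "cost_per_validated_outcome": (
            total_cost / any_ok if any_ok else float("inf")
        ),
    }


# Hypothetical sample of four agent tasks.
tasks = [
    TaskRecord(attempts=1, succeeded=True, cost_usd=0.5, duplicate_effects=0),
    TaskRecord(attempts=3, succeeded=True, cost_usd=1.5, duplicate_effects=0),
    TaskRecord(attempts=2, succeeded=False, cost_usd=0.5, duplicate_effects=1),
    TaskRecord(attempts=1, succeeded=True, cost_usd=0.5, duplicate_effects=0),
]
metrics = task_metrics(tasks)
print(metrics)
```

Dividing total cost, including failed tasks, by validated outcomes keeps the cost figure honest when many attempts are wasted.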
Measurement contract
An SLO is meaningful only when its population, start and stop points, exclusions, aggregation window, and failure treatment are defined. For streaming inference, distinguish queue delay, time to first token, time per output token, and completion. For agentic work, add deadline attainment, valid side-effect completion, approval wait, recovery, evidence persistence, and final task acceptance.
Correlate these signals through one request or workflow identifier. Do not remove retries or failed attempts from cost and capacity accounting merely because the final attempt succeeded. Goodput should count only work that satisfies the declared latency, quality, policy, and evidence conditions.
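As an illustration of correlating attempts by workflow identifier, the sketch below groups attempts, measures elapsed time from the first attempt's start, and sums cost over every attempt. The `Attempt` fields and the sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Attempt:
    workflow_id: str
    start_s: float
    end_s: float
    cost_usd: float
    succeeded: bool


def summarize_workflows(attempts):
    """Group attempts by workflow and account for all of them."""
    by_workflow = defaultdict(list)
    for attempt in attempts:
        by_workflow[attempt.workflow_id].append(attempt)

    summary = {}
    for workflow_id, items in by_workflow.items():
        items.sort(key=lambda a: a.start_s)
        summary[workflow_id] = {
            # The clock starts at the first attempt and is never reset.
            "elapsed_s": max(a.end_s for a in items) - items[0].start_s,
            "attempts": len(items),
            # Failed attempts still count toward cost and capacity.
            "total_cost_usd": sum(a.cost_usd for a in items),
            "succeeded": items[-1].succeeded,
        }
    return summary


# Hypothetical workflow: one failed attempt followed by a successful retry.
attempts = [
    Attempt("wf-1", start_s=0.0, end_s=2.0, cost_usd=0.25, succeeded=False),
    Attempt("wf-1", start_s=2.5, end_s=4.0, cost_usd=0.25, succeeded=True),
]
report = summarize_workflows(attempts)
print(report)
```

Measured from the retry alone, this workflow would appear to take 1.5 seconds at half the cost; measured from the first attempt, it takes 4.0 seconds and pays for both attempts.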
SLO anti-patterns
- Reporting average latency while tail requests violate user deadlines.
- Counting generated tokens as success when the task output is invalid.
- Measuring only accepted traffic and hiding rejected or shed work.
- Combining unlike workloads into one percentile.
- Resetting the clock after a retry or route fallback.
- Ignoring approval, tool, evidence, and recovery time in task completion.
