ARuntime Reference

Runtime SLOs and Goodput

Goodput measures useful completions that satisfy defined latency and quality objectives. It prevents raw throughput from hiding SLO violations or failed work.

Audience: Technical readers
Reading time: 3 minutes
Status: Production guidance
Last reviewed:

Runtime service objectives define useful work under latency, quality, availability, and safety constraints. Goodput counts work that satisfies the objective rather than all work attempted.

Key takeaways

  • Use separate objectives for queue, first token, output token, completion, and task outcome.
  • Admission control protects accepted work.
  • Agentic tasks need deadline, side-effect, recovery, and evidence objectives beyond model latency.

Definitions

SLI
A measured indicator, such as time to first token (TTFT), completion rate, or evidence gap.
SLO
Target for an SLI over a window and traffic class.
SLA
External commitment with consequences.
Goodput
Work completed while meeting the defined SLO and quality constraints.

Service objectives

Define objectives per workload for queue delay, TTFT, time per output token (TPOT) or streaming cadence, full completion, timeout, availability, model quality, tool success, and safe failure. Report percentiles rather than averages, because percentiles expose tail behavior that averages hide.
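As a minimal illustration of why percentiles matter, the sketch below compares the mean with nearest-rank percentiles on a made-up TTFT sample. The `percentile` helper and the sample values are assumptions for illustration, not measurements from any real system.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds: 95 fast requests and 5 slow ones.
ttft_ms = [200] * 95 + [4000] * 5

avg = mean(ttft_ms)            # 390 ms, which looks healthy
p50 = percentile(ttft_ms, 50)  # 200 ms
p99 = percentile(ttft_ms, 99)  # 4000 ms, the tail the average hides
```

In this sample the mean sits well under a one-second target, while one request in twenty waits four seconds for its first token.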

Goodput

Raw tokens per second can rise while users experience slower or invalid results. SLO-constrained goodput counts only requests that meet latency, quality, and completion criteria. For agents, count completed workflows without unauthorized or duplicate effects.
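The difference between raw throughput and goodput can be sketched as follows. The `Completion` record, its field names, and the single latency threshold are assumptions for illustration; a real objective would usually combine several of the indicators listed earlier.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    latency_s: float     # end-to-end latency of the request
    output_tokens: int   # tokens generated, whether or not they were useful
    completed: bool      # the request finished without error or timeout
    quality_ok: bool     # the output passed the workload's quality check


def throughput_and_goodput(records, window_s, latency_slo_s):
    """Return raw token throughput next to SLO-constrained goodput."""
    total_tokens = sum(r.output_tokens for r in records)
    good = [
        r for r in records
        if r.completed and r.quality_ok and r.latency_s <= latency_slo_s
    ]
    return {
        "tokens_per_s": total_tokens / window_s,
        "goodput_rps": len(good) / window_s,
        "good_fraction": len(good) / len(records) if records else 0.0,
    }


# Hypothetical 10-second window with a 2-second latency objective.
records = [
    Completion(1.0, 100, True, True),    # good
    Completion(3.5, 400, True, True),    # valid but too slow
    Completion(1.2, 300, True, False),   # fast but invalid output
    Completion(0.8, 200, False, True),   # did not complete
]
result = throughput_and_goodput(records, window_s=10.0, latency_slo_s=2.0)
```

Here the system produces 100 tokens per second, yet only one request in four counts toward goodput.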

Queueing and overload

Use bounded queues, deadline-aware admission, priority isolation, backpressure, and load shedding. Rejecting work early is often more reliable than accepting requests that cannot meet their objective: the caller can retry or fall back immediately, and the capacity stays available for work that can still succeed.
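A bounded queue with deadline-aware admission might look like the sketch below. The `AdmissionController` class, its fixed per-request service estimate, and the single-worker assumption are simplifications chosen for illustration; a production controller would estimate service time from live measurements.

```python
from collections import deque


class AdmissionController:
    """Bounded FIFO queue that rejects work it does not expect to finish in time."""

    def __init__(self, max_queue, est_service_s):
        self.max_queue = max_queue
        self.est_service_s = est_service_s  # assumed constant service time per request
        self.queue = deque()

    def try_admit(self, request_id, budget_s):
        """budget_s is the time the caller can still wait, measured from now."""
        if len(self.queue) >= self.max_queue:
            return (False, "queue_full")
        # Estimated completion = time to drain the queue ahead + own service time.
        est_finish_s = (len(self.queue) + 1) * self.est_service_s
        if est_finish_s > budget_s:
            return (False, "deadline_unreachable")
        self.queue.append(request_id)
        return (True, "admitted")


controller = AdmissionController(max_queue=3, est_service_s=1.0)
decisions = [
    controller.try_admit("a", budget_s=5.0),   # admitted
    controller.try_admit("b", budget_s=1.5),   # would finish at about 2 s: rejected
    controller.try_admit("c", budget_s=5.0),   # admitted
    controller.try_admit("d", budget_s=5.0),   # admitted
    controller.try_admit("e", budget_s=10.0),  # queue is at its bound: rejected
]
```

Returning a reason with each rejection lets shed work be counted separately from failures, so the rejected population stays visible in reporting.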

Traffic classes

Interactive chat, batch summarization, embeddings, coding agents, and high-impact approval workflows require different objectives and capacity reservations. Do not allow background prefill or evaluation work to starve latency-critical decode or safety operations.
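One way to keep background work from starving latency-critical traffic is to hold back reserved capacity per class. The sketch below assumes a simple pool of concurrency slots; `SlotPool` and the class names are hypothetical, and real schedulers typically reserve finer-grained resources such as batch slots or memory.

```python
class SlotPool:
    """Concurrency slots with per-class reservations (illustrative helper)."""

    def __init__(self, total, reserved):
        if sum(reserved.values()) > total:
            raise ValueError("reservations exceed total slots")
        self.total = total
        self.reserved = dict(reserved)
        self.in_use = {name: 0 for name in reserved}

    def try_acquire(self, traffic_class):
        free = self.total - sum(self.in_use.values())
        # Slots that must stay free to honor other classes' unused reservations.
        held_back = sum(
            max(0, self.reserved[name] - self.in_use[name])
            for name in self.reserved
            if name != traffic_class
        )
        if free - held_back <= 0:
            return False
        self.in_use[traffic_class] += 1
        return True

    def release(self, traffic_class):
        if self.in_use[traffic_class] == 0:
            raise ValueError("no slot held by this class")
        self.in_use[traffic_class] -= 1


pool = SlotPool(total=4, reserved={"interactive": 2, "batch": 0})
batch_grants = [pool.try_acquire("batch") for _ in range(3)]
interactive_grants = [pool.try_acquire("interactive") for _ in range(3)]
```

Batch work can borrow the two unreserved slots but is refused a third, so two slots remain for interactive requests even when batch demand arrives first.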

Error budgets

An error budget is the amount of unreliability an SLO permits over its window, and spending it deliberately is how teams balance reliability against change. Include failures caused by overload, model errors, tools, policy, approvals, and evidence. Burn-rate alerts should connect to rollback, scaling, route restriction, or change freeze.
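Burn rate is commonly computed as the observed failure rate divided by the failure rate the SLO allows, so a value of 1.0 spends the budget exactly over the SLO window. The sketch below assumes that definition and a two-window check; the function names, window contents, and threshold are illustrative choices rather than recommended values.

```python
def burn_rate(bad, total, error_budget):
    """Observed failure fraction divided by the allowed failure fraction."""
    if total == 0:
        return 0.0
    return (bad / total) / error_budget


def should_alert(long_window, short_window, error_budget, threshold):
    """Alert only when both windows burn fast, which filters out brief spikes
    and alerts that keep firing after the problem has stopped."""
    return (
        burn_rate(*long_window, error_budget) >= threshold
        and burn_rate(*short_window, error_budget) >= threshold
    )


# Hypothetical 99% objective, so the error budget is 1% of requests.
budget = 0.01
calm = should_alert((120, 10_000), (30, 500), budget, threshold=2.0)     # long window burns at about 1.2
burning = should_alert((300, 10_000), (30, 500), budget, threshold=2.0)  # both windows burn above 2
```

Each window is a `(bad, total)` pair, where `bad` should include every failure cause the budget covers, not only overload.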

Task-level SLOs

  • Time to successful or safely terminated outcome
  • First-attempt and recovery-adjusted success
  • Approval wait and expiry
  • Unauthorized or duplicate side-effect rate
  • Evidence completeness and trace correlation
  • Cost per validated outcome
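Several of these indicators can be derived from per-task records. The sketch below assumes one plausible set of definitions: first-attempt success means a validated outcome with no retries, recovery-adjusted success means a validated outcome after any number of attempts, and cost per validated outcome divides all spend, including failed tasks, by the number of validated outcomes. The `TaskRun` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    attempts: int                # attempts used, including retries and fallbacks
    validated: bool              # final outcome passed validation
    cost_cents: int              # spend across every attempt of this task
    duplicate_side_effects: int  # effects applied more than once


def task_slis(runs):
    n = len(runs)
    validated = [r for r in runs if r.validated]
    total_cost = sum(r.cost_cents for r in runs)  # failed tasks still cost money
    return {
        "first_attempt_success": sum(1 for r in validated if r.attempts == 1) / n,
        "recovery_adjusted_success": len(validated) / n,
        "duplicate_effect_rate": sum(1 for r in runs if r.duplicate_side_effects > 0) / n,
        "cost_per_validated_cents": total_cost / len(validated) if validated else None,
    }


# Hypothetical batch of four agent tasks.
runs = [
    TaskRun(attempts=1, validated=True, cost_cents=10, duplicate_side_effects=0),
    TaskRun(attempts=3, validated=True, cost_cents=30, duplicate_side_effects=1),
    TaskRun(attempts=2, validated=False, cost_cents=20, duplicate_side_effects=0),
    TaskRun(attempts=1, validated=True, cost_cents=15, duplicate_side_effects=0),
]
slis = task_slis(runs)
```

With these records, half the tasks succeed on the first attempt, three quarters succeed after recovery, and each validated outcome costs 25 cents once the failed task is included.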

Measurement contract

An SLO is meaningful only when its population, start and stop points, exclusions, aggregation window, and failure treatment are defined. For streaming inference, distinguish queue delay, time to first token, time per output token, and completion. For agentic work, add deadline attainment, valid side-effect completion, approval wait, recovery, evidence persistence, and final task acceptance.

Correlate these signals through one request or workflow identifier. Do not remove retries or failed attempts from cost and capacity accounting merely because the final attempt succeeded. Goodput should count only work that satisfies the declared latency, quality, policy, and evidence conditions.
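The correlation and accounting rules above can be sketched by grouping attempts under one workflow identifier. The `Attempt` record and the `summarize` function are assumptions for illustration; the point is that elapsed time runs from the first attempt's start and that cost sums every attempt.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Attempt:
    workflow_id: str
    start_s: float
    end_s: float
    ok: bool
    tokens: int


def summarize(attempts, deadline_s):
    """Aggregate attempts per workflow without resetting the clock on retry."""
    by_id = defaultdict(list)
    for attempt in attempts:
        by_id[attempt.workflow_id].append(attempt)

    summary = {}
    for workflow_id, group in by_id.items():
        group.sort(key=lambda a: a.start_s)
        elapsed_s = group[-1].end_s - group[0].start_s  # first start to last end
        summary[workflow_id] = {
            "attempts": len(group),
            "elapsed_s": elapsed_s,
            "tokens": sum(a.tokens for a in group),  # failed attempts still count
            "good": group[-1].ok and elapsed_s <= deadline_s,
        }
    return summary


# Hypothetical trace: wf-1 fails once and succeeds on retry, wf-2 succeeds directly.
attempts = [
    Attempt("wf-1", 0.0, 4.0, False, 500),
    Attempt("wf-1", 5.0, 8.0, True, 600),
    Attempt("wf-2", 0.0, 2.0, True, 300),
]
summary = summarize(attempts, deadline_s=6.0)
```

The retry for `wf-1` took only three seconds on its own, but the workflow took eight seconds and 1,100 tokens end to end, so it does not count toward goodput against a six-second deadline.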

SLO anti-patterns

  • Reporting average latency while tail requests violate user deadlines.
  • Counting generated tokens as success when the task output is invalid.
  • Measuring only accepted traffic and hiding rejected or shed work.
  • Combining unlike workloads into one percentile.
  • Resetting the clock after a retry or route fallback.
  • Ignoring approval, tool, evidence, and recovery time in task completion.
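The effect of measuring only accepted traffic can be shown with simple arithmetic. The sketch below uses made-up counts, and `success_rates` is a hypothetical helper that reports the same number of good completions against both the accepted and the offered population.

```python
def success_rates(offered, rejected, good):
    """Report success against accepted traffic and against all offered traffic."""
    accepted = offered - rejected
    return {
        "of_accepted": good / accepted if accepted else 0.0,
        "of_offered": good / offered if offered else 0.0,
    }


# Hypothetical interval: 1000 requests offered, 400 shed, 594 completed within the SLO.
rates = success_rates(offered=1000, rejected=400, good=594)
```

Against accepted traffic the service appears to meet a 99% objective, while fewer than 60% of the requests users actually sent received a good result. Publishing both figures keeps shed load visible.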
