Shielda — Service Level Objectives (SLOs)

Owner: SRE
Last reviewed: 2026-04-23
Review cadence: Quarterly (next: 2026-07-23)
Plan reference: PRODUCTION_READINESS_PLAN.md §Phase 5 · docs/audit/SCALE_REPORT.md
Alerting: observability/grafana/provisioning/alerting/rules.yaml

This document is the single source of truth for Shielda's external-facing SLOs, their measurement queries, their error-budget policy, and the response to budget burn.

---

SLO catalogue

| # | SLI | SLO (rolling 30 days) | Target | Measurement |
|---|-----|-----------------------|--------|-------------|
| 1 | API availability | (requests − 5xx) / requests | ≥ 99.9 % | `1 - (sum(rate(shielda_http_requests_total{status=~"5.."}[30d])) / sum(rate(shielda_http_requests_total[30d])))` |
| 2 | API read p95 latency | Share of read requests served in < 500 ms | ≥ 99 % | `sum(rate(shielda_http_request_duration_seconds_bucket{method="GET",le="0.5"}[30d])) / sum(rate(shielda_http_request_duration_seconds_count{method="GET"}[30d]))` |
| 3 | API write p95 latency | Share of write requests served in < 1 s | ≥ 99 % | Same as SLI 2, with `method!="GET"` and `le="1"` |
| 4 | Heartbeat p95 latency | Share of heartbeats served in < 150 ms | ≥ 99 % | Same as SLI 2, scoped `route="./agents/heartbeat"`, `le="0.15"` |
| 5 | Ingest p95 latency | Share of scan-ingest requests served in < 500 ms | ≥ 99 % | Same as SLI 2, scoped `route="./scans."`, `le="0.5"` |
| 6 | Counselor SSE session success | Sessions that close without 5xx | ≥ 99.5 % | `1 - (sum(rate(shielda_sse_sessions_failed_total[30d])) / sum(rate(shielda_sse_sessions_started_total[30d])))` |
| 7 | Agent online ratio | Share of agents reporting heartbeat within 10 min | ≥ 99 % | `shielda_agents_online / shielda_agents_total` |
| 8 | LLM request success | Share of LLM calls that return 2xx | ≥ 99 % | `sum(rate(shielda_llm_request_duration_seconds_count{status=~"2.."}[30d])) / sum(rate(shielda_llm_request_duration_seconds_count[30d]))` |

Each SLI is emitted by the Prometheus endpoint /api/metrics (registry: lib/metrics/registry.ts) and scraped by Grafana Agent / Alloy.
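The latency SLIs in the catalogue all share one query shape: the rate of requests that landed in a "fast enough" histogram bucket, divided by the rate of all requests with the same labels. As a minimal sketch of that shape, the helper below assembles the query string for SLI 3. The function name `latencySliQuery` is hypothetical, and the underscore spelling of the histogram name is an assumption about how the registry exposes it.

```typescript
// Builds the "share of requests faster than a threshold" ratio used by the
// latency SLIs: fast-bucket rate divided by total request rate.
function latencySliQuery(
  histogram: string,        // base histogram name, without _bucket / _count
  selector: string,         // label matchers, e.g. 'method!="GET"'
  thresholdSeconds: string, // must match an existing bucket boundary, e.g. "1"
  range: string = "30d",    // rolling SLO window
): string {
  const good = `sum(rate(${histogram}_bucket{${selector},le="${thresholdSeconds}"}[${range}]))`;
  const all = `sum(rate(${histogram}_count{${selector}}[${range}]))`;
  return `${good} / ${all}`;
}

// SLI 3 (API write latency): non-GET requests served in under 1 s.
const writeLatencyQuery = latencySliQuery(
  "shielda_http_request_duration_seconds", // assumed metric spelling
  'method!="GET"',
  "1",
);
console.log(writeLatencyQuery);
```

The threshold has to coincide with a bucket boundary that the histogram actually defines; otherwise the numerator selects no series and the ratio is empty.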

---

Error budget

- Budget size. SLO 1 at 99.9 % over 30 days allows 0.1 % of requests to fail, equivalent to 43 min 12 s of full 5xx outage per 30-day window.
- Burn window. Budget is evaluated against a rolling 30-day window, refreshed hourly.
- Multi-SLI rule. If any SLO breaches its target, the corresponding budget class is "in the red". For GA we track SLOs 1–5 as customer-critical and SLOs 6–8 as product-critical.
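The budget-size arithmetic can be checked with a few lines of code. This is a sketch only; the function names are illustrative and not taken from the Shielda codebase, and the time figure assumes a total outage in which every request fails.

```typescript
// Error budget for a ratio SLO: the fraction of events allowed to fail.
function errorBudgetFraction(sloTarget: number): number {
  return 1 - sloTarget;
}

// Time equivalent of the budget, assuming a total outage (every request fails).
function budgetAsFullOutageSeconds(sloTarget: number, windowDays: number): number {
  const windowSeconds = windowDays * 24 * 60 * 60;
  return Math.round(windowSeconds * errorBudgetFraction(sloTarget));
}

function formatMinSec(totalSeconds: number): string {
  return `${Math.floor(totalSeconds / 60)} min ${totalSeconds % 60} s`;
}

const outageSeconds = budgetAsFullOutageSeconds(0.999, 30);
console.log(formatMinSec(outageSeconds)); // 43 min 12 s
```

A 30-day window is 43,200 minutes, so a 0.1 % budget is 43.2 minutes. Using an average calendar month (about 30.44 days) instead gives roughly 43 min 50 s, which is why both figures show up in SLO write-ups.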

2.1 Burn-rate alert rules

Fast burn (2 % of 30-day budget in 1 h) and slow burn (10 % of 30-day budget in 6 h) alerts are configured in rules.yaml. When either fires:

| Burn class | Trigger | Action |
|------------|---------|--------|
| Fast | 14.4× normal | Page on-call immediately; incident channel opened |
| Slow | 3× normal | Warning in shielda-sre-alerts; diagnose in next business hour |
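The relationship between a burn-rate multiplier, the alert window, and the share of the 30-day budget consumed is plain arithmetic. The sketch below shows it; the helper names are hypothetical, and the thresholds actually enforced are the ones configured in rules.yaml.

```typescript
const SLO_WINDOW_HOURS = 30 * 24; // rolling 30-day window = 720 h

// Burn rate that spends `budgetFraction` of the budget within `alertWindowHours`.
function burnRateFor(budgetFraction: number, alertWindowHours: number): number {
  return (budgetFraction * SLO_WINDOW_HOURS) / alertWindowHours;
}

// Fraction of the 30-day budget consumed by sustaining `burnRate` for `hours`.
function budgetSpent(burnRate: number, hours: number): number {
  return (burnRate * hours) / SLO_WINDOW_HOURS;
}

// Error ratio an alert would compare against for a given SLO target.
function errorRatioThreshold(burnRate: number, sloTarget: number): number {
  return burnRate * (1 - sloTarget);
}

const fastBurn = burnRateFor(0.02, 1);                     // 2 % of budget in 1 h
const fastThreshold = errorRatioThreshold(fastBurn, 0.999); // SLO 1 error ratio
console.log(fastBurn.toFixed(1), fastThreshold.toFixed(4));
```

Spending 2 % of the budget in 1 h corresponds to a 14.4× burn, which for SLO 1 means a 5xx ratio of about 1.44 % over that hour. By the same arithmetic, a 3× burn sustained for 6 h spends 2.5 % of the budget, a 3× burn sustained for 24 h spends 10 %, and spending 10 % in 6 h would correspond to 12×.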

2.2 Budget consumption policy

| Budget consumed (rolling 30d) | Status | Deploy policy |
|------------------------------:|--------|---------------|
| < 50 % | ✅ Healthy | No restrictions |
| 50 – 75 % | 🟡 Caution | Non-emergency deploys require peer review; feature flags mandatory |
| 75 – 90 % | 🟠 Risk | Freeze non-emergency deploys; SRE sign-off required; postmortem on every new regression |
| > 90 % | 🔴 Exhausted | Stop all deploys except SLO-recovery fixes; engineering must present a recovery plan to leadership within 24 h |

This policy is enforced by the release workflow: a pre-deploy probe fetches the current error-budget status and fails fast if it is 🟠 or 🔴.
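As a rough illustration of the gate logic, the sketch below maps budget consumption to the four status tiers and blocks a deploy on the two strictest ones. It is not the probe's actual code: the type and function names are hypothetical, the handling of values that fall exactly on a tier boundary is an assumption, and any override path for emergency or SLO-recovery deploys is not modelled.

```typescript
type BudgetStatus = "healthy" | "caution" | "risk" | "exhausted";

// Fraction of the error budget consumed, from event counts over the rolling window.
function budgetConsumed(total: number, failed: number, sloTarget: number): number {
  if (total === 0) return 0; // no traffic, nothing spent
  return failed / total / (1 - sloTarget);
}

// Maps consumed budget (0.0 = untouched, 1.0 = fully spent) to a status tier.
// Assumption: exactly 50 % is "caution", exactly 75 % is "risk", exactly 90 % is "risk".
function classifyBudget(consumed: number): BudgetStatus {
  if (consumed < 0.5) return "healthy";
  if (consumed < 0.75) return "caution";
  if (consumed <= 0.9) return "risk";
  return "exhausted";
}

// The pre-deploy gate fails on the "risk" and "exhausted" tiers.
function deployAllowed(consumed: number): boolean {
  const status = classifyBudget(consumed);
  return status !== "risk" && status !== "exhausted";
}

// Illustrative numbers only: 1,000,000 requests, 400 of them 5xx, SLO 1 at 99.9 %.
const consumed = budgetConsumed(1000000, 400, 0.999);
console.log(classifyBudget(consumed), deployAllowed(consumed));
```

Working in "fraction of budget consumed" rather than raw error ratio keeps the gate independent of the individual SLO target, so the same thresholds apply to every SLO in the catalogue.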

---

Review cadence

- Quarterly: full SLO review by SRE + product. Adjust targets if (a) users consistently get a better experience than the SLO requires (raise), or (b) the SLO is chronically violated with no user impact (lower).
- Per-incident: any postmortem for a Sev-1 / Sev-2 incident must include a "did this change our SLO posture?" item.
- Annually: refresh the SLI catalogue against the current product surface. Retire SLIs whose signal we no longer emit; introduce new SLIs for newly-GA surfaces.

The review log lives in docs/audit/SLO_REVIEW_LOG.md (next quarterly entry scheduled 2026-07-23).

---

Operational links

- Metrics endpoint (bearer-auth): /api/metrics.
- Registry source: lib/metrics/registry.ts.
- Dashboards: observability/grafana/dashboards/ (ingest, agent-fleet, counselor-llm, database, fixes-remediation, security-overview).
- Alert rules: rules.yaml.
- Runbook: RUNBOOK.md.
- DR: DISASTER_RECOVERY.md · Q2 drill log DR_DRILL_LOG.md (RTO 3 h 12 min · RPO 9 min).
- IR tabletop: IR_TABLETOP_LOG.md.

---

Glossary

- SLI: Service Level Indicator. The measurement (e.g. "fraction of requests < 500 ms").
- SLO: Service Level Objective. The target for an SLI (e.g. "≥ 99 %").
- SLA: Service Level Agreement. The contractual commitment backed by an SLO. Shielda's customer-facing SLA is in docs/legal/ and is intentionally more lenient than these internal SLOs, so that an SLO breach leaves room to react before the SLA is breached.
- Error budget: 1 − SLO per window. Spending is allowed; exhaustion triggers a deploy freeze.
- Burn rate: how fast we are spending budget relative to the rate that would use it up exactly over the 30-day window. 1× = on track; 14.4× = fast-burn pager.