Shielda Performance Benchmarks
Last updated: 2025-01-20 — Baseline benchmarks for capacity planning and regression tracking.
---
Table of Contents
1. Infrastructure Baseline
2. API Response Times
3. Agent Capacity & Throughput
4. Findings Ingestion Throughput
5. Database Performance at Scale
6. Scan Time Benchmarks
7. Rate Limiting Thresholds
8. Known Bottlenecks & Optimization Opportunities
9. Load Test Script
10. Monitoring & Observability
---
Infrastructure Baseline
Production (AWS ECS Fargate)
| Resource | Value |
|---|---|
| CPU | 1024 units (1 vCPU) |
| Memory | 2048 MiB |
| Desired count | 2 tasks |
| Autoscale max | 10 tasks |
| CPU scaling target | 70% |
| Memory scaling target | 80% |
Database (RDS PostgreSQL)
| Setting | Value |
|---|---|
| Instance class | Per Terraform var |
| Connection pool size | 20 per instance |
| Statement timeout | 30 s |
| Idle timeout | 10 s |
| Max lifetime | 1800 s (30 min) |
| Prepared statements | Enabled |
| Backup retention | 14 days |
| Multi-AZ | Yes |
Total Schema Scale
| Metric | Count |
|---|---|
| Tables | 98 |
| Indexes | ~233 |
| Unique indexes | 31 |
| API routes | 189 |
---
API Response Times
Target SLAs for key endpoints (measured at p50 / p95 / p99):
| Endpoint | Method | Target p50 | Target p95 | Target p99 | Notes |
|---|---|---|---|---|---|
| /api/agents/heartbeat | POST | < 50 ms | < 150 ms | < 300 ms | Hot path: every agent polls every 30 s |
| /api/agents/scans/results | POST | < 200 ms | < 500 ms | < 1 s | Up to 500 signals per batch |
| /api/agents/verdicts | POST | < 200 ms | < 500 ms | < 1 s | Up to 500 verdicts per batch |
| /api/findings (list) | GET | < 100 ms | < 300 ms | < 500 ms | Paginated (max 200), severity counts |
| /api/dashboard | GET | < 150 ms | < 500 ms | < 1 s | Multiple aggregate queries |
| /api/scans (create) | POST | < 100 ms | < 300 ms | < 500 ms | Queues task, doesn't wait for scan |
| /api/compliance-docs (generate) | POST | < 5 s | < 15 s | < 30 s | LLM-backed document generation |
| Scheduled crons (tick) | POST | < 2 s | < 5 s | < 10 s | Batch processing per invocation |
How to Measure
API response times are captured via:

- Sentry Performance Monitoring: server traces at 10% sample rate (production)
- Agent Prometheus metrics: shieldaagenttooldurationms histogram
- Load test script: see Section 9
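As an illustration of how raw latency samples from any of these sources could be compared against the p50 / p95 / p99 targets, here is a minimal sketch using the nearest-rank percentile method. The function names and the `HEARTBEAT_TARGETS_MS` constant are hypothetical; the target values are copied from the heartbeat row of the table. Sentry and Prometheus compute their own percentiles, possibly with different interpolation, so this is only a way to sanity-check exported samples.

```python
import math

# Hypothetical constant: targets for /api/agents/heartbeat, in milliseconds,
# copied from the SLA table (each target is a strict upper bound).
HEARTBEAT_TARGETS_MS = {50: 50.0, 95: 150.0, 99: 300.0}


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_sla(samples, targets):
    """Return {pct: (observed_ms, target_ms, within_target)} for each target."""
    report = {}
    for pct, target in targets.items():
        observed = percentile(samples, pct)
        report[pct] = (observed, target, observed < target)
    return report
```

Calling `check_sla(latencies_ms, HEARTBEAT_TARGETS_MS)` on a list of measured heartbeat latencies yields one entry per percentile, which makes a regression in the tail (p99) visible even when the median still looks healthy.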
---
Agent Capacity & Throughput
Per-Agent Limits
| Parameter | Value | Source |
|---|---|---|
| Heartbeat interval | 30 s | Go agent time.NewTicker |
| Offline detection timeout | 90 s (config) / 300 s (cron) | Configurable |
| Max parallel Docker tools | 4 | Semaphore-based runner |
| Findings batch upload | 500 signals/POST | Client chunks at 500 |
| Verdicts batch upload | 500 verdicts/POST | Same pattern |
| Scan task TTL | 5 minutes | Task reclaim after timeout |
| Tool output max | 50 MB stdout, 1 MB stderr | Hard limits |
| Container memory default | 512 MiB | Per Docker tool container |
| Agent binary size | ~50 MB | Static Go binary |
Per-Plan Agent Limits
| Plan | Max Agents | Max Scans/Month | Max Services | Max Users |
|---|---|---|---|---|
| Starter | 2 | 20 | 5 | 3 |
| Pro | 10 | 500 | 50 | 20 |
| Business | Custom | Custom | Custom | Custom |
Estimated Concurrent Agent Capacity
Per control-plane instance (1 vCPU, 2 GiB):
| Scenario | Agents | Heartbeats/min | Findings/min | Notes |
|---|---|---|---|---|
| Idle fleet | 200 | ~400 | 0 | Heartbeat only, no scans |
| Light scanning | 50 | ~100 | ~500 | 10 agents actively scanning |
| Heavy scanning | 20 | ~40 | ~2,000 | All agents uploading results |
| Burst ingestion | 10 | ~20 | ~5,000 | Max batch throughput |
Scaling: At 10 ECS tasks (autoscale max), theoretical capacity: 2,000 idle agents or 50,000 findings/min burst ingestion.
⚠️ These are estimated figures based on infrastructure specs and batch sizes. Production load testing required to validate.
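The arithmetic behind the heartbeat column and the scaling figures can be reproduced with a short sketch. It assumes that capacity scales linearly with ECS task count and that the database is not the limiting factor, which is exactly what load testing would need to confirm. The function and constant names are illustrative only.

```python
HEARTBEAT_INTERVAL_S = 30   # from the per-agent limits
AUTOSCALE_MAX_TASKS = 10    # from the infrastructure baseline


def heartbeats_per_min(agents):
    """Each agent sends one heartbeat per interval, i.e. 60 / 30 = 2 per minute."""
    return agents * 60 // HEARTBEAT_INTERVAL_S


def fleet_capacity(per_instance, tasks=AUTOSCALE_MAX_TASKS):
    """Assumption: capacity scales linearly with the number of tasks."""
    return per_instance * tasks


print(heartbeats_per_min(200))   # 400 heartbeats/min for the idle-fleet row
print(fleet_capacity(200))       # 2000 idle agents at autoscale max
print(fleet_capacity(5000))      # 50000 findings/min burst at autoscale max
```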
---
Findings Ingestion Throughput
Processing Pipeline
Measured Throughput Targets
| Scale | Findings | Expected Ingestion Time | DB Impact |
|---|---|---|---|
| Small batch | 50 | < 200 ms | 50 individual upserts |
| Standard batch | 500 | < 1 s | 10 chunks × 50 upserts |
| Rapid-fire | 5,000 | < 10 s (10 POSTs) | 100 chunks total |
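The DB Impact column follows from two batch sizes: 500 findings per POST and, as inferred from "10 chunks × 50 upserts", 50 upserts per chunk. The sketch below reproduces those counts. `ingestion_plan` and both constants are hypothetical names, and the chunk size of 50 is an inference from the table rather than a confirmed setting.

```python
import math

POST_BATCH_SIZE = 500  # findings per POST, from the per-agent limits
DB_CHUNK_SIZE = 50     # upserts per chunk, inferred from the table above


def ingestion_plan(findings):
    """Count the POSTs, DB chunks, and upserts needed for a number of findings."""
    if findings < 0:
        raise ValueError("findings must be non-negative")
    posts = math.ceil(findings / POST_BATCH_SIZE)
    chunks = 0
    remaining = findings
    while remaining > 0:
        batch = min(remaining, POST_BATCH_SIZE)      # one POST body
        chunks += math.ceil(batch / DB_CHUNK_SIZE)   # chunks within that POST
        remaining -= batch
    return {"posts": posts, "chunks": chunks, "upserts": findings}


print(ingestion_plan(50))     # {'posts': 1, 'chunks': 1, 'upserts': 50}
print(ingestion_plan(500))    # {'posts': 1, 'chunks': 10, 'upserts': 500}
print(ingestion_plan(5000))   # {'posts': 10, 'chunks': 100, 'upserts': 5000}
```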
Deduplication
Findings are deduped by the composite unique index (orgId, fingerprint). The upsert path uses ON CONFLICT DO UPDATE, merging lastSeen, severity, and status on duplicates.
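To make the dedup behaviour concrete, here is a minimal sketch of an upsert keyed on (orgId, fingerprint). It runs against an in-memory SQLite database as a stand-in for PostgreSQL, since both accept this `ON CONFLICT ... DO UPDATE` form. The `findings` table name, the column types, and the merge rule (incoming values overwrite the stored ones) are assumptions; the real merge logic for severity and status may differ.

```python
import sqlite3

# Assumed merge rule: on a duplicate key the incoming row's lastSeen,
# severity, and status replace the stored values.
UPSERT_SQL = """
INSERT INTO findings (orgId, fingerprint, severity, status, lastSeen)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (orgId, fingerprint) DO UPDATE SET
    lastSeen = excluded.lastSeen,
    severity = excluded.severity,
    status   = excluded.status
"""


def make_db():
    """Create an in-memory table with the composite unique index (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE findings (
               orgId TEXT NOT NULL,
               fingerprint TEXT NOT NULL,
               severity TEXT NOT NULL,
               status TEXT NOT NULL,
               lastSeen TEXT NOT NULL,
               UNIQUE (orgId, fingerprint))"""
    )
    return conn


def upsert_finding(conn, finding):
    """finding is (orgId, fingerprint, severity, status, lastSeen)."""
    conn.execute(UPSERT_SQL, finding)
```

Upserting the same (orgId, fingerprint) pair twice leaves a single row whose lastSeen, severity, and status reflect the second call.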
Optimization Opportunities
- Bulk INSERT: The current path does individual upserts within Promise.all. A VALUES (...), (...), ... bulk insert could yield a 5-10x throughput improvement.
- Streaming ingestion: The agent currently buffers all results before upload. Streaming could reduce latency for large scans.
- Read replicas: Dashboard read queries could be routed to read replicas to avoid contention with ingestion writes.
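The bulk INSERT idea could look roughly like the sketch below, which folds one chunk of findings into a single multi-row upsert. It again uses in-memory SQLite as a stand-in; the `findings` table, its columns, and all function names are hypothetical, and a PostgreSQL driver would use its own placeholder style. One point that matters on PostgreSQL: a single `ON CONFLICT DO UPDATE` statement cannot touch the same row twice, so the chunk is deduplicated by key first (last occurrence wins, an assumed rule).

```python
import sqlite3

COLUMNS = ("orgId", "fingerprint", "severity", "status", "lastSeen")


def make_db():
    """In-memory stand-in table with the composite unique index (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE findings (orgId TEXT, fingerprint TEXT, severity TEXT, "
        "status TEXT, lastSeen TEXT, UNIQUE (orgId, fingerprint))"
    )
    return conn


def dedupe(rows):
    """Keep the last row seen for each (orgId, fingerprint) key."""
    latest = {}
    for row in rows:
        latest[(row[0], row[1])] = row
    return list(latest.values())


def build_bulk_upsert(row_count):
    """Build one INSERT with row_count placeholder groups: VALUES (...), (...), ..."""
    if row_count < 1:
        raise ValueError("row_count must be at least 1")
    group = "(" + ", ".join("?" for _ in COLUMNS) + ")"
    values = ", ".join(group for _ in range(row_count))
    return (
        f"INSERT INTO findings ({', '.join(COLUMNS)}) VALUES {values} "
        "ON CONFLICT (orgId, fingerprint) DO UPDATE SET "
        "lastSeen = excluded.lastSeen, severity = excluded.severity, "
        "status = excluded.status"
    )


def bulk_upsert(conn, rows):
    """Upsert a chunk in one statement; returns the number of distinct keys written."""
    unique = dedupe(rows)
    if not unique:
        return 0
    params = [value for row in unique for value in row]
    conn.execute(build_bulk_upsert(len(unique)), params)
    return len(unique)
```

With a chunk of 50 findings this issues one statement with 250 bound parameters instead of 50 separate round trips. Whether that reaches the estimated 5-10x improvement would need to be measured against the real schema and indexes.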
---