Disaster Recovery

Recovery procedures for critical failure scenarios.

---

Recovery Targets

Metric Target Notes RPO (Recovery Point Objective) 5 minutes Continuous WAL archiving RTO (Recovery Time Objective) 30 minutes Automated failover Backup frequency Continuous + daily snapshot RDS automated backups Retention 35 days Daily snapshots retained Cross-region replicas 1 (us-west-2 → us-east-1) Async replication

Failure Scenarios

Control Plane Down

Symptoms: Dashboard unreachable, API 5xx errors.

Recovery: bash 1. Check ECS service status aws ecs describe-services --cluster shielda --services control-plane

2. Force new deployment aws ecs update-service --cluster shielda --service control-plane --force-new-deployment

3. Check CloudWatch logs aws logs tail /ecs/shielda-control-plane --follow

4. Verify health curl https://app.shielda.dev/api/health

Fallback: Roll back to previous task definition:

Database Failure

Symptoms: API errors with "connection refused" or "too many connections."

Recovery (RDS Multi-AZ): bash Automatic failover to standby (< 60s) Monitor via CloudWatch alarm: shielda-rds-cpu-high

Manual failover if needed aws rds reboot-db-instance --db-instance-identifier shielda-prod --force-failover

Point-in-Time Recovery: bash Restore to specific timestamp aws rds restore-db-instance-to-point-in-time \ --source-db-instance-identifier shielda-prod \ --target-db-instance-identifier shielda-prod-restored \ --restore-time "2025-01-15T10:30:00Z"

Update DATABASEURL to point to restored instance Run pending migrations cd control-plane && npx drizzle-kit push

Complete Region Failure

Symptoms: All AWS services in primary region unavailable.

Recovery: bash 1. Promote cross-region read replica aws rds promote-read-replica \ --db-instance-identifier shielda-replica-us-east-1

2. Update DNS to point to DR region (Route 53 health check should auto-failover if configured)

3. Deploy control plane in DR region cd infra/control-plane terraform workspace select dr terraform apply

4. Verify data integrity psql $DRDATABASEURL -c "SELECT count() FROM organizations;"

Agent Fleet Offline

Symptoms: All agents show offline, no heartbeats.

Recovery: bash 1. Check control plane is accepting heartbeats curl -X POST https://app.shielda.dev/api/agents/heartbeat \ -H "Authorization: Bearer test" -d '{}' Should return 401 (not 5xx)

2. If control plane issue, restart (see scenario 1)

3. If agent issue, check agent logs ssh agent-host "journalctl -u shielda-agent --since '1 hour ago'"

4. Restart agent fleet ssh agent-host "systemctl restart shielda-agent" Or for Docker: docker restart shielda-agent

Data Corruption

Symptoms: Inconsistent data, missing records, constraint violations.

Recovery: bash 1. Identify corrupted tables psql $DATABASEURL -c " SELECT schemaname, relname, lastautoanalyze FROM pgstatusertables WHERE ndeadtup 10000 ORDER BY ndeadtup DESC; "

2. Point-in-time restore (see scenario 2)

3. For partial corruption, restore specific tables: pgdump --table=affectedtable shielda-prod-restored psql $DATABASEURL

Backup Verification

Automated (Weekly)

CloudWatch rule triggers backup restoration test: Restore latest snapshot to temporary instance Run integrity checks Drop temporary instance Alert if any check fails

Manual Verification

2. Restore and verify aws rds restore-db-instance-from-db-snapshot \ --db-instance-identifier shielda-verify-$(date +%Y%m%d) \ --db-snapshot-identifier <snapshot-id