Disaster Recovery
Recovery procedures for critical failure scenarios.
Recovery procedures for critical failure scenarios.
---
Recovery Targets
Metric Target Notes RPO (Recovery Point Objective) 5 minutes Continuous WAL archiving RTO (Recovery Time Objective) 30 minutes Automated failover Backup frequency Continuous + daily snapshot RDS automated backups Retention 35 days Daily snapshots retained Cross-region replicas 1 (us-west-2 → us-east-1) Async replication
Failure Scenarios
Control Plane Down
Symptoms: Dashboard unreachable, API 5xx errors.
Recovery: bash 1. Check ECS service status aws ecs describe-services --cluster shielda --services control-plane
2. Force new deployment aws ecs update-service --cluster shielda --service control-plane --force-new-deployment
3. Check CloudWatch logs aws logs tail /ecs/shielda-control-plane --follow
4. Verify health curl https://app.shielda.dev/api/health
Fallback: Roll back to previous task definition:
Database Failure
Symptoms: API errors with "connection refused" or "too many connections."
Recovery (RDS Multi-AZ): bash Automatic failover to standby (< 60s) Monitor via CloudWatch alarm: shielda-rds-cpu-high
Manual failover if needed aws rds reboot-db-instance --db-instance-identifier shielda-prod --force-failover
Point-in-Time Recovery: bash Restore to specific timestamp aws rds restore-db-instance-to-point-in-time \ --source-db-instance-identifier shielda-prod \ --target-db-instance-identifier shielda-prod-restored \ --restore-time "2025-01-15T10:30:00Z"
Update DATABASEURL to point to restored instance Run pending migrations cd control-plane && npx drizzle-kit push
Complete Region Failure
Symptoms: All AWS services in primary region unavailable.
Recovery: bash 1. Promote cross-region read replica aws rds promote-read-replica \ --db-instance-identifier shielda-replica-us-east-1
2. Update DNS to point to DR region (Route 53 health check should auto-failover if configured)
3. Deploy control plane in DR region cd infra/control-plane terraform workspace select dr terraform apply
4. Verify data integrity psql $DRDATABASEURL -c "SELECT count() FROM organizations;"
Agent Fleet Offline
Symptoms: All agents show offline, no heartbeats.
Recovery: bash 1. Check control plane is accepting heartbeats curl -X POST https://app.shielda.dev/api/agents/heartbeat \ -H "Authorization: Bearer test" -d '{}' Should return 401 (not 5xx)
2. If control plane issue, restart (see scenario 1)
3. If agent issue, check agent logs ssh agent-host "journalctl -u shielda-agent --since '1 hour ago'"
4. Restart agent fleet ssh agent-host "systemctl restart shielda-agent" Or for Docker: docker restart shielda-agent
Data Corruption
Symptoms: Inconsistent data, missing records, constraint violations.
Recovery: bash 1. Identify corrupted tables psql $DATABASEURL -c " SELECT schemaname, relname, lastautoanalyze FROM pgstatusertables WHERE ndeadtup 10000 ORDER BY ndeadtup DESC; "
2. Point-in-time restore (see scenario 2)
3. For partial corruption, restore specific tables: pgdump --table=affectedtable shielda-prod-restored psql $DATABASEURL
Backup Verification
Automated (Weekly)
CloudWatch rule triggers backup restoration test: Restore latest snapshot to temporary instance Run integrity checks Drop temporary instance Alert if any check fails
Manual Verification
2. Restore and verify aws rds restore-db-instance-from-db-snapshot \ --db-instance-identifier shielda-verify-$(date +%Y%m%d) \ --db-snapshot-identifier <snapshot-id