Backup and Disaster Recovery Strategy
Executive Summary
RestartiX processes state-funded insurance claims where exercise and therapy data serves as proof of service delivery. Loss of this data would:
- Prevent reimbursement claims from insurance providers
- Eliminate fraud prevention evidence
- Expose the organization to legal liability
- Violate HIPAA 6-year medical record retention requirements
- Undermine audit compliance for state funding
Therefore: Data backup is not optional. This document defines a defense-in-depth backup strategy with multiple independent layers.
Compliance and Legal Requirements
Data Retention Mandates
| Data Category | Retention Period | Legal Basis | Consequence of Loss |
|---|---|---|---|
| Exercise/therapy logs | 7 years | State insurance fraud prevention | Cannot prove services delivered, risk fraud accusations |
| Appointments | 6 years | HIPAA medical records | Cannot defend malpractice claims |
| Signed consent forms | 6 years | GDPR Art. 7 + HIPAA | Cannot prove patient consent |
| Prescriptions/reports | 6 years | HIPAA | Medical-legal liability |
| Audit logs | 6 years | HIPAA §164.312(b) | Cannot prove security compliance |
| Insurance claim metadata | 7 years | State financial audit requirements | Cannot reconcile payments |
Why 7 Years?
State financial audits can request records up to 7 years retroactively. If you cannot produce exercise logs proving services were delivered, you may be required to refund state payments and face fraud investigations.
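The retention math above can be made concrete. A minimal Python sketch of the audit-window check (function names are illustrative, not from the codebase):

```python
from datetime import date

RETENTION_YEARS = 7  # state financial audit window described above

def retention_expiry(created: date, years: int = RETENTION_YEARS) -> date:
    """Earliest date a record may legally be purged: `years` after creation."""
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # a Feb 29 creation date landing in a non-leap target year
        return created.replace(year=created.year + years, day=28)

def must_retain(created: date, today: date) -> bool:
    """True while the record is still inside the audit window."""
    return today < retention_expiry(created)

# An exercise log created in March 2020 is still auditable in February 2026;
# one created in January 2018 has aged out of the 7-year window.
assert must_retain(date(2020, 3, 1), date(2026, 2, 15))
assert not must_retain(date(2018, 1, 1), date(2026, 2, 15))
```

The same predicate, run against object creation dates, is what the S3 lifecycle and Object Lock configuration described later enforce mechanically.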
Backup Architecture: The 3-2-1-1 Rule
We implement an enhanced version of the industry-standard 3-2-1 rule, adding a fourth layer for state compliance:
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 0: LIVE PRODUCTION DATABASE │
│ - Production: AWS RDS Postgres 17, Multi-AZ │
│ - Staging: Aurora Serverless v2, single-AZ │
│ - RLS-enforced multi-tenant isolation │
│ - Real-time data, constantly changing │
│ - RPO: 0 seconds (no data loss tolerance during operation) │
│ - See [aws-infrastructure.md](/architecture/aws-infrastructure)│
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 1: AWS-MANAGED BACKUPS (VENDOR-CONTROLLED) │
│ - Continuous WAL streaming → point-in-time recovery (PITR) │
│ - Production retention: 7 days (extensible to 35) │
│ - Staging retention: 1 day (staging-grade) │
│ - Daily automated snapshots │
│ - Manual snapshots before risky deploys │
│ - Cross-AZ standby (production) auto-promotes on failure │
│ - Fast restore (minutes for PITR; hours for snapshot copy) │
│ │
│ ⚠️ LIMITATION: AWS-account-scoped. Account compromise or │
│ catastrophic AWS failure makes this layer unrecoverable. │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 2: DAILY LOGICAL BACKUPS (OUR CONTROL - PRIMARY SAFETY) │
│ - Daily pg_dump to S3 (separate bucket from Layer 0) │
│ - Encrypted with separate envelope key (BACKUP_ENCRYPTION_KEY) │
│ - Versioned and immutable (S3 Object Lock COMPLIANCE mode) │
│ - Retention: 7 years (lifecycled to Glacier Deep Archive) │
│ - Format: Custom PostgreSQL dump (compressed) │
│ │
│ ✅ GUARANTEES: │
│ - Restoreable to any PostgreSQL 17 instance │
│ - Survives RDS-account-level compromise (separate IAM) │
│ - Protected from ransomware (immutable storage) │
│ - Meets state audit requirements │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 3: WEEKLY CROSS-REGION REPLICATION (GEOGRAPHIC SAFETY) │
│ - Weekly copy of Layer 2 backups to a different EU region │
│ - Source: eu-central-1 (Frankfurt) → Replica: eu-west-1 │
│ (Ireland) or eu-west-3 (Paris) │
│ - GDPR: replication target stays inside the EU │
│ - Protection against regional disasters │
│ - Same retention: 7 years │
│ - Same immutability guarantees │
│ │
│ ✅ GUARANTEES: │
│ - Survives entire AWS region failure │
│ - Survives natural disasters (floods, fires) │
│ - Survives data-residency-preserving geographic events │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 4: QUARTERLY OFFLINE ARCHIVE (COLD STORAGE - OPTIONAL) │
│ - Quarterly snapshot exported to air-gapped storage │
│ - Stored offline (disconnected from network) │
│ - Physical media or encrypted external drive │
│ - Retention: 7 years │
│ │
│ ✅ GUARANTEES: │
│ - Survives complete cloud infrastructure compromise │
│ - Ultimate protection against ransomware │
│ - Physical possession for legal/audit purposes │
│ │
│ ⚠️ TRADE-OFF: Manual process, slower restore time │
│ 📋 RECOMMENDATION: Only if state audits explicitly require │
└─────────────────────────────────────────────────────────────────┘

Disaster Recovery Scenarios
Scenario Matrix
| Disaster | Layer 1 (RDS PITR) | Layer 2 (Our S3) | Layer 3 (Cross-Region) | Layer 4 (Offline) |
|---|---|---|---|---|
| Accidental DELETE query | ✅ Restore via PITR (5 min) | ✅ Restore from yesterday (30 min) | ✅ Restore from last week (1 hour) | ✅ Restore from last quarter (4 hours) |
| Ransomware encrypts database | ❌ Encrypted | ✅ Immutable backup survives | ✅ Geographic copy survives | ✅ Offline copy survives |
| RDS AZ outage | ✅ Multi-AZ auto-failover (~60s) | ✅ Restore to new instance (1 hour) | ✅ Restore from replica (1 hour) | ✅ Restore from offline (4 hours) |
| AWS account compromised | ❌ All AWS resources at risk | ✅ Separate IAM, survives | ✅ Separate region scope, survives | ✅ Full restore to any Postgres |
| AWS S3 regional failure | ✅ RDS still operational | ❌ Primary backups unavailable | ✅ Cross-region copy available | ✅ Offline copy available |
| Developer accidentally drops table | ✅ PITR restore (10 min) | ✅ Restore specific table (20 min) | ✅ Restore specific table (30 min) | ✅ Restore specific table (2 hours) |
| State audit requests 5-year-old data | ❌ Outside retention window | ✅ Retrieve from S3 archive | ✅ Retrieve from cross-region | ✅ Retrieve from offline archive |
| Hacker deletes RDS snapshots | ❌ AWS-scoped backups compromised | ✅ Separate credentials, survives | ✅ Separate credentials, survives | ✅ Air-gapped, survives |
| Complete internet/cloud collapse | ❌ Inaccessible | ❌ Inaccessible | ❌ Inaccessible | ✅ Physical possession |
Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO)
| Scenario | RTO (Max Downtime) | RPO (Max Data Loss) | Recovery Source |
|---|---|---|---|
| AZ failure (production) | ~60 seconds | 0 seconds | RDS Multi-AZ auto-failover |
| Minor data corruption | 15 minutes | 0 seconds | RDS PITR |
| Table accidentally dropped | 30 minutes | < 1 hour | RDS PITR or daily backup |
| Database-wide corruption | 2 hours | < 24 hours | Daily backup (Layer 2) |
| Regional disaster | 4 hours | < 24 hours | Cross-region backup (Layer 3) |
| Complete provider failure | 8 hours | < 7 days | Weekly cross-region + daily backups |
| Catastrophic global event | 24 hours | < 90 days | Offline archive (Layer 4) |
Implementation Details
Layer 1: AWS-Managed Backups (RDS + Aurora)
What RDS provides (production):
- Continuous backup via Write-Ahead Log (WAL) streaming to S3 (AWS-internal, separate from our Layer 2 bucket)
- Point-in-time recovery (PITR) to any second within retention window
- Daily automated snapshots, retained for the same window
- Manual snapshots — taken before risky deploys, retained indefinitely until explicitly deleted
- Multi-AZ synchronous standby — auto-promoted on primary failure (~60s, DNS endpoint stable)
- Backups taken from the standby — zero performance impact on the primary
What Aurora Serverless v2 provides (staging):
- Same continuous-backup model on Aurora's storage layer
- PITR within the configured retention window
- 1-day retention is sufficient for a staging environment
Configuration:
RDS (production):
Engine: PostgreSQL 17
Instance: db.t4g.medium, Multi-AZ
Storage: 50 GB gp3 (auto-scaling enabled to 200 GB)
Backup retention: 7 days (extensible to 35)
Continuous backup: Enabled (PITR to any second within window)
Snapshot replication: Manual snapshots → Layer 3 cross-region copy
Encryption at rest: AWS-managed KMS
Force SSL: Enabled (rds.force_ssl=1)
Aurora Serverless v2 (staging):
Engine: aurora-postgresql 17
Capacity: 0.5–2 ACU, scale-to-zero
Backup retention: 1 day
Continuous backup: Enabled
Encryption at rest: AWS-managed KMS
Cost (production):
See aws-infrastructure.md → Cost: production day 1.
RDS instance + storage + backups = ~$134/mo at db.t4g.medium Multi-AZ.
Layer 1 backup-specific cost is included in the RDS line; AWS does not bill PITR
storage separately as long as backup retention < instance storage size.

Our responsibility:
- Monitor the AWS Health Dashboard for RDS / Aurora regional events
- Test PITR restore monthly into a temporary ephemeral DB (see Testing section)
- Document the restore runbook (Runbook 1 below)
- Take a manual snapshot before any risky migration or schema change
- Track backup-window timing (production maintenance window set to a low-traffic UTC slot)
Layer 2: Daily Logical Backups (CRITICAL LAYER)
Why This Layer is Critical:
- Vendor independence: Can restore to any PostgreSQL provider
- Fraud defense: Immutable proof of historical data
- Audit compliance: Long-term retention (7 years)
- Ransomware protection: Write-once, read-many storage
Backup Schedule:
Frequency: Daily at 02:00 UTC (low-traffic window)
Method: pg_dump with custom format
Compression: gzip level 9
Encryption: AES-256 (separate key from application encryption)

Storage Strategy:
Provider: AWS S3 (separate bucket from RDS automated backups; separate IAM,
separate KMS context, different failure domain at the credentials layer)
Bucket Configuration:
- Versioning: Enabled
- Object Lock: COMPLIANCE mode (cannot delete for 7 years)
- Lifecycle Policy:
* Days 0-90: S3 Standard (hot, fast retrieval)
* Days 91-365: S3 Glacier Instant Retrieval (warm)
* Days 366+: S3 Glacier Deep Archive (cold, 12hr retrieval)
- Encryption: AES-256 (server-side, AWS-managed keys)
- Replication: Enable to Layer 3 (cross-region)
Naming Convention:
s3://restartix-backups-primary/
├── daily/
│ ├── 2026-02-15-core-full.pgdump.gz.enc
│ ├── 2026-02-14-core-full.pgdump.gz.enc
│ └── ...
├── weekly/ (Sunday snapshots, kept separately)
│ ├── 2026-02-09-core-full.pgdump.gz.enc
│ └── ...
└── monthly/ (First of month, kept separately)
├── 2026-02-01-core-full.pgdump.gz.enc
└── ...

Backup Process (Automated):
- Export: `pg_dump --format=custom` from RDS (Multi-AZ standby endpoint to avoid load on primary; read replica when Phase 2 adds them)
- Compress: `gzip -9` (90% compression ratio typical)
- Encrypt: AES-256 with backup-specific envelope key (`BACKUP_ENCRYPTION_KEY` in Secrets Manager, distinct from application column-encryption keys)
- Upload: S3 with metadata (DB size, row counts per org, SHA-256 checksum)
- Verify: Download and re-checksum the uploaded object
- Alert: Notify on backup failure, incomplete upload, or suspiciously small artifact
The job runs as a scheduled ECS task (EventBridge Scheduler → ECS RunTask), uses IAM credentials scoped to S3 write-only on the backup bucket and RDS read on the database.
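The schedule and bucket layout above imply a small piece of logic in the backup job: which prefix (daily/weekly/monthly) a given run writes to, and what the artifact key looks like. A sketch under the stated conventions (Sunday → weekly, first of month → monthly; the monthly-beats-weekly precedence is our assumption, and function names are illustrative):

```python
from datetime import date

BUCKET = "restartix-backups-primary"  # Layer 2 bucket from the layout above

def backup_tier(d: date) -> str:
    """First of month wins over Sunday when both apply (assumption)."""
    if d.day == 1:
        return "monthly"
    if d.weekday() == 6:  # Sunday, per the weekly schedule
        return "weekly"
    return "daily"

def artifact_key(d: date) -> str:
    """Object key matching the naming convention shown above."""
    return f"{backup_tier(d)}/{d.isoformat()}-core-full.pgdump.gz.enc"

assert backup_tier(date(2026, 2, 1)) == "monthly"  # first of month
assert backup_tier(date(2026, 2, 8)) == "weekly"   # a Sunday
assert artifact_key(date(2026, 2, 10)) == "daily/2026-02-10-core-full.pgdump.gz.enc"
```

Keeping this pure (date in, key out) makes it trivial to unit-test independently of the ECS task that runs the actual dump.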
Estimated Cost:
Database Size: 500 GB
Daily Growth: 1 GB
Compression Ratio: 90% (compressed: 50 GB per backup)
Storage Costs (7 years = 2,555 days):
Year 1 (365 days):
- Daily backups: 365 × 50 GB = 18.25 TB
- 90 days × $0.023/GB = $103/month (S3 Standard)
- 275 days × $0.004/GB = $55/month (Glacier Instant)
Years 2-7 (all in Glacier Deep Archive):
- Total: ~100 TB
- Cost: 100,000 GB × $0.00099/GB = $99/month
Total: ~$160/month (scales with DB size)
Per-org cost: $0.16/month (negligible)

Layer 3: Weekly Cross-Region Replication
Purpose:
- Geographic redundancy (survive regional disasters)
- Compliance with state requirements for off-site backups
- Defense against geopolitical/infrastructure risks
Configuration:
Source: eu-central-1 (Frankfurt) - Primary backup bucket
Destination: eu-west-1 (Ireland) - Cross-region replica
[or eu-west-3 (Paris) — both are EU regions and acceptable]
GDPR note: replication target stays inside the EU. US regions are not used
because GDPR Day-1 compliance requires patient data to remain
within the EU. See decisions.md → Why clinic is controller, platform
is processor.
Replication Rule:
- Frequency: Weekly (Sunday after daily backup completes)
- What to replicate: Weekly and monthly backups only (not all daily)
- Storage class: S3 Glacier Instant Retrieval (cheaper, same compliance)
- Retention: 7 years (same as primary)
- Encryption: Replicate with same AES-256
Estimated Cost:
- Storage: ~$50/month (subset of daily backups)
- Data transfer: ~$10/month (cross-region replication, intra-EU)
Total: ~$60/month

Layer 4: Quarterly Offline Archive (OPTIONAL)
When to Implement:
- State auditors explicitly request offline backups
- High-risk contracts with zero-tolerance data loss clauses
- Legal requirement for physical evidence custody
- Enhanced ransomware protection (air-gapped)
Implementation Options:
Option A: Encrypted External Drives
Hardware:
- 2TB enterprise-grade external SSD
- Hardware encryption (FIPS 140-2 certified)
Process:
1. Quarterly: Download latest monthly backup from S3
2. Verify checksum
3. Copy to encrypted drive
4. Store in physical safe (fireproof, waterproof)
5. Document in audit log (who, when, where)
Cost: ~€200/year (drive replacement every 3 years)

Option B: Tape Backup (Long-term archive)
Hardware:
- LTO-9 tape drive (~€3,000)
- LTO-9 tapes (~€100/tape, 18TB capacity)
Process:
- Quarterly: Write backup to tape
- Store tapes in off-site vault service
- 30-year shelf life (exceeds 7-year requirement)
Cost: ~€500/year (vault service + tapes)
Recommendation: Only for large institutional deployments with regulatory requirements; not in current platform scope

Overall recommendation: Start without Layer 4. Add only if:
- State audit explicitly requires it
- Legal counsel advises it
- Insurance policy mandates it
Backup Testing and Validation
Critical Rule: Untested backups are not backups. They are "hopes."
Monthly Restore Test (Automated)
Schedule: 1st of every month, 03:00 UTC
Duration: ~2 hours
Environment: Isolated staging database (not production)
Test Procedure:
1. Select random daily backup from previous month
2. Download from S3
3. Decrypt
4. Decompress
5. Restore to temporary PostgreSQL instance
6. Run validation queries:
- Row count per organization
- Verify RLS policies functional
- Check foreign key integrity
- Sample data spot-checks (10 random appointments)
- Verify encryption keys can decrypt encrypted fields
7. Generate test report
8. Alert on-call if ANY validation fails
9. Destroy temporary instance
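Step 6's validation queries reduce to comparing what the restored instance reports against the counts recorded in the backup's metadata. A sketch of that comparison (the real job would gather live counts via SELECT COUNT(*) over a database connection; names are illustrative):

```python
def validate_restore(expected: dict[str, int], restored: dict[str, int]) -> list[str]:
    """Compare restored row counts against backup metadata.
    Returns human-readable failures; an empty list means the test passed."""
    failures = []
    for table, want in expected.items():
        got = restored.get(table)
        if got is None:
            failures.append(f"{table}: missing from restored database")
        elif got != want:
            failures.append(f"{table}: expected {want} rows, found {got}")
    return failures

# Counts recorded at backup time (see the metadata example later in this doc)
expected = {"appointments": 45000, "patients": 12000, "exercise_logs": 180000}

assert validate_restore(expected, dict(expected)) == []
bad = {"appointments": 44000, "patients": 12000}
assert len(validate_restore(expected, bad)) == 2  # count mismatch + missing table
```

Returning a list of failures rather than a boolean lets step 8 put specific, actionable text into the on-call alert.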
Success Criteria:
- Restore completes without errors
- All row counts match backup metadata
- All sampled data is readable and correct
- Time to restore < 2 hours

Quarterly Disaster Recovery Drill
Schedule: Last Saturday of quarter
Duration: 4 hours
Participants: Engineering team + CTO
Drill Scenarios (rotate each quarter):
Q1: RDS regional outage → restore from Layer 2 to a fresh RDS instance
Q2: S3 bucket compromised → restore from Layer 3 (cross-region replica)
Q3: Complete provider failure → restore to a different PostgreSQL host
(e.g., self-hosted on Hetzner, or a non-AWS cloud) to verify vendor
independence holds
Q4: Ransomware attack → restore from immutable Object-Locked backup
Success Criteria:
- Full production database restored to functional state
- Application can connect and serve requests
- RTO/RPO targets met
- All team members understand procedure
- Runbook updated with lessons learned

Annual Audit Compliance Test
Schedule: Before annual state audit
Duration: 1 day
Purpose: Prove 7-year retention and data integrity
Test Procedure:
1. Select 10 random patients from 5-7 years ago
2. Restore backup from that period (Layer 2 or 3)
3. Extract their exercise logs, appointments, consent forms
4. Verify data is complete and unmodified
5. Generate audit report with:
- Patient names (anonymized for test)
- Service dates
- Exercise/therapy session counts
- Proof of consent signatures
6. Present to auditor (if requested)
Success Criteria:
- All requested historical data retrievable
- Data matches original records (if cross-referenced)
- Restore time < 4 hours
- Data format is human-readable (for auditor review)

Data Integrity and Immutability
Cryptographic Verification
Every backup includes:
{
"backup_id": "2026-02-15-daily-001",
"timestamp": "2026-02-15T02:00:00Z",
"database_size_bytes": 524288000,
"compressed_size_bytes": 52428800,
"sha256_checksum": "a1b2c3d4e5f6...",
"encryption_key_version": 2,
"organization_count": 1000,
"row_counts": {
"appointments": 45000,
"patients": 12000,
"exercise_logs": 180000,
"forms": 30000
}
}

Verification Process:
- Before upload: Calculate SHA-256 checksum
- After upload: Download first 1MB and verify partial checksum
- Monthly test: Full download and checksum verification
- Before restore: Verify checksum matches metadata
Why? Detects:
- Silent data corruption during transfer
- Bitrot in storage media
- Tampering attempts
- Incomplete uploads
Immutability Enforcement
S3 Object Lock (COMPLIANCE Mode):
Configuration:
Mode: COMPLIANCE
Retention: 7 years from creation date
Guarantees:
- Cannot be deleted by anyone (even AWS root account)
- Cannot be modified (append-only)
- Cannot shorten retention period
- Can only be deleted after 7 years expire
Legal Basis:
- HIPAA: 6-year medical record retention
- State: 7-year financial audit window
- GDPR: Allows retention for legal compliance (Art. 17(3))

Ransomware Protection: Even if an attacker:
- Compromises AWS credentials
- Deletes production database
- Deletes RDS automated snapshots and PITR retention
- Attempts to delete S3 backups
Result: Backups survive. Object Lock prevents deletion.
Backup Security
Access Control
Who Can Access Backups:
Production Database (RDS):
- Application Fargate task role (read/write via pgbouncer, RLS-scoped)
- Migration ECS task role (DDL, bypasses pgbouncer with DATABASE_DIRECT_URL)
- Database administrator IAM role (superadmin, used only via SSM Session Manager)
Layer 1 Backups (RDS automated snapshots + PITR):
- Same RDS-account-scoped IAM controls as the live DB
- Manual snapshot creation requires the operations IAM role
Layer 2 Backups (Our S3):
- Automated backup job (write-only IAM role on the backup bucket)
- Database administrator (read-only for restore)
- Security team (read-only for audit)
- Bucket has separate KMS context from RDS — compromised RDS key cannot
decrypt Layer 2
Layer 3 Backups (Cross-region replica):
- Replication service account (write-only)
- CTO + on-call lead (read-only for disaster recovery)
Layer 4 Backups (Offline, optional):
- Physical access: CTO + COO (dual-custody)
Principle: Minimum necessary access, separation of duties, separate trust
domains across layers (compromised credentials at one layer cannot
reach the next).

Encryption Keys
Key Hierarchy:
Application Data Encryption:
- Purpose: Encrypt sensitive fields (phone, API keys)
- Storage: AWS Secrets Manager
- Rotation: Quarterly
Backup Encryption:
- Purpose: Encrypt backup files before S3 upload
- Storage: Separate from application keys (AWS Secrets Manager)
- Rotation: Annually
- Why separate? If app keys compromised, backups remain safe
S3 Server-Side Encryption:
- Purpose: Encryption at rest in S3
- Storage: AWS-managed keys (SSE-S3)
- Rotation: Automatic (AWS handles)

Key Backup: All encryption keys backed up to:
- Password manager (1Password/Bitwarden) - shared vault, restricted access
- Printed copy in physical safe (disaster recovery)
State Audit Compliance
What Auditors Will Request
Based on typical state insurance audits:
| Request | How We Provide It | Source |
|---|---|---|
| "Prove services were delivered for Patient X in 2023" | Export exercise logs, appointments, signed forms | Layer 2/3 backup (historical) |
| "Show all payments received vs services delivered" | Cross-reference appointments with invoices | Audit log + backup |
| "Prove this data hasn't been tampered with" | SHA-256 checksums, immutable S3 Object Lock | Backup metadata |
| "How do you prevent data loss?" | This document + test reports | Documentation |
| "Show me a backup from 5 years ago" | Restore from Layer 2 (Glacier Deep Archive) | S3 lifecycle retrieval |
| "What if your cloud provider fails?" | Layer 3 cross-region backup | Alternative provider restore |
| "Prove patients consented to treatment" | Signed consent forms with timestamps | Forms backup (status='signed') |
Audit-Ready Documentation
Maintain in a physical binder (for in-person audits):
- This backup strategy document (printed)
- Monthly backup test reports (last 12 months)
- Quarterly DR drill reports (last 4 quarters)
- Backup retention policy (signed by CTO)
- Data processing agreement with AWS (DPA + signed BAA)
- Sub-processor list (Cloudflare, Clerk, Daily.co, Anthropic — see external-providers.md)
- Encryption key rotation logs (dates only, not keys)
- Incident response plan (see monitoring.md)
Operational Runbooks
Runbook 1: Restore from RDS PITR (Minor Issues)
When to Use: Accidental DELETE/UPDATE, recent data corruption, anything inside the 7-day retention window.
Steps:
1. Identify the exact timestamp of the corruption (check `audit_log` for the offending action; the audit row carries `created_at` and the actor)
2. Open the AWS console → RDS → the production cluster
3. Action → "Restore to point in time"
4. Select the restore time (up to second precision within the retention window)
5. Critical: restore into a NEW DB instance, not by overwriting the live one. New instance name pattern: `restartix-prod-pitr-YYYYMMDDhhmm`
6. Wait for the restore (typically 15–30 minutes for a db.t4g.medium-sized DB)
7. Connect via SSM Session Manager port forwarding to the new instance and validate: row counts, the specific data that was lost, RLS policy presence (`\d+ patients` etc.)
8. If correct, choose a recovery path:
   - Option A (preferred for partial recovery): export the recovered rows from the new instance and `INSERT … ON CONFLICT` them back into the live primary. No application downtime.
   - Option B (for catastrophic recovery): point the application at the new instance by updating the Secrets Manager `DATABASE_URL` and forcing an ECS service restart. The old instance becomes an evidence artifact.
9. After verifying, delete the temporary instance OR snapshot it for later evidence
10. Document the incident in the audit log via the operations IAM role (action `database.pitr_restore`, with timestamps and decision rationale)
RTO: 30–60 minutes (most of which is RDS spin-up time, not user-facing)
RPO: 0 seconds (PITR is per-second within the 7-day retention window)
Runbook 2: Restore from Daily Backup (Database Corruption)
When to Use: RDS automated backups unavailable, major corruption, data older than the 7-day PITR window, AWS-account-level compromise.
Steps:
1. Identify the target restore date
2. Download the backup from S3: `aws s3 cp s3://restartix-backups-primary/daily/YYYY-MM-DD-core-full.pgdump.gz.enc ./`
3. Verify the checksum: `sha256sum YYYY-MM-DD-core-full.pgdump.gz.enc` (compare with the metadata file)
4. Decrypt: `openssl enc -d -aes-256-cbc -in backup.enc -out backup.pgdump.gz -k $BACKUP_ENCRYPTION_KEY`
5. Decompress: `gunzip backup.pgdump.gz`
6. Provision a new PostgreSQL instance (any PostgreSQL 17 — fresh RDS in another region, RDS in another AWS account, self-hosted on Hetzner / a different cloud — vendor independence is the point of Layer 2)
7. Restore: `pg_restore -d restartix_platform -v backup.pgdump`
8. Verify: row counts per organization, sample data spot-checks, application can connect
9. Switch the application connection string to the restored instance
10. Monitor for issues (check logs, error rates)
11. Document the incident
RTO: 1–2 hours
RPO: < 24 hours
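Steps 2–7 above are easier to execute under pressure when assembled into a single reviewable command list first. A sketch (it substitutes the dated artifact name where the runbook's generic backup.enc appears; verify the bucket name and paths against your environment before running anything):

```python
def runbook2_commands(day: str, bucket: str = "restartix-backups-primary") -> list[str]:
    """Shell commands for Runbook 2 steps 2-7; `day` is YYYY-MM-DD."""
    enc = f"{day}-core-full.pgdump.gz.enc"
    return [
        f"aws s3 cp s3://{bucket}/daily/{enc} ./",            # step 2: download
        f"sha256sum {enc}",                                   # step 3: compare with metadata
        f"openssl enc -d -aes-256-cbc -in {enc} "
        "-out backup.pgdump.gz -k $BACKUP_ENCRYPTION_KEY",    # step 4: decrypt
        "gunzip backup.pgdump.gz",                            # step 5: decompress
        "pg_restore -d restartix_platform -v backup.pgdump",  # step 7: restore
    ]

cmds = runbook2_commands("2026-02-14")
assert cmds[0] == (
    "aws s3 cp s3://restartix-backups-primary/daily/"
    "2026-02-14-core-full.pgdump.gz.enc ./"
)
```

Generating rather than typing the commands removes the most common restore-day failure mode: a mistyped artifact name pointing at the wrong day's data.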
Runbook 3: Restore from Cross-Region Backup (Regional Disaster)
When to Use: AWS region failure, primary S3 bucket unavailable, account-scoped credential compromise.
Steps:
1. Access the cross-region backup bucket: `aws s3 ls s3://restartix-backups-replica/weekly/`
2. Download the most recent weekly backup
3. Follow Runbook 2 steps 3–11, provisioning the new instance in a DIFFERENT region
4. Update DNS / load balancer to point to the new region
RTO: 2–4 hours
RPO: < 7 days (weekly backup)
Runbook 4: Restore from Offline Archive (Catastrophic Scenario)
When to Use: All cloud infrastructure compromised/unavailable
Steps:
- Retrieve offline backup media from physical safe (requires dual-custody)
- Connect encrypted drive to secure workstation (air-gapped)
- Decrypt and extract backup
- Provision PostgreSQL instance (on-premises or different cloud provider)
- Follow Runbook 2 steps 6-11
- Manually configure application deployment to new infrastructure
RTO: 8–24 hours
RPO: < 90 days (quarterly backup)
Monitoring and Alerting
Backup Health Metrics
| Metric | Alert Threshold | Severity | Action |
|---|---|---|---|
| Daily backup failed | 1 failure | Critical | Page on-call, investigate immediately |
| Backup size anomaly | ±50% from expected | High | Verify data integrity, check for corruption |
| Backup upload incomplete | Any incomplete | Critical | Retry upload, verify network |
| Checksum mismatch | Any mismatch | Critical | Re-run backup, investigate corruption |
| S3 bucket replication lag | > 24 hours | Medium | Check replication rules, AWS status |
| Monthly restore test failed | Any failure | High | Debug restore procedure, fix issues |
| Backup older than 25 hours | No new backup in 25h | High | Check backup ECS task logs, RDS connectivity |
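The two stateful thresholds in the table (backup age and size anomaly) are cheap to evaluate inside the alerting job. A sketch using the thresholds from the table (function and alert strings are illustrative):

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=25)   # "no new backup in 25h"
SIZE_TOLERANCE = 0.50           # ±50% from expected size

def backup_alerts(last_backup_at: datetime, now: datetime,
                  size_bytes: int, expected_bytes: int) -> list[str]:
    """Evaluate the age and size-anomaly rows of the metric table above."""
    alerts = []
    if now - last_backup_at > MAX_AGE:
        alerts.append("HIGH: backup older than 25 hours")
    if abs(size_bytes - expected_bytes) > SIZE_TOLERANCE * expected_bytes:
        alerts.append("HIGH: backup size anomaly (±50% from expected)")
    return alerts

ok = backup_alerts(datetime(2026, 2, 15, 2), datetime(2026, 2, 15, 12),
                   size_bytes=50_000_000_000, expected_bytes=48_000_000_000)
assert ok == []  # 10h old, size within tolerance
```

The 25-hour threshold deliberately exceeds the 24-hour cadence by one hour, so a single slow run does not page anyone while a genuinely missed run does.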
Dashboards
Grafana / Datadog:
- Backup job success rate (7-day trend)
- Backup file sizes (detect growth anomalies)
- Restore test results (monthly pass/fail)
- S3 storage costs (budget monitoring)
- Time to complete backup (performance trend)
Cost Summary
Backup-specific cost (excludes the live database, which is itemized in aws-infrastructure.md → Cost: production day 1):
| Layer | Provider | Monthly Cost | Annual Cost | Purpose |
|---|---|---|---|---|
| 1: RDS PITR + snapshots | AWS RDS | Included in RDS | Included | Fast PITR (7 days) |
| 2: Daily logical | AWS S3 | ~$160 | ~$1,920 | Primary long-term safety |
| 3: Cross-region (EU) | AWS S3 | ~$60 | ~$720 | Geographic redundancy |
| 4: Offline (optional) | External SSD | ~$17 | ~$200 | Audit compliance (if required) |
| Backup total | | ~$220–237 | ~$2,640–2,840 | Full DR posture |
Cost assumes a 500 GB database with ~1 GB daily growth and 90% pg_dump compression — the same scale model as the original target of "1000 orgs / 500 GB DB" used elsewhere in this document. Actual cost scales linearly with DB size.
Per-Organization Cost: ~$0.22–0.24/month for comprehensive backup protection (at the 1000-org reference scale).
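The per-tier arithmetic behind these figures can be reproduced directly from the Layer 2 cost model. A sketch at the AWS list prices used elsewhere in this document ($0.023 / $0.004 / $0.00099 per GB-month); note that the ~$160/month figure corresponds to the first year's hot + warm tiers, and the Deep Archive tier adds roughly another $100/month once all seven years of retention have accumulated:

```python
DAILY_GB = 50           # 500 GB database at 90% pg_dump compression
STANDARD = 0.023        # $/GB-month, S3 Standard (days 0-90)
GLACIER_IR = 0.004      # $/GB-month, Glacier Instant Retrieval (rest of year 1)
DEEP_ARCHIVE = 0.00099  # $/GB-month, Glacier Deep Archive (years 2-7)

hot = 90 * DAILY_GB * STANDARD             # 4.5 TB hot         -> ~$103/mo
warm = 275 * DAILY_GB * GLACIER_IR         # 13.75 TB warm      -> ~$55/mo
cold = 6 * 365 * DAILY_GB * DEEP_ARCHIVE   # ~110 TB cold at full retention

year1_monthly = hot + warm                 # ~$158 -> the document's ~$160
steady_state_monthly = hot + warm + cold   # ~$267 at full 7-year retention
```

Everything scales linearly with the compressed artifact size, so re-running this with your actual `DAILY_GB` gives the budget number to watch.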
ROI Calculation:
- Cost of data loss: Inability to claim insurance reimbursements + fraud liability + legal costs = Millions of euros
- Cost of backup: €6,000-9,000/year
- ROI: Infinite (prevents catastrophic loss)
Implementation Status
The backup architecture is implemented in two waves: what closes with Foundation 1E.3 (AWS staging deployment) and what closes after the launch.
Closes with 1E.3 (foundation gate, before any production data)
The 1E.3 scope validates that the substrate works — Layer 1 active, Layer 2 IaC ships and the runbook is exercised end-to-end at least once. Whether the daily Layer 2 cron fires daily in staging is a separate knob (see "Staging knobs" below); 1E.3 doesn't require the cron to keep running.
Layer 1 (RDS / Aurora PITR + snapshots):
- [ ] RDS Multi-AZ in `eu-central-1` with 7-day automated backup retention + PITR
- [ ] Aurora Serverless v2 in staging with 1-day backup retention
- [ ] CloudWatch alarms for `BackupRetentionPeriodStorageUsed`, replica lag (Multi-AZ), failed snapshots
Layer 2 (daily pg_dump to S3) — IaC + one validated end-to-end test:
- [ ] S3 backup bucket (`restartix-backups-primary-{env}`) with versioning enabled
- [ ] S3 Object Lock in COMPLIANCE mode, 7-year retention on every object
- [ ] S3 lifecycle policies (Standard → Glacier Instant Retrieval at 90 days → Glacier Deep Archive at 365 days)
- [ ] `BACKUP_ENCRYPTION_KEY` provisioned in Secrets Manager, separate from application column-encryption keys
- [ ] Daily `pg_dump` ECS task definition + EventBridge Scheduler rule provisioned (production schedule on; staging schedule off by default — see knobs)
- [ ] IAM role for the backup task (S3 write-only on the backup bucket, RDS read on the database)
- [ ] CloudWatch alarms for backup-failure, backup-too-old, checksum-mismatch (wired up; signal threshold tuning happens after the cron starts firing in production)
- [ ] One manual end-to-end run against staging passes: pg_dump → gzip → encrypt → S3 upload → checksum verify → metadata recorded → restore-from-this-artifact runbook (Runbook 2) restores cleanly to a temporary RDS instance
The Terraform module that provisions this is the same module production reuses. Production launch should not be the first terraform apply of this code.
Staging knobs (turn on as needed before production launch)
- Daily Layer 2 cron in staging. Off by default after 1E.3 (no real data, no cost benefit, just generates noise). Enable when a) tuning the backup-failure / backup-too-old alarm thresholds, or b) running a production-launch dress rehearsal with migrated legacy data. EventBridge schedule is a Terraform variable — no code change needed to flip.
Closes before production launch (operational gate, separate from F11)
- [ ] Production-launch dress rehearsal: backup runs against the migrated legacy data, restore is exercised, all alarms are calibrated against real signals
- [ ] On-call understands Runbook 2 (restore-from-daily-backup) end-to-end
Closes after launch (within first quarter)
- [ ] Cross-region replication to a second EU region (Layer 3, weekly cadence)
- [ ] Monthly automated restore test (Layer 2 to ephemeral DB, validation queries, alerting)
- [ ] Quarterly DR drill (rotating scenarios per the matrix above)
- [ ] Audit-compliance binder populated with the first quarter's reports
Ongoing
- Monthly automated restore tests
- Quarterly DR drills (team exercise)
- Annual audit preparation
- Review backup strategy yearly; update for any infra changes
Related Documentation
- Database Overview - All tables and multi-tenant architecture
- RLS Policies - Data isolation and security
- Encryption - Data protection at rest and in transit
- GDPR Compliance - Data retention and erasure
- Monitoring - Alerting and incident response
- Audit Log - Audit trail for compliance
Appendix: Fraud Prevention Evidence Requirements
What Data Proves Services Were Delivered?
For state insurance audits, the following data constitutes proof of service:
| Evidence Type | Data Source | Retention | Why It Matters |
|---|---|---|---|
| Appointment attendance | appointments.status = 'done' | 7 years | Proves patient attended session |
| Exercise/therapy logs | (Future feature - telemetry service) | 7 years | Proves exercises were performed |
| Video call metadata | appointments.daily_room_name + Daily.co logs | 7 years | Proves real-time interaction occurred |
| Specialist notes | appointment_documents (reports) | 7 years | Medical documentation of session |
| Patient consent | forms.status = 'signed' | 7 years | Proves patient authorized treatment |
| Prescription issuance | appointment_documents (prescriptions) | 7 years | Proves medical care provided |
| Payment records | (External billing system) | 7 years | Cross-reference with services |
Critical: Without backups, you cannot produce this evidence. Insurance claims can be retroactively denied up to 7 years later.
Example Audit Query
Auditor requests: "Prove services delivered for Patient ID 12345 in July 2023"
Our Response (from backup):
- Restore July 2023 backup
- Query:

  ```sql
  SELECT a.started_at,
         a.ended_at,
         a.status,
         s.name AS specialist_name,
         (SELECT COUNT(*) FROM forms
           WHERE appointment_id = a.id AND status = 'signed') AS signed_forms,
         (SELECT COUNT(*) FROM appointment_documents
           WHERE appointment_id = a.id) AS documents_generated
  FROM appointments a
  JOIN specialists s ON a.specialist_id = s.id
  WHERE a.patient_id = 12345
    AND a.started_at BETWEEN '2023-07-01' AND '2023-07-31'
    AND a.status = 'done';
  ```

- Export signed consent forms (PDF)
- Export prescription/report documents (PDF)
- Provide to auditor with checksums (proof of authenticity)
Result: Audit passed, no fraud accusations, insurance reimbursements validated.
Questions for Legal/Compliance Team
Before finalizing backup strategy, confirm with legal counsel:
- Retention period: Is 7 years sufficient, or does your state require longer?
- Offline backup: Does state audit explicitly require physical/offline backups?
- Geographic requirements: Must backups be stored within EU? Or can cross-region be US?
- Data sovereignty: Are there restrictions on cloud provider jurisdiction?
- Encryption standards: Are AES-256 and current key management procedures compliant?
- Audit frequency: How often should we expect state audits? (Affects test schedule)
- Evidence format: Do auditors require specific export formats (PDF, CSV, etc.)?
Action: Schedule meeting with legal team to review this document and confirm compliance requirements.
Document Version History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-02-15 | Engineering Team | Initial backup strategy for state-funded insurance compliance |
| 2.0 | 2026-05-07 | Engineering Team | Reframed Layer 0 / Layer 1 from Neon to AWS RDS + Aurora Serverless v2; cross-region target moved to a second EU region for GDPR compliance; runbooks updated for RDS PITR; implementation timeline aligned with Foundation 1E.3 |
Next Steps: Implement during 1E.3 (Foundation gate); first DR drill within the quarter after launch.