
Backup and Disaster Recovery Strategy

Executive Summary

RestartiX processes state-funded insurance claims where exercise and therapy data serves as proof of service delivery. Loss of this data would:

  • Prevent reimbursement claims from insurance providers
  • Eliminate fraud prevention evidence
  • Expose the organization to legal liability
  • Violate HIPAA 6-year medical record retention requirements
  • Undermine audit compliance for state funding

Therefore: Data backup is not optional. This document defines a defense-in-depth backup strategy with multiple independent layers.


Data Retention Mandates

| Data Category | Retention Period | Legal Basis | Consequence of Loss |
|---|---|---|---|
| Exercise/therapy logs | 7 years | State insurance fraud prevention | Cannot prove services delivered, risk fraud accusations |
| Appointments | 6 years | HIPAA medical records | Cannot defend malpractice claims |
| Signed consent forms | 6 years | GDPR Art. 7 + HIPAA | Cannot prove patient consent |
| Prescriptions/reports | 6 years | HIPAA | Medical-legal liability |
| Audit logs | 6 years | HIPAA §164.312(b) | Cannot prove security compliance |
| Insurance claim metadata | 7 years | State financial audit requirements | Cannot reconcile payments |

Why 7 Years?

State financial audits can request records up to 7 years retroactively. If you cannot produce exercise logs proving services were delivered, you may be required to refund state payments and face fraud investigations.


Backup Architecture: The 3-2-1-1 Rule

We implement an enhanced version of the industry-standard 3-2-1 rule, adding a fourth layer for state compliance:

┌─────────────────────────────────────────────────────────────────┐
│  LAYER 0: LIVE PRODUCTION DATABASE                              │
│  - Production: AWS RDS Postgres 17, Multi-AZ                    │
│  - Staging:    Aurora Serverless v2, single-AZ                  │
│  - RLS-enforced multi-tenant isolation                          │
│  - Real-time data, constantly changing                          │
│  - RPO: 0 seconds (no data loss tolerance during operation)     │
│  - See [aws-infrastructure.md](/architecture/aws-infrastructure)│
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  LAYER 1: AWS-MANAGED BACKUPS (VENDOR-CONTROLLED)               │
│  - Continuous WAL streaming → point-in-time recovery (PITR)     │
│  - Production retention: 7 days (extensible to 35)              │
│  - Staging retention: 1 day (staging-grade)                     │
│  - Daily automated snapshots                                    │
│  - Manual snapshots before risky deploys                        │
│  - Cross-AZ standby (production) auto-promotes on failure       │
│  - Fast restore (minutes for PITR; hours for snapshot copy)     │
│                                                                  │
│  ⚠️  LIMITATION: AWS-account-scoped. Account compromise or      │
│      catastrophic AWS failure makes this layer unrecoverable.   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  LAYER 2: DAILY LOGICAL BACKUPS (OUR CONTROL - PRIMARY SAFETY) │
│  - Daily pg_dump to S3 (separate bucket from Layer 0)          │
│  - Encrypted with separate envelope key (BACKUP_ENCRYPTION_KEY) │
│  - Versioned and immutable (S3 Object Lock COMPLIANCE mode)    │
│  - Retention: 7 years (lifecycled to Glacier Deep Archive)     │
│  - Format: Custom PostgreSQL dump (compressed)                  │
│                                                                  │
│  ✅ GUARANTEES:                                                 │
│     - Restorable to any PostgreSQL 17 instance                  │
│     - Survives RDS-account-level compromise (separate IAM)      │
│     - Protected from ransomware (immutable storage)             │
│     - Meets state audit requirements                            │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  LAYER 3: WEEKLY CROSS-REGION REPLICATION (GEOGRAPHIC SAFETY)  │
│  - Weekly copy of Layer 2 backups to a different EU region     │
│  - Source: eu-central-1 (Frankfurt) → Replica: eu-west-1       │
│    (Ireland) or eu-west-3 (Paris)                              │
│  - GDPR: replication target stays inside the EU                │
│  - Protection against regional disasters                        │
│  - Same retention: 7 years                                      │
│  - Same immutability guarantees                                 │
│                                                                  │
│  ✅ GUARANTEES:                                                 │
│     - Survives entire AWS region failure                        │
│     - Survives natural disasters (floods, fires)                │
│     - Keeps data inside the EU while surviving regional events │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  LAYER 4: QUARTERLY OFFLINE ARCHIVE (COLD STORAGE - OPTIONAL)  │
│  - Quarterly snapshot exported to air-gapped storage            │
│  - Stored offline (disconnected from network)                   │
│  - Physical media or encrypted external drive                   │
│  - Retention: 7 years                                           │
│                                                                  │
│  ✅ GUARANTEES:                                                 │
│     - Survives complete cloud infrastructure compromise         │
│     - Ultimate protection against ransomware                    │
│     - Physical possession for legal/audit purposes              │
│                                                                  │
│  ⚠️  TRADE-OFF: Manual process, slower restore time            │
│  📋 RECOMMENDATION: Only if state audits explicitly require     │
└─────────────────────────────────────────────────────────────────┘

Disaster Recovery Scenarios

Scenario Matrix

| Disaster | Layer 1 (RDS PITR) | Layer 2 (Our S3) | Layer 3 (Cross-Region) | Layer 4 (Offline) |
|---|---|---|---|---|
| Accidental DELETE query | ✅ Restore via PITR (5 min) | ✅ Restore from yesterday (30 min) | ✅ Restore from last week (1 hour) | ✅ Restore from last quarter (4 hours) |
| Ransomware encrypts database | ❌ Encrypted | ✅ Immutable backup survives | ✅ Geographic copy survives | ✅ Offline copy survives |
| RDS AZ outage | ✅ Multi-AZ auto-failover (~60s) | ✅ Restore to new instance (1 hour) | ✅ Restore from replica (1 hour) | ✅ Restore from offline (4 hours) |
| AWS account compromised | ❌ All AWS resources at risk | ✅ Separate IAM, survives | ✅ Separate region scope, survives | ✅ Full restore to any Postgres |
| AWS S3 regional failure | ✅ RDS still operational | ❌ Primary backups unavailable | ✅ Cross-region copy available | ✅ Offline copy available |
| Developer accidentally drops table | ✅ PITR restore (10 min) | ✅ Restore specific table (20 min) | ✅ Restore specific table (30 min) | ✅ Restore specific table (2 hours) |
| State audit requests 5-year-old data | ❌ Outside retention window | ✅ Retrieve from S3 archive | ✅ Retrieve from cross-region | ✅ Retrieve from offline archive |
| Hacker deletes RDS snapshots | ❌ AWS-scoped backups compromised | ✅ Separate credentials, survives | ✅ Separate credentials, survives | ✅ Air-gapped, survives |
| Complete internet/cloud collapse | ❌ Inaccessible | ❌ Inaccessible | ❌ Inaccessible | ✅ Physical possession |

Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO)

| Scenario | RTO (Max Downtime) | RPO (Max Data Loss) | Recovery Source |
|---|---|---|---|
| AZ failure (production) | ~60 seconds | 0 seconds | RDS Multi-AZ auto-failover |
| Minor data corruption | 15 minutes | 0 seconds | RDS PITR |
| Table accidentally dropped | 30 minutes | < 1 hour | RDS PITR or daily backup |
| Database-wide corruption | 2 hours | < 24 hours | Daily backup (Layer 2) |
| Regional disaster | 4 hours | < 24 hours | Cross-region backup (Layer 3) |
| Complete provider failure | 8 hours | < 7 days | Weekly cross-region + daily backups |
| Catastrophic global event | 24 hours | < 90 days | Offline archive (Layer 4) |

Implementation Details

Layer 1: AWS-Managed Backups (RDS + Aurora)

What RDS provides (production):

  • Continuous backup via Write-Ahead Log (WAL) streaming to S3 (AWS-internal, separate from our Layer 2 bucket)
  • Point-in-time recovery (PITR) to any second within retention window
  • Daily automated snapshots, retained for the same window
  • Manual snapshots — taken before risky deploys, retained indefinitely until explicitly deleted (see the CLI sketch after this list)
  • Multi-AZ synchronous standby — auto-promoted on primary failure (~60s, DNS endpoint stable)
  • Backups taken from the standby — zero performance impact on the primary
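
Since the pre-deploy manual snapshot is an operator action, a minimal CLI sketch follows; the instance and snapshot identifiers are illustrative assumptions (the real names come from Terraform outputs):

```bash
# Take a manual snapshot before a risky migration; manual snapshots are
# retained until explicitly deleted. Identifiers here are illustrative.
SNAPSHOT_ID="restartix-prod-pre-deploy-$(date -u +%Y%m%d%H%M)"

aws rds create-db-snapshot \
  --db-instance-identifier restartix-prod \
  --db-snapshot-identifier "$SNAPSHOT_ID"

# Block the deploy until the snapshot is actually available.
aws rds wait db-snapshot-available --db-snapshot-identifier "$SNAPSHOT_ID"
```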

What Aurora Serverless v2 provides (staging):

  • Same continuous-backup model on Aurora's storage layer
  • PITR within the configured retention window
  • 1-day retention is sufficient for a staging environment

Configuration:

```yaml
RDS (production):
  Engine: PostgreSQL 17
  Instance: db.t4g.medium, Multi-AZ
  Storage: 50 GB gp3 (auto-scaling enabled to 200 GB)
  Backup retention: 7 days (extensible to 35)
  Continuous backup: Enabled (PITR to any second within window)
  Snapshot replication: Manual snapshots → Layer 3 cross-region copy
  Encryption at rest: AWS-managed KMS
  Force SSL: Enabled (rds.force_ssl=1)

Aurora Serverless v2 (staging):
  Engine: aurora-postgresql 17
  Capacity: 0.5–2 ACU, scale-to-zero
  Backup retention: 1 day
  Continuous backup: Enabled
  Encryption at rest: AWS-managed KMS

Cost (production):
  See aws-infrastructure.md → Cost: production day 1.
  RDS instance + storage + backups = ~$134/mo at db.t4g.medium Multi-AZ.
  Layer 1 backup-specific cost is included in the RDS line; AWS does not bill
  backup storage separately as long as it stays within the provisioned storage size.
```

Our responsibility:

  • Monitor the AWS Health Dashboard for RDS / Aurora regional events
  • Test PITR restore monthly into a temporary ephemeral DB (see Testing section; a CLI sketch follows this list)
  • Document the restore runbook (Runbook 1 below)
  • Take a manual snapshot before any risky migration or schema change
  • Track backup-window timing (production maintenance window set to a low-traffic UTC slot)
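
A sketch of the monthly PITR test restore from the list above; the identifiers and timestamp are illustrative assumptions:

```bash
# Restore a temporary instance to a point in time, validate, then delete it.
# Instance identifiers and the restore time are illustrative.
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier restartix-prod \
  --target-db-instance-identifier restartix-prod-pitr-test \
  --restore-time 2026-02-15T02:00:00Z \
  --no-multi-az \
  --db-instance-class db.t4g.medium

aws rds wait db-instance-available \
  --db-instance-identifier restartix-prod-pitr-test

# ...run the validation queries (see Testing section), then clean up:
aws rds delete-db-instance \
  --db-instance-identifier restartix-prod-pitr-test \
  --skip-final-snapshot
```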

Layer 2: Daily Logical Backups (CRITICAL LAYER)

Why This Layer is Critical:

  • Vendor independence: Can restore to any PostgreSQL provider
  • Fraud defense: Immutable proof of historical data
  • Audit compliance: Long-term retention (7 years)
  • Ransomware protection: Write-once, read-many storage

Backup Schedule:

```yaml
Frequency: Daily at 02:00 UTC (low-traffic window)
Method: pg_dump with custom format
Compression: gzip level 9
Encryption: AES-256 (separate key from application encryption)
```

Storage Strategy:

```yaml
Provider: AWS S3 (separate bucket from RDS automated backups; separate IAM,
          separate KMS context, different failure domain at the credentials layer)

Bucket Configuration:
  - Versioning: Enabled
  - Object Lock: COMPLIANCE mode (cannot delete for 7 years)
  - Lifecycle Policy:
      * Days 0-90:   S3 Standard (hot, fast retrieval)
      * Days 91-730: S3 Glacier Instant Retrieval (warm)
      * Days 731+:   S3 Glacier Deep Archive (cold, 12hr retrieval)
  - Encryption: AES-256 (server-side, AWS-managed keys)
  - Replication: Enable to Layer 3 (cross-region)

Naming Convention:
  s3://restartix-backups-primary/
    ├── daily/
    │   ├── 2026-02-15-core-full.pgdump.gz.enc
    │   ├── 2026-02-14-core-full.pgdump.gz.enc
    │   └── ...
    ├── weekly/  (Sunday snapshots, kept separately)
    │   ├── 2026-02-09-core-full.pgdump.gz.enc
    │   └── ...
    └── monthly/ (First of month, kept separately)
        ├── 2026-02-01-core-full.pgdump.gz.enc
        └── ...
```
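
The Object Lock and lifecycle settings above map onto standard S3 API calls; a sketch, assuming the bucket name from the naming convention (note that Object Lock can only be enabled at bucket creation, which also enables versioning):

```bash
# Create the backup bucket with Object Lock from day one. Bucket name assumed.
aws s3api create-bucket \
  --bucket restartix-backups-primary \
  --region eu-central-1 \
  --create-bucket-configuration LocationConstraint=eu-central-1 \
  --object-lock-enabled-for-bucket

# Default retention: COMPLIANCE mode, 7 years, applied to every new object.
aws s3api put-object-lock-configuration \
  --bucket restartix-backups-primary \
  --object-lock-configuration \
  '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Years":7}}}'

# Lifecycle: Standard → Glacier Instant Retrieval at day 91 → Deep Archive at day 731.
aws s3api put-bucket-lifecycle-configuration \
  --bucket restartix-backups-primary \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "backup-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [
        {"Days": 91,  "StorageClass": "GLACIER_IR"},
        {"Days": 731, "StorageClass": "DEEP_ARCHIVE"}
      ]
    }]
  }'
```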

Backup Process (Automated):

  1. Export: pg_dump --format=custom, run during the low-traffic window against the primary (a Multi-AZ instance standby is not readable; switch to a read replica when Phase 2 adds them)
  2. Compress: gzip -9 (90% compression ratio typical)
  3. Encrypt: AES-256 with backup-specific envelope key (BACKUP_ENCRYPTION_KEY in Secrets Manager, distinct from application column-encryption keys)
  4. Upload: S3 with metadata (DB size, row counts per org, SHA-256 checksum)
  5. Verify: Download and re-checksum the uploaded object
  6. Alert: Notify on backup failure, incomplete upload, or suspiciously small artifact

The job runs as a scheduled ECS task (EventBridge Scheduler → ECS RunTask), uses IAM credentials scoped to S3 write-only on the backup bucket and RDS read on the database.
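
A minimal sketch of that scheduled task's core pipeline. The host, bucket, and secret names are assumptions; the real values live in Terraform and Secrets Manager, and the real job also records per-org row counts in the metadata:

```bash
#!/usr/bin/env bash
# Nightly logical backup: dump → compress → encrypt → upload → verify.
# All names below (connection string, bucket, secret id) are illustrative.
set -euo pipefail

STAMP="$(date -u +%Y-%m-%d)"
ARTIFACT="${STAMP}-core-full.pgdump.gz.enc"

# Key comes from Secrets Manager at runtime, never from the task definition.
export BACKUP_ENCRYPTION_KEY="$(aws secretsmanager get-secret-value \
  --secret-id BACKUP_ENCRYPTION_KEY --query SecretString --output text)"

# Steps 1-3: dump (custom format), compress, encrypt in one stream —
# no plaintext artifact ever touches disk.
pg_dump --format=custom "$DATABASE_URL" \
  | gzip -9 \
  | openssl enc -aes-256-cbc -pbkdf2 -pass env:BACKUP_ENCRYPTION_KEY \
  > "$ARTIFACT"

# Step 4: upload with the checksum recorded as object metadata.
CHECKSUM="$(sha256sum "$ARTIFACT" | cut -d' ' -f1)"
aws s3 cp "$ARTIFACT" "s3://restartix-backups-primary/daily/$ARTIFACT" \
  --metadata "sha256=$CHECKSUM"

# Step 5: re-download and compare checksums; a mismatch fails the task,
# which trips the backup-failure alarm (step 6).
aws s3 cp "s3://restartix-backups-primary/daily/$ARTIFACT" "/tmp/$ARTIFACT"
echo "$CHECKSUM  /tmp/$ARTIFACT" | sha256sum --check
```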

Estimated Cost:

```yaml
Database Size: 500 GB
Daily Growth: 1 GB
Compression Ratio: 90% (compressed: 50 GB per backup)

Storage Costs (7 years = 2,555 days):
  Year 1 (365 days):
    - Daily backups: 365 × 50 GB = 18.25 TB
    - 90 days × $0.023/GB = $103/month (S3 Standard)
    - 275 days × $0.004/GB = $55/month (Glacier Instant)

  Years 2-7 (all in Glacier Deep Archive):
    - Total: ~100 TB
    - Cost: 100,000 GB × $0.00099/GB = $99/month

Total: ~$160/month (scales with DB size)
Per-org cost: $0.16/month (negligible)
```

Layer 3: Weekly Cross-Region Replication

Purpose:

  • Geographic redundancy (survive regional disasters)
  • Compliance with state requirements for off-site backups
  • Defense against geopolitical/infrastructure risks

Configuration:

```yaml
Source: eu-central-1 (Frankfurt) - Primary backup bucket
Destination: eu-west-1 (Ireland) - Cross-region replica
              [or eu-west-3 (Paris) — both are EU regions and acceptable]

GDPR note: replication target stays inside the EU. US regions are not used
            because GDPR Day-1 compliance requires patient data to remain
            within the EU. See decisions.md → Why clinic is controller, platform
            is processor.

Replication Rule:
  - Frequency: Weekly (Sunday after daily backup completes)
  - What to replicate: Weekly and monthly backups only (not all daily)
  - Storage class: S3 Glacier Instant Retrieval (cheaper, same compliance)
  - Retention: 7 years (same as primary)
  - Encryption: Replicate with same AES-256

Estimated Cost:
  - Storage: ~$50/month (subset of daily backups)
  - Data transfer: ~$10/month (cross-region replication, intra-EU)
  Total: ~$60/month
```
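
One way to get the weekly-and-monthly-only behavior is a prefix-filtered S3 replication rule: since new objects land under weekly/ and monthly/ at most once a week, live replication naturally yields the weekly cadence. A sketch, with bucket names and the IAM role ARN as illustrative assumptions (both buckets must have versioning enabled, which Object Lock already guarantees):

```bash
# Replicate only the weekly/ and monthly/ prefixes to the EU replica bucket.
# Bucket names and the role ARN are illustrative assumptions.
aws s3api put-bucket-replication \
  --bucket restartix-backups-primary \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/backup-replication",
    "Rules": [
      {
        "ID": "weekly-to-eu-west-1",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {"Prefix": "weekly/"},
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": {
          "Bucket": "arn:aws:s3:::restartix-backups-replica",
          "StorageClass": "GLACIER_IR"
        }
      },
      {
        "ID": "monthly-to-eu-west-1",
        "Status": "Enabled",
        "Priority": 2,
        "Filter": {"Prefix": "monthly/"},
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": {
          "Bucket": "arn:aws:s3:::restartix-backups-replica",
          "StorageClass": "GLACIER_IR"
        }
      }
    ]
  }'
```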

Layer 4: Quarterly Offline Archive (OPTIONAL)

When to Implement:

  • State auditors explicitly request offline backups
  • High-risk contracts with zero-tolerance data loss clauses
  • Legal requirement for physical evidence custody
  • Enhanced ransomware protection (air-gapped)

Implementation Options:

Option A: Encrypted External Drives

```yaml
Hardware:
  - 2TB enterprise-grade external SSD
  - Hardware encryption (FIPS 140-2 certified)

Process:
  1. Quarterly: Download latest monthly backup from S3
  2. Verify checksum
  3. Copy to encrypted drive
  4. Store in physical safe (fireproof, waterproof)
  5. Document in audit log (who, when, where)

Cost: ~€200/year (drive replacement every 3 years)
```

Option B: Tape Backup (Long-term archive)

```yaml
Hardware:
  - LTO-9 tape drive (~€3,000)
  - LTO-9 tapes (~€100/tape, 18TB capacity)

Process:
  - Quarterly: Write backup to tape
  - Store tapes in off-site vault service
  - 30-year shelf life (exceeds 7-year requirement)

Cost: ~€500/year (vault service + tapes)
```
Recommendation: Only for large institutional deployments with regulatory requirements; not in current platform scope

Recommendation: Start without Layer 4. Add only if:

  • State audit explicitly requires it
  • Legal counsel advises it
  • Insurance policy mandates it

Backup Testing and Validation

Critical Rule: Untested backups are not backups. They are "hopes."

Monthly Restore Test (Automated)

```yaml
Schedule: 1st of every month, 03:00 UTC
Duration: ~2 hours
Environment: Isolated staging database (not production)

Test Procedure:
  1. Select random daily backup from previous month
  2. Download from S3
  3. Decrypt
  4. Decompress
  5. Restore to temporary PostgreSQL instance
  6. Run validation queries:
     - Row count per organization
     - Verify RLS policies functional
     - Check foreign key integrity
     - Sample data spot-checks (10 random appointments)
     - Verify encryption keys can decrypt encrypted fields
  7. Generate test report
  8. Alert on-call if ANY validation fails
  9. Destroy temporary instance

Success Criteria:
  - Restore completes without errors
  - All row counts match backup metadata
  - All sampled data is readable and correct
  - Time to restore < 2 hours
```
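
A sketch of the step-6 validation queries, run by the test harness against the temporary instance. The connection string is illustrative, and the table and column names are assumptions based on the schema referenced elsewhere in this document:

```bash
# Validation queries for the restored instance (step 6).
# Connection string and table/column names are illustrative assumptions.
PGURL="postgres://restore_test@localhost:5432/restartix_platform"

# Row counts per organization, compared against the backup's metadata file.
psql "$PGURL" -c "
  SELECT organization_id, COUNT(*) AS appointment_rows
  FROM appointments
  GROUP BY organization_id
  ORDER BY organization_id;"

# Foreign-key integrity: appointments must reference existing patients.
psql "$PGURL" -c "
  SELECT COUNT(*) AS orphaned_appointments
  FROM appointments a
  LEFT JOIN patients p ON p.id = a.patient_id
  WHERE p.id IS NULL;"

# RLS policies present on tenant tables.
psql "$PGURL" -c "
  SELECT tablename, policyname
  FROM pg_policies
  WHERE schemaname = 'public';"
```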

Quarterly Disaster Recovery Drill

```yaml
Schedule: Last Saturday of quarter
Duration: 4 hours
Participants: Engineering team + CTO

Drill Scenarios (rotate each quarter):
  Q1: RDS regional outage → restore from Layer 2 to a fresh RDS instance
  Q2: S3 bucket compromised → restore from Layer 3 (cross-region replica)
  Q3: Complete provider failure → restore to a different PostgreSQL host
      (e.g., self-hosted on Hetzner, or a non-AWS cloud) to verify vendor
      independence holds
  Q4: Ransomware attack → restore from immutable Object-Locked backup

Success Criteria:
  - Full production database restored to functional state
  - Application can connect and serve requests
  - RTO/RPO targets met
  - All team members understand procedure
  - Runbook updated with lessons learned
```

Annual Audit Compliance Test

```yaml
Schedule: Before annual state audit
Duration: 1 day
Purpose: Prove 7-year retention and data integrity

Test Procedure:
  1. Select 10 random patients from 5-7 years ago
  2. Restore backup from that period (Layer 2 or 3)
  3. Extract their exercise logs, appointments, consent forms
  4. Verify data is complete and unmodified
  5. Generate audit report with:
     - Patient names (anonymized for test)
     - Service dates
     - Exercise/therapy session counts
     - Proof of consent signatures
  6. Present to auditor (if requested)

Success Criteria:
  - All requested historical data retrievable
  - Data matches original records (if cross-referenced)
  - Restore time < 4 hours
  - Data format is human-readable (for auditor review)
```

Data Integrity and Immutability

Cryptographic Verification

Every backup includes:

```json
{
    "backup_id": "2026-02-15-daily-001",
    "timestamp": "2026-02-15T02:00:00Z",
    "database_size_bytes": 524288000,
    "compressed_size_bytes": 52428800,
    "sha256_checksum": "a1b2c3d4e5f6...",
    "encryption_key_version": 2,
    "organization_count": 1000,
    "row_counts": {
        "appointments": 45000,
        "patients": 12000,
        "exercise_logs": 180000,
        "forms": 30000
    }
}
```

Verification Process:

  1. Before upload: Calculate SHA-256 checksum
  2. After upload: Re-verify the uploaded object's checksum, either via S3's native SHA-256 checksum support or a full re-download (a partial download cannot validate a SHA-256)
  3. Monthly test: Full download and checksum verification
  4. Before restore: Verify checksum matches metadata
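
A sketch of the upload-time variant using S3's built-in SHA-256 checksums; the object key is illustrative:

```bash
# Upload with an S3-computed SHA-256, then read it back for comparison.
# Key name is illustrative.
aws s3api put-object \
  --bucket restartix-backups-primary \
  --key daily/2026-02-15-core-full.pgdump.gz.enc \
  --body 2026-02-15-core-full.pgdump.gz.enc \
  --checksum-algorithm SHA256

aws s3api get-object-attributes \
  --bucket restartix-backups-primary \
  --key daily/2026-02-15-core-full.pgdump.gz.enc \
  --object-attributes Checksum

# Local value for comparison (S3 reports the SHA-256 base64-encoded).
openssl dgst -sha256 -binary 2026-02-15-core-full.pgdump.gz.enc | base64
```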

Why? Detects:

  • Silent data corruption during transfer
  • Bitrot in storage media
  • Tampering attempts
  • Incomplete uploads

Immutability Enforcement

S3 Object Lock (COMPLIANCE Mode):

```yaml
Configuration:
  Mode: COMPLIANCE
  Retention: 7 years from creation date

Guarantees:
  - Cannot be deleted by anyone (even AWS root account)
  - Cannot be modified (append-only)
  - Cannot shorten retention period
  - Can only be deleted after 7 years expire

Legal Basis:
  - HIPAA: 6-year medical record retention
  - State: 7-year financial audit window
  - GDPR: Allows retention for legal compliance (Art. 17(3))
```
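
To confirm COMPLIANCE mode is actually in force on a given artifact, the retention can be inspected per object; a sketch with an illustrative key:

```bash
# Inspect the retention applied to a backup object. In COMPLIANCE mode,
# nothing can remove the object before RetainUntilDate.
aws s3api get-object-retention \
  --bucket restartix-backups-primary \
  --key daily/2026-02-15-core-full.pgdump.gz.enc

# Expected shape of the response:
# { "Retention": { "Mode": "COMPLIANCE", "RetainUntilDate": "2033-02-15T00:00:00Z" } }
```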

Ransomware Protection: Even if an attacker:

  • Compromises AWS credentials
  • Deletes production database
  • Deletes RDS automated snapshots and PITR retention
  • Attempts to delete S3 backups

Result: Backups survive. Object Lock prevents deletion.


Backup Security

Access Control

```yaml
Who Can Access Backups:
  Production Database (RDS):
    - Application Fargate task role (read/write via pgbouncer, RLS-scoped)
    - Migration ECS task role (DDL, bypasses pgbouncer with DATABASE_DIRECT_URL)
    - Database administrator IAM role (superadmin, used only via SSM Session Manager)

  Layer 1 Backups (RDS automated snapshots + PITR):
    - Same RDS-account-scoped IAM controls as the live DB
    - Manual snapshot creation requires the operations IAM role

  Layer 2 Backups (Our S3):
    - Automated backup job (write-only IAM role on the backup bucket)
    - Database administrator (read-only for restore)
    - Security team (read-only for audit)
    - Bucket has separate KMS context from RDS — compromised RDS key cannot
      decrypt Layer 2

  Layer 3 Backups (Cross-region replica):
    - Replication service account (write-only)
    - CTO + on-call lead (read-only for disaster recovery)

  Layer 4 Backups (Offline, optional):
    - Physical access: CTO + COO (dual-custody)

Principle: Minimum necessary access, separation of duties, separate trust
           domains across layers (compromised credentials at one layer cannot
           reach the next).
```

Encryption Keys

```yaml
Key Hierarchy:
  Application Data Encryption:
    - Purpose: Encrypt sensitive fields (phone, API keys)
    - Storage: AWS Secrets Manager
    - Rotation: Quarterly

  Backup Encryption:
    - Purpose: Encrypt backup files before S3 upload
    - Storage: Separate from application keys (AWS Secrets Manager)
    - Rotation: Annually
    - Why separate? If app keys compromised, backups remain safe

  S3 Server-Side Encryption:
    - Purpose: Encryption at rest in S3
    - Storage: AWS-managed keys (SSE-S3)
    - Rotation: Automatic (AWS handles)
```
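
A sketch of provisioning and rotating the backup key in Secrets Manager. The secret name matches this document; note that because 7-year-old artifacts are encrypted under whatever key version their metadata records, rotated-out key values must also be archived (see Key Backup below):

```bash
# Provision the backup key, separate from application column-encryption keys.
aws secretsmanager create-secret \
  --name BACKUP_ENCRYPTION_KEY \
  --secret-string "$(openssl rand -base64 32)"

# Annual rotation: push a new value; the prior one stays addressable via the
# AWSPREVIOUS stage. Older versions must be archived offline before they age out,
# since old backups remain encrypted under the key version in their metadata.
aws secretsmanager put-secret-value \
  --secret-id BACKUP_ENCRYPTION_KEY \
  --secret-string "$(openssl rand -base64 32)"
```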

Key Backup: All encryption keys backed up to:

  1. Password manager (1Password/Bitwarden) - shared vault, restricted access
  2. Printed copy in physical safe (disaster recovery)

State Audit Compliance

What Auditors Will Request

Based on typical state insurance audits:

| Request | How We Provide It | Source |
|---|---|---|
| "Prove services were delivered for Patient X in 2023" | Export exercise logs, appointments, signed forms | Layer 2/3 backup (historical) |
| "Show all payments received vs services delivered" | Cross-reference appointments with invoices | Audit log + backup |
| "Prove this data hasn't been tampered with" | SHA-256 checksums, immutable S3 Object Lock | Backup metadata |
| "How do you prevent data loss?" | This document + test reports | Documentation |
| "Show me a backup from 5 years ago" | Restore from Layer 2 (Glacier Deep Archive) | S3 lifecycle retrieval |
| "What if your cloud provider fails?" | Layer 3 cross-region backup | Alternative provider restore |
| "Prove patients consented to treatment" | Signed consent forms with timestamps | Forms backup (status='signed') |

Audit-Ready Documentation

Maintain in a physical binder (for in-person audits):

  1. This backup strategy document (printed)
  2. Monthly backup test reports (last 12 months)
  3. Quarterly DR drill reports (last 4 quarters)
  4. Backup retention policy (signed by CTO)
  5. Data processing agreement with AWS (DPA + signed BAA)
  6. Sub-processor list (Cloudflare, Clerk, Daily.co, Anthropic — see external-providers.md)
  7. Encryption key rotation logs (dates only, not keys)
  8. Incident response plan (see monitoring.md)

Operational Runbooks

Runbook 1: Restore from RDS PITR (Minor Issues)

When to Use: Accidental DELETE/UPDATE, recent data corruption, anything inside the 7-day retention window.

Steps:

  1. Identify exact timestamp of corruption (check audit_log for the offending action; the audit row carries created_at and the actor)
  2. Open the AWS console → RDS → the production cluster
  3. Action → "Restore to point in time"
  4. Select restore time (up to second precision within the retention window)
  5. Critical: restore into a NEW DB instance, not by overwriting the live one. New instance name pattern: restartix-prod-pitr-YYYYMMDDhhmm
  6. Wait for restore (typically 15–30 minutes for a db.t4g.medium-sized DB)
  7. Connect via SSM Session Manager port forwarding to the new instance and validate: row counts, the specific data that was lost, RLS policy presence (\d+ patients etc.)
  8. If correct, choose recovery path:
    • Option A (preferred for partial recovery): export the recovered rows from the new instance and INSERT … ON CONFLICT them back into the live primary (see the sketch after this list). No application downtime.
    • Option B (for catastrophic recovery): point the application at the new instance by updating Secrets Manager DATABASE_URL and forcing an ECS service restart. The old instance becomes an evidence artifact.
  9. After verifying, delete the temporary instance OR snapshot it for later evidence
  10. Document the incident in the audit-log via the operations IAM role (action database.pitr_restore, with timestamps and decision rationale)
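
A sketch of the Option A merge using pg_dump's conflict-tolerant insert mode (available in PostgreSQL 12+). The host names and table are illustrative assumptions:

```bash
# Copy the recovered rows from the PITR instance into the live primary.
# --on-conflict-do-nothing leaves rows that still exist on the primary untouched.
# Hosts and table name are illustrative.
pg_dump "postgres://admin@restartix-prod-pitr-202602151200:5432/restartix_platform" \
  --data-only \
  --table=public.appointments \
  --column-inserts \
  --on-conflict-do-nothing \
  | psql "postgres://admin@restartix-prod-primary:5432/restartix_platform"
```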

RTO: 30–60 minutes (most of which is RDS spin-up time, not user-facing). RPO: near-zero (PITR is per-second within the 7-day retention window, though the latest restorable time can lag the live database by a few minutes)


Runbook 2: Restore from Daily Backup (Database Corruption)

When to Use: RDS automated backups unavailable, major corruption, data older than the 7-day PITR window, AWS-account-level compromise.

Steps:

  1. Identify target restore date
  2. Download backup from S3:
    aws s3 cp s3://restartix-backups-primary/daily/YYYY-MM-DD-core-full.pgdump.gz.enc ./
  3. Verify checksum:
    sha256sum YYYY-MM-DD-core-full.pgdump.gz.enc
    # Compare with metadata file
  4. Decrypt:
    openssl enc -d -aes-256-cbc -pbkdf2 -in YYYY-MM-DD-core-full.pgdump.gz.enc -out backup.pgdump.gz -pass env:BACKUP_ENCRYPTION_KEY
  5. Decompress:
    gunzip backup.pgdump.gz
  6. Provision new PostgreSQL instance (any PostgreSQL 17 — fresh RDS in another region, RDS in another AWS account, self-hosted on Hetzner / a different cloud — vendor independence is the point of Layer 2)
  7. Restore:
    pg_restore -d restartix_platform -v backup.pgdump
  8. Verify:
    • Row counts per organization
    • Sample data spot-checks
    • Application can connect
  9. Switch application connection string to restored instance
  10. Monitor for issues (check logs, error rates)
  11. Document incident

RTO: 1–2 hours. RPO: < 24 hours


Runbook 3: Restore from Cross-Region Backup (Regional Disaster)

When to Use: AWS region failure, primary S3 bucket unavailable, account-scoped credential compromise.

Steps:

  1. Access cross-region backup bucket:
    aws s3 ls s3://restartix-backups-replica/weekly/
  2. Download most recent weekly backup
  3. Follow Runbook 2 steps 3-11 (same restore procedure)
  4. Provision instance in DIFFERENT region
  5. Update DNS / load balancer to point to new region

RTO: 2–4 hours. RPO: < 7 days (weekly backup)


Runbook 4: Restore from Offline Archive (Catastrophic Scenario)

When to Use: All cloud infrastructure compromised/unavailable

Steps:

  1. Retrieve offline backup media from physical safe (requires dual-custody)
  2. Connect encrypted drive to secure workstation (air-gapped)
  3. Decrypt and extract backup
  4. Provision PostgreSQL instance (on-premises or different cloud provider)
  5. Follow Runbook 2 steps 6-11
  6. Manually configure application deployment to new infrastructure

RTO: 8–24 hours. RPO: < 90 days (quarterly backup)


Monitoring and Alerting

Backup Health Metrics

| Metric | Alert Threshold | Severity | Action |
|---|---|---|---|
| Daily backup failed | 1 failure | Critical | Page on-call, investigate immediately |
| Backup size anomaly | ±50% from expected | High | Verify data integrity, check for corruption |
| Backup upload incomplete | Any incomplete | Critical | Retry upload, verify network |
| Checksum mismatch | Any mismatch | Critical | Re-run backup, investigate corruption |
| S3 bucket replication lag | > 24 hours | Medium | Check replication rules, AWS status |
| Monthly restore test failed | Any failure | High | Debug restore procedure, fix issues |
| Backup older than 25 hours | No new backup in 25h | High | Check backup ECS task logs, RDS connectivity |
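
A sketch of wiring the backup-too-old alert, assuming the backup task publishes a custom success metric (the Restartix/Backups namespace and BackupCompleted metric are assumed names, not existing ones):

```bash
# Alarm when no successful backup has been recorded for ~24 hours.
# CloudWatch caps period × evaluation-periods at one day, so the table's
# 25-hour threshold is approximated with 24 hourly periods here.
# The backup task would publish BackupCompleted=1 on success via put-metric-data.
aws cloudwatch put-metric-alarm \
  --alarm-name backup-too-old \
  --namespace Restartix/Backups \
  --metric-name BackupCompleted \
  --statistic Sum \
  --period 3600 \
  --evaluation-periods 24 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:sns:eu-central-1:123456789012:oncall
```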

Dashboards

Grafana / Datadog:

  • Backup job success rate (7-day trend)
  • Backup file sizes (detect growth anomalies)
  • Restore test results (monthly pass/fail)
  • S3 storage costs (budget monitoring)
  • Time to complete backup (performance trend)

Cost Summary

Backup-specific cost (excludes the live database, which is itemized in aws-infrastructure.md → Cost: production day 1):

| Layer | Provider | Monthly Cost | Annual Cost | Purpose |
|---|---|---|---|---|
| 1: RDS PITR + snapshots | AWS RDS | Included in RDS | Included | Fast PITR (7 days) |
| 2: Daily logical | AWS S3 | ~$160 | ~$1,920 | Primary long-term safety |
| 3: Cross-region (EU) | AWS S3 | ~$60 | ~$720 | Geographic redundancy |
| 4: Offline (optional) | External SSD | ~$17 | ~$200 | Audit compliance (if required) |
| Backup total | | ~$220–237 | ~$2,640–2,840 | Full DR posture |

Cost assumes a 500 GB database with ~1 GB daily growth and 90% pg_dump compression — the same scale model as the original target of "1000 orgs / 500 GB DB" used elsewhere in this document. Actual cost scales linearly with DB size.

Per-Organization Cost: ~$0.22–0.24/month for comprehensive backup protection (at the 1000-org reference scale).

ROI Calculation:

  • Cost of data loss: Inability to claim insurance reimbursements + fraud liability + legal costs = millions of euros
  • Cost of backup: ~$2,600–2,900/year in storage (per the table above), plus engineering time for tests and drills
  • ROI: Effectively infinite (prevents catastrophic loss)

Implementation Status

The backup architecture is implemented in waves: what closes with Foundation 1E.3 (AWS staging deployment), what closes before production launch, and what closes after launch.

Closes with 1E.3 (foundation gate, before any production data)

The 1E.3 scope validates that the substrate works — Layer 1 active, Layer 2 IaC ships and the runbook is exercised end-to-end at least once. Whether the daily Layer 2 cron fires daily in staging is a separate knob (see "Staging knobs" below); 1E.3 doesn't require the cron to keep running.

Layer 1 (RDS / Aurora PITR + snapshots):

  • [ ] RDS Multi-AZ in eu-central-1 with 7-day automated backup retention + PITR
  • [ ] Aurora Serverless v2 in staging with 1-day backup retention
  • [ ] CloudWatch alarms for BackupRetentionPeriodStorageUsed, replica lag (Multi-AZ), failed snapshots

Layer 2 (daily pg_dump to S3) — IaC + one validated end-to-end test:

  • [ ] S3 backup bucket (restartix-backups-primary-{env}) with versioning enabled
  • [ ] S3 Object Lock in COMPLIANCE mode, 7-year retention on every object
  • [ ] S3 lifecycle policies (Standard → Glacier Instant Retrieval at day 91 → Glacier Deep Archive at day 731, matching the Layer 2 spec above)
  • [ ] BACKUP_ENCRYPTION_KEY provisioned in Secrets Manager, separate from application column-encryption keys
  • [ ] Daily pg_dump ECS task definition + EventBridge Scheduler rule provisioned (production schedule on; staging schedule off by default — see knobs)
  • [ ] IAM role for the backup task (S3 write-only on the backup bucket, RDS read on the database)
  • [ ] CloudWatch alarms for backup-failure, backup-too-old, checksum-mismatch (wired up; signal threshold tuning happens after the cron starts firing in production)
  • [ ] One manual end-to-end run against staging passes: pg_dump → gzip → encrypt → S3 upload → checksum verify → metadata recorded → restore-from-this-artifact runbook (Runbook 2) restores cleanly to a temporary RDS instance

The Terraform module that provisions this is the same module production reuses. Production launch should not be the first terraform apply of this code.

Staging knobs (turn on as needed before production launch)

  • Daily Layer 2 cron in staging. Off by default after 1E.3 (no real data, no cost benefit, just generates noise). Enable when a) tuning the backup-failure / backup-too-old alarm thresholds, or b) running a production-launch dress rehearsal with migrated legacy data. EventBridge schedule is a Terraform variable — no code change needed to flip.

Closes before production launch (operational gate, separate from F11)

  • [ ] Production-launch dress rehearsal: backup runs against the migrated legacy data, restore is exercised, all alarms are calibrated against real signals
  • [ ] On-call understands Runbook 2 (restore-from-daily-backup) end-to-end

Closes after launch (within first quarter)

  • [ ] Cross-region replication to a second EU region (Layer 3, weekly cadence)
  • [ ] Monthly automated restore test (Layer 2 to ephemeral DB, validation queries, alerting)
  • [ ] Quarterly DR drill (rotating scenarios per the matrix above)
  • [ ] Audit-compliance binder populated with the first quarter's reports

Ongoing

  • Monthly automated restore tests
  • Quarterly DR drills (team exercise)
  • Annual audit preparation
  • Review backup strategy yearly; update for any infra changes


Appendix: Fraud Prevention Evidence Requirements

What Data Proves Services Were Delivered?

For state insurance audits, the following data constitutes proof of service:

Evidence TypeData SourceRetentionWhy It Matters
Appointment attendanceappointments.status = 'done'7 yearsProves patient attended session
Exercise/therapy logs(Future feature - telemetry service)7 yearsProves exercises were performed
Video call metadataappointments.daily_room_name + Daily.co logs7 yearsProves real-time interaction occurred
Specialist notesappointment_documents (reports)7 yearsMedical documentation of session
Patient consentforms.status = 'signed'7 yearsProves patient authorized treatment
Prescription issuanceappointment_documents (prescriptions)7 yearsProves medical care provided
Payment records(External billing system)7 yearsCross-reference with services

Critical: Without backups, you cannot produce this evidence. Insurance claims can be retroactively denied up to 7 years later.

Example Audit Query

Auditor requests: "Prove services delivered for Patient ID 12345 in July 2023"

Our Response (from backup):

  1. Restore July 2023 backup
  2. Query:
    ```sql
    SELECT
      a.started_at,
      a.ended_at,
      a.status,
      s.name AS specialist_name,
      (SELECT COUNT(*) FROM forms WHERE appointment_id = a.id AND status = 'signed') AS signed_forms,
      (SELECT COUNT(*) FROM appointment_documents WHERE appointment_id = a.id) AS documents_generated
    FROM appointments a
    JOIN specialists s ON a.specialist_id = s.id
    WHERE a.patient_id = 12345
      AND a.started_at BETWEEN '2023-07-01' AND '2023-07-31'
      AND a.status = 'done';
    ```
  3. Export signed consent forms (PDF)
  4. Export prescription/report documents (PDF)
  5. Provide to auditor with checksums (proof of authenticity)

Result: Audit passed, no fraud accusations, insurance reimbursements validated.


Legal Review Checklist

Before finalizing the backup strategy, confirm with legal counsel:

  1. Retention period: Is 7 years sufficient, or does your state require longer?
  2. Offline backup: Does state audit explicitly require physical/offline backups?
  3. Geographic requirements: Must backups be stored within EU? Or can cross-region be US?
  4. Data sovereignty: Are there restrictions on cloud provider jurisdiction?
  5. Encryption standards: Are AES-256 and current key management procedures compliant?
  6. Audit frequency: How often should we expect state audits? (Affects test schedule)
  7. Evidence format: Do auditors require specific export formats (PDF, CSV, etc.)?

Action: Schedule meeting with legal team to review this document and confirm compliance requirements.


Document Version History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-02-15 | Engineering Team | Initial backup strategy for state-funded insurance compliance |
| 2.0 | 2026-05-07 | Engineering Team | Reframed Layer 0 / Layer 1 from Neon to AWS RDS + Aurora Serverless v2; cross-region target moved to a second EU region for GDPR compliance; runbooks updated for RDS PITR; implementation timeline aligned with Foundation 1E.3 |

Next Steps: Implement during 1E.3 (Foundation gate); first DR drill within the quarter after launch.