
Backup and Disaster Recovery Strategy

Executive Summary

RestartiX processes state-funded insurance claims where exercise and therapy data serves as proof of service delivery. Loss of this data would:

  • Prevent reimbursement claims from insurance providers
  • Eliminate fraud prevention evidence
  • Expose the organization to legal liability
  • Violate HIPAA 6-year medical record retention requirements
  • Undermine audit compliance for state funding

Therefore: Data backup is not optional. This document defines a defense-in-depth backup strategy with multiple independent layers.


Data Retention Mandates

| Data Category | Retention Period | Legal Basis | Consequence of Loss |
|---|---|---|---|
| Exercise/therapy logs | 7 years | State insurance fraud prevention | Cannot prove services delivered, risk fraud accusations |
| Appointments | 6 years | HIPAA medical records | Cannot defend malpractice claims |
| Signed consent forms | 6 years | GDPR Art. 7 + HIPAA | Cannot prove patient consent |
| Prescriptions/reports | 6 years | HIPAA | Medical-legal liability |
| Audit logs | 6 years | HIPAA §164.312(b) | Cannot prove security compliance |
| Insurance claim metadata | 7 years | State financial audit requirements | Cannot reconcile payments |

Why 7 Years?

State financial audits can request records up to 7 years retroactively. If you cannot produce exercise logs proving services were delivered, you may be required to refund state payments and face fraud investigations.


Backup Architecture: The 3-2-1-1 Rule

We implement an enhanced version of the industry-standard 3-2-1 rule, adding a fourth layer for state compliance:

┌─────────────────────────────────────────────────────────────────┐
│  LAYER 0: LIVE PRODUCTION DATABASE                              │
│  - Production: AWS RDS Postgres 17, Multi-AZ                    │
│  - Staging:    Aurora Serverless v2, single-AZ                  │
│  - RLS-enforced multi-tenant isolation                          │
│  - Real-time data, constantly changing                          │
│  - RPO: 0 seconds (no data loss tolerance during operation)     │
│  - See [aws-infrastructure.md](/architecture/aws-infrastructure)│
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  LAYER 1: AWS-MANAGED BACKUPS (VENDOR-CONTROLLED)               │
│  - Continuous WAL streaming → point-in-time recovery (PITR)     │
│  - Production retention: 7 days (extensible to 35)              │
│  - Staging retention: 1 day (staging-grade)                     │
│  - Daily automated snapshots                                    │
│  - Manual snapshots before risky deploys                        │
│  - Cross-AZ standby (production) auto-promotes on failure       │
│  - Fast restore (minutes for PITR; hours for snapshot copy)     │
│                                                                  │
│  ⚠️  LIMITATION: AWS-account-scoped. Account compromise or      │
│      catastrophic AWS failure makes this layer unrecoverable.   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  LAYER 2: DAILY LOGICAL BACKUPS (OUR CONTROL - PRIMARY SAFETY) │
│  - Daily pg_dump to S3 (separate bucket from Layer 0)          │
│  - Encrypted with separate envelope key (BACKUP_ENCRYPTION_KEY) │
│  - Versioned and immutable (S3 Object Lock COMPLIANCE mode)    │
│  - Retention: 7 years (lifecycled to Glacier Deep Archive)     │
│  - Format: Custom PostgreSQL dump (compressed)                  │
│                                                                  │
│  ✅ GUARANTEES:                                                 │
│     - Restorable to any PostgreSQL 17 instance                  │
│     - Survives RDS-account-level compromise (separate IAM)      │
│     - Protected from ransomware (immutable storage)             │
│     - Meets state audit requirements                            │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  LAYER 3: WEEKLY CROSS-REGION REPLICATION (GEOGRAPHIC SAFETY)  │
│  - Weekly copy of Layer 2 backups to a different EU region     │
│  - Source: eu-central-1 (Frankfurt) → Replica: eu-west-1       │
│    (Ireland) or eu-west-3 (Paris)                              │
│  - GDPR: replication target stays inside the EU                │
│  - Protection against regional disasters                        │
│  - Same retention: 7 years                                      │
│  - Same immutability guarantees                                 │
│                                                                  │
│  ✅ GUARANTEES:                                                 │
│     - Survives entire AWS region failure                        │
│     - Survives natural disasters (floods, fires)                │
│     - Keeps data inside the EU while surviving regional events │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  LAYER 4: QUARTERLY OFFLINE ARCHIVE (COLD STORAGE - OPTIONAL)  │
│  - Quarterly snapshot exported to air-gapped storage            │
│  - Stored offline (disconnected from network)                   │
│  - Physical media or encrypted external drive                   │
│  - Retention: 7 years                                           │
│                                                                  │
│  ✅ GUARANTEES:                                                 │
│     - Survives complete cloud infrastructure compromise         │
│     - Ultimate protection against ransomware                    │
│     - Physical possession for legal/audit purposes              │
│                                                                  │
│  ⚠️  TRADE-OFF: Manual process, slower restore time            │
│  📋 RECOMMENDATION: Only if state audits explicitly require     │
└─────────────────────────────────────────────────────────────────┘

Disaster Recovery Scenarios

Scenario Matrix

| Disaster | Layer 1 (RDS PITR) | Layer 2 (Our S3) | Layer 3 (Cross-Region) | Layer 4 (Offline) |
|---|---|---|---|---|
| Accidental DELETE query | ✅ Restore via PITR (5 min) | ✅ Restore from yesterday (30 min) | ✅ Restore from last week (1 hour) | ✅ Restore from last quarter (4 hours) |
| Ransomware encrypts database | ❌ Encrypted | ✅ Immutable backup survives | ✅ Geographic copy survives | ✅ Offline copy survives |
| RDS AZ outage | ✅ Multi-AZ auto-failover (~60s) | ✅ Restore to new instance (1 hour) | ✅ Restore from replica (1 hour) | ✅ Restore from offline (4 hours) |
| AWS account compromised | ❌ All AWS resources at risk | ✅ Separate IAM, survives | ✅ Separate region scope, survives | ✅ Full restore to any Postgres |
| AWS S3 regional failure | ✅ RDS still operational | ❌ Primary backups unavailable | ✅ Cross-region copy available | ✅ Offline copy available |
| Developer accidentally drops table | ✅ PITR restore (10 min) | ✅ Restore specific table (20 min) | ✅ Restore specific table (30 min) | ✅ Restore specific table (2 hours) |
| State audit requests 5-year-old data | ❌ Outside retention window | ✅ Retrieve from S3 archive | ✅ Retrieve from cross-region | ✅ Retrieve from offline archive |
| Hacker deletes RDS snapshots | ❌ AWS-scoped backups compromised | ✅ Separate credentials, survives | ✅ Separate credentials, survives | ✅ Air-gapped, survives |
| Complete internet/cloud collapse | ❌ Inaccessible | ❌ Inaccessible | ❌ Inaccessible | ✅ Physical possession |

Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO)

| Scenario | RTO (Max Downtime) | RPO (Max Data Loss) | Recovery Source |
|---|---|---|---|
| AZ failure (production) | ~60 seconds | 0 seconds | RDS Multi-AZ auto-failover |
| Minor data corruption | 15 minutes | 0 seconds | RDS PITR |
| Table accidentally dropped | 30 minutes | < 1 hour | RDS PITR or daily backup |
| Database-wide corruption | 2 hours | < 24 hours | Daily backup (Layer 2) |
| Regional disaster | 4 hours | < 24 hours | Cross-region backup (Layer 3) |
| Complete provider failure | 8 hours | < 7 days | Weekly cross-region + daily backups |
| Catastrophic global event | 24 hours | < 90 days | Offline archive (Layer 4) |

Implementation Details

Layer 1: AWS-Managed Backups (RDS + Aurora)

What RDS provides (production):

  • Continuous backup via Write-Ahead Log (WAL) streaming to S3 (AWS-internal, separate from our Layer 2 bucket)
  • Point-in-time recovery (PITR) to any second within retention window
  • Daily automated snapshots, retained for the same window
  • Manual snapshots — taken before risky deploys, retained indefinitely until explicitly deleted (see the CLI sketch after this list)
  • Multi-AZ synchronous standby — auto-promoted on primary failure (~60s, DNS endpoint stable)
  • Backups taken from the standby — zero performance impact on the primary
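
Since the pre-deploy manual snapshot is an operator action, a minimal CLI sketch follows; the instance and snapshot identifiers are illustrative assumptions (the real names come from Terraform outputs):

```bash
# Take a manual snapshot before a risky migration; manual snapshots are
# retained until explicitly deleted. Identifiers here are illustrative.
SNAPSHOT_ID="restartix-prod-pre-deploy-$(date -u +%Y%m%d%H%M)"

aws rds create-db-snapshot \
  --db-instance-identifier restartix-prod \
  --db-snapshot-identifier "$SNAPSHOT_ID"

# Block the deploy until the snapshot is actually available.
aws rds wait db-snapshot-available --db-snapshot-identifier "$SNAPSHOT_ID"
```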

What Aurora Serverless v2 provides (staging):

  • Same continuous-backup model on Aurora's storage layer
  • PITR within the configured retention window
  • 1-day retention is sufficient for a staging environment

Configuration:

```yaml
RDS (production):
  Engine: PostgreSQL 17
  Instance: db.t4g.medium, Multi-AZ
  Storage: 50 GB gp3 (auto-scaling enabled to 200 GB)
  Backup retention: 7 days (extensible to 35)
  Continuous backup: Enabled (PITR to any second within window)
  Snapshot replication: Manual snapshots → Layer 3 cross-region copy
  Encryption at rest: AWS-managed KMS
  Force SSL: Enabled (rds.force_ssl=1)

Aurora Serverless v2 (staging):
  Engine: aurora-postgresql 17
  Capacity: 0.5–2 ACU, scale-to-zero
  Backup retention: 1 day
  Continuous backup: Enabled
  Encryption at rest: AWS-managed KMS

Cost (production):
  See aws-infrastructure.md → Cost: production day 1.
  RDS instance + storage + backups = ~$134/mo at db.t4g.medium Multi-AZ.
  Layer 1 backup-specific cost is included in the RDS line; AWS does not bill
  backup storage separately as long as it stays within the provisioned storage size.
```

Our responsibility:

  • Monitor the AWS Health Dashboard for RDS / Aurora regional events
  • Test PITR restore monthly into a temporary ephemeral DB (see Testing section; a CLI sketch follows this list)
  • Document the restore runbook (Runbook 1 below)
  • Take a manual snapshot before any risky migration or schema change
  • Track backup-window timing (production maintenance window set to a low-traffic UTC slot)
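
A sketch of the monthly PITR test restore from the list above; the identifiers and timestamp are illustrative assumptions:

```bash
# Restore a temporary instance to a point in time, validate, then delete it.
# Instance identifiers and the restore time are illustrative.
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier restartix-prod \
  --target-db-instance-identifier restartix-prod-pitr-test \
  --restore-time 2026-02-15T02:00:00Z \
  --no-multi-az \
  --db-instance-class db.t4g.medium

aws rds wait db-instance-available \
  --db-instance-identifier restartix-prod-pitr-test

# ...run the validation queries (see Testing section), then clean up:
aws rds delete-db-instance \
  --db-instance-identifier restartix-prod-pitr-test \
  --skip-final-snapshot
```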

Layer 2: Daily Logical Backups (CRITICAL LAYER)

Why This Layer is Critical:

  • Vendor independence: Can restore to any PostgreSQL provider
  • Fraud defense: Immutable proof of historical data
  • Audit compliance: Long-term retention (7 years)
  • Ransomware protection: Write-once, read-many storage

Backup Schedule:

```yaml
Frequency: Daily at 02:00 UTC (low-traffic window)
Method: pg_dump with custom format
Compression: gzip level 9
Encryption: AES-256 (separate key from application encryption)
```

Storage Strategy:

```yaml
Provider: AWS S3 (separate bucket from RDS automated backups; separate IAM,
          separate KMS context, different failure domain at the credentials layer)

Bucket Configuration:
  - Versioning: Enabled
  - Object Lock: COMPLIANCE mode (cannot delete for 7 years)
  - Lifecycle Policy:
      * Days 0-90:   S3 Standard (hot, fast retrieval)
      * Days 91-730: S3 Glacier Instant Retrieval (warm)
      * Days 731+:   S3 Glacier Deep Archive (cold, 12hr retrieval)
  - Encryption: AES-256 (server-side, AWS-managed keys)
  - Replication: Enable to Layer 3 (cross-region)

Naming Convention:
  s3://restartix-backups-primary/
    ├── daily/
    │   ├── 2026-02-15-core-full.pgdump.gz.enc
    │   ├── 2026-02-14-core-full.pgdump.gz.enc
    │   └── ...
    ├── weekly/  (Sunday snapshots, kept separately)
    │   ├── 2026-02-09-core-full.pgdump.gz.enc
    │   └── ...
    └── monthly/ (First of month, kept separately)
        ├── 2026-02-01-core-full.pgdump.gz.enc
        └── ...
```
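
The Object Lock and lifecycle settings above map onto standard S3 API calls; a sketch, assuming the bucket name from the naming convention (note that Object Lock can only be enabled at bucket creation, which also enables versioning):

```bash
# Create the backup bucket with Object Lock from day one. Bucket name assumed.
aws s3api create-bucket \
  --bucket restartix-backups-primary \
  --region eu-central-1 \
  --create-bucket-configuration LocationConstraint=eu-central-1 \
  --object-lock-enabled-for-bucket

# Default retention: COMPLIANCE mode, 7 years, applied to every new object.
aws s3api put-object-lock-configuration \
  --bucket restartix-backups-primary \
  --object-lock-configuration \
  '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Years":7}}}'

# Lifecycle: Standard → Glacier Instant Retrieval at day 91 → Deep Archive at day 731.
aws s3api put-bucket-lifecycle-configuration \
  --bucket restartix-backups-primary \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "backup-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [
        {"Days": 91,  "StorageClass": "GLACIER_IR"},
        {"Days": 731, "StorageClass": "DEEP_ARCHIVE"}
      ]
    }]
  }'
```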

Backup Process (Automated):

  1. Export: pg_dump --format=custom, run during the low-traffic window against the primary (a Multi-AZ instance standby is not readable; switch to a read replica when Phase 2 adds them)
  2. Compress: gzip -9 (90% compression ratio typical)
  3. Encrypt: AES-256 with backup-specific envelope key (BACKUP_ENCRYPTION_KEY in Secrets Manager, distinct from application column-encryption keys)
  4. Upload: S3 with metadata (DB size, row counts per org, SHA-256 checksum)
  5. Verify: Download and re-checksum the uploaded object
  6. Alert: Notify on backup failure, incomplete upload, or suspiciously small artifact

The job runs as a scheduled ECS task (EventBridge Scheduler → ECS RunTask), uses IAM credentials scoped to S3 write-only on the backup bucket and RDS read on the database.
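
A minimal sketch of that scheduled task's core pipeline. The host, bucket, and secret names are assumptions; the real values live in Terraform and Secrets Manager, and the real job also records per-org row counts in the metadata:

```bash
#!/usr/bin/env bash
# Nightly logical backup: dump → compress → encrypt → upload → verify.
# All names below (connection string, bucket, secret id) are illustrative.
set -euo pipefail

STAMP="$(date -u +%Y-%m-%d)"
ARTIFACT="${STAMP}-core-full.pgdump.gz.enc"

# Key comes from Secrets Manager at runtime, never from the task definition.
export BACKUP_ENCRYPTION_KEY="$(aws secretsmanager get-secret-value \
  --secret-id BACKUP_ENCRYPTION_KEY --query SecretString --output text)"

# Steps 1-3: dump (custom format), compress, encrypt in one stream —
# no plaintext artifact ever touches disk.
pg_dump --format=custom "$DATABASE_URL" \
  | gzip -9 \
  | openssl enc -aes-256-cbc -pbkdf2 -pass env:BACKUP_ENCRYPTION_KEY \
  > "$ARTIFACT"

# Step 4: upload with the checksum recorded as object metadata.
CHECKSUM="$(sha256sum "$ARTIFACT" | cut -d' ' -f1)"
aws s3 cp "$ARTIFACT" "s3://restartix-backups-primary/daily/$ARTIFACT" \
  --metadata "sha256=$CHECKSUM"

# Step 5: re-download and compare checksums; a mismatch fails the task,
# which trips the backup-failure alarm (step 6).
aws s3 cp "s3://restartix-backups-primary/daily/$ARTIFACT" "/tmp/$ARTIFACT"
echo "$CHECKSUM  /tmp/$ARTIFACT" | sha256sum --check
```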

Estimated Cost:

```yaml
Database Size: 500 GB
Daily Growth: 1 GB
Compression Ratio: 90% (compressed: 50 GB per backup)

Storage Costs (7 years = 2,555 days):
  Year 1 (365 days):
    - Daily backups: 365 × 50 GB = 18.25 TB
    - 90 days × $0.023/GB = $103/month (S3 Standard)
    - 275 days × $0.004/GB = $55/month (Glacier Instant)

  Years 2-7 (all in Glacier Deep Archive):
    - Total: ~100 TB
    - Cost: 100,000 GB × $0.00099/GB = $99/month

Total: ~$160/month (scales with DB size)
Per-org cost: $0.16/month (negligible)
```

Layer 3: Weekly Cross-Region Replication

Purpose:

  • Geographic redundancy (survive regional disasters)
  • Compliance with state requirements for off-site backups
  • Defense against geopolitical/infrastructure risks

Configuration:

```yaml
Source: eu-central-1 (Frankfurt) - Primary backup bucket
Destination: eu-west-1 (Ireland) - Cross-region replica
              [or eu-west-3 (Paris) — both are EU regions and acceptable]

GDPR note: replication target stays inside the EU. US regions are not used
            because GDPR Day-1 compliance requires patient data to remain
            within the EU. See decisions.md → Why clinic is controller, platform
            is processor.

Replication Rule:
  - Frequency: Weekly (Sunday after daily backup completes)
  - What to replicate: Weekly and monthly backups only (not all daily)
  - Storage class: S3 Glacier Instant Retrieval (cheaper, same compliance)
  - Retention: 7 years (same as primary)
  - Encryption: Replicate with same AES-256

Estimated Cost:
  - Storage: ~$50/month (subset of daily backups)
  - Data transfer: ~$10/month (cross-region replication, intra-EU)
  Total: ~$60/month
```
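
One way to get the weekly-and-monthly-only behavior is a prefix-filtered S3 replication rule: since new objects land under weekly/ and monthly/ at most once a week, live replication naturally yields the weekly cadence. A sketch, with bucket names and the IAM role ARN as illustrative assumptions (both buckets must have versioning enabled, which Object Lock already guarantees):

```bash
# Replicate only the weekly/ and monthly/ prefixes to the EU replica bucket.
# Bucket names and the role ARN are illustrative assumptions.
aws s3api put-bucket-replication \
  --bucket restartix-backups-primary \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/backup-replication",
    "Rules": [
      {
        "ID": "weekly-to-eu-west-1",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {"Prefix": "weekly/"},
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": {
          "Bucket": "arn:aws:s3:::restartix-backups-replica",
          "StorageClass": "GLACIER_IR"
        }
      },
      {
        "ID": "monthly-to-eu-west-1",
        "Status": "Enabled",
        "Priority": 2,
        "Filter": {"Prefix": "monthly/"},
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": {
          "Bucket": "arn:aws:s3:::restartix-backups-replica",
          "StorageClass": "GLACIER_IR"
        }
      }
    ]
  }'
```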

Layer 4: Quarterly Offline Archive (OPTIONAL)

When to Implement:

  • State auditors explicitly request offline backups
  • High-risk contracts with zero-tolerance data loss clauses
  • Legal requirement for physical evidence custody
  • Enhanced ransomware protection (air-gapped)

Implementation Options:

Option A: Encrypted External Drives

```yaml
Hardware:
  - 2TB enterprise-grade external SSD
  - Hardware encryption (FIPS 140-2 certified)

Process:
  1. Quarterly: Download latest monthly backup from S3
  2. Verify checksum
  3. Copy to encrypted drive
  4. Store in physical safe (fireproof, waterproof)
  5. Document in audit log (who, when, where)

Cost: ~€200/year (drive replacement every 3 years)
```

Option B: Tape Backup (Long-term archive)

```yaml
Hardware:
  - LTO-9 tape drive (~€3,000)
  - LTO-9 tapes (~€100/tape, 18TB capacity)

Process:
  - Quarterly: Write backup to tape
  - Store tapes in off-site vault service
  - 30-year shelf life (exceeds 7-year requirement)

Cost: ~€500/year (vault service + tapes)
```
Recommendation: Only for large institutional deployments with regulatory requirements; not in current platform scope

Recommendation: Start without Layer 4. Add only if:

  • State audit explicitly requires it
  • Legal counsel advises it
  • Insurance policy mandates it

Backup Testing and Validation

Critical Rule: Untested backups are not backups. They are "hopes."

Monthly Restore Test (Automated)

```yaml
Schedule: 1st of every month, 03:00 UTC
Duration: ~2 hours
Environment: Isolated staging database (not production)

Test Procedure:
  1. Select random daily backup from previous month
  2. Download from S3
  3. Decrypt
  4. Decompress
  5. Restore to temporary PostgreSQL instance
  6. Run validation queries:
     - Row count per organization
     - Verify RLS policies functional
     - Check foreign key integrity
     - Sample data spot-checks (10 random appointments)
     - Verify encryption keys can decrypt encrypted fields
  7. Generate test report
  8. Alert on-call if ANY validation fails
  9. Destroy temporary instance

Success Criteria:
  - Restore completes without errors
  - All row counts match backup metadata
  - All sampled data is readable and correct
  - Time to restore < 2 hours
```
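
A sketch of the step-6 validation queries, run by the test harness against the temporary instance. The connection string is illustrative, and the table and column names are assumptions based on the schema referenced elsewhere in this document:

```bash
# Validation queries for the restored instance (step 6).
# Connection string and table/column names are illustrative assumptions.
PGURL="postgres://restore_test@localhost:5432/restartix_platform"

# Row counts per organization, compared against the backup's metadata file.
psql "$PGURL" -c "
  SELECT organization_id, COUNT(*) AS appointment_rows
  FROM appointments
  GROUP BY organization_id
  ORDER BY organization_id;"

# Foreign-key integrity: appointments must reference existing patients.
psql "$PGURL" -c "
  SELECT COUNT(*) AS orphaned_appointments
  FROM appointments a
  LEFT JOIN patients p ON p.id = a.patient_id
  WHERE p.id IS NULL;"

# RLS policies present on tenant tables.
psql "$PGURL" -c "
  SELECT tablename, policyname
  FROM pg_policies
  WHERE schemaname = 'public';"
```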

Quarterly Disaster Recovery Drill

```yaml
Schedule: Last Saturday of quarter
Duration: 4 hours
Participants: Engineering team + CTO

Drill Scenarios (rotate each quarter):
  Q1: RDS regional outage → restore from Layer 2 to a fresh RDS instance
  Q2: S3 bucket compromised → restore from Layer 3 (cross-region replica)
  Q3: Complete provider failure → restore to a different PostgreSQL host
      (e.g., self-hosted on Hetzner, or a non-AWS cloud) to verify vendor
      independence holds
  Q4: Ransomware attack → restore from immutable Object-Locked backup

Success Criteria:
  - Full production database restored to functional state
  - Application can connect and serve requests
  - RTO/RPO targets met
  - All team members understand procedure
  - Runbook updated with lessons learned
```

Annual Audit Compliance Test

```yaml
Schedule: Before annual state audit
Duration: 1 day
Purpose: Prove 7-year retention and data integrity

Test Procedure:
  1. Select 10 random patients from 5-7 years ago
  2. Restore backup from that period (Layer 2 or 3)
  3. Extract their exercise logs, appointments, consent forms
  4. Verify data is complete and unmodified
  5. Generate audit report with:
     - Patient names (anonymized for test)
     - Service dates
     - Exercise/therapy session counts
     - Proof of consent signatures
  6. Present to auditor (if requested)

Success Criteria:
  - All requested historical data retrievable
  - Data matches original records (if cross-referenced)
  - Restore time < 4 hours
  - Data format is human-readable (for auditor review)
```

Data Integrity and Immutability

Cryptographic Verification

Every backup includes:

```json
{
    "backup_id": "2026-02-15-daily-001",
    "timestamp": "2026-02-15T02:00:00Z",
    "database_size_bytes": 524288000,
    "compressed_size_bytes": 52428800,
    "sha256_checksum": "a1b2c3d4e5f6...",
    "encryption_key_version": 2,
    "organization_count": 1000,
    "row_counts": {
        "appointments": 45000,
        "patients": 12000,
        "exercise_logs": 180000,
        "forms": 30000
    }
}
```

Verification Process:

  1. Before upload: Calculate SHA-256 checksum
  2. After upload: Re-verify the uploaded object's checksum, either via S3's native SHA-256 checksum support or a full re-download (a partial download cannot validate a SHA-256)
  3. Monthly test: Full download and checksum verification
  4. Before restore: Verify checksum matches metadata
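
A sketch of the upload-time variant using S3's built-in SHA-256 checksums; the object key is illustrative:

```bash
# Upload with an S3-computed SHA-256, then read it back for comparison.
# Key name is illustrative.
aws s3api put-object \
  --bucket restartix-backups-primary \
  --key daily/2026-02-15-core-full.pgdump.gz.enc \
  --body 2026-02-15-core-full.pgdump.gz.enc \
  --checksum-algorithm SHA256

aws s3api get-object-attributes \
  --bucket restartix-backups-primary \
  --key daily/2026-02-15-core-full.pgdump.gz.enc \
  --object-attributes Checksum

# Local value for comparison (S3 reports the SHA-256 base64-encoded).
openssl dgst -sha256 -binary 2026-02-15-core-full.pgdump.gz.enc | base64
```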

Why? Detects:

  • Silent data corruption during transfer
  • Bitrot in storage media
  • Tampering attempts
  • Incomplete uploads

Immutability Enforcement

S3 Object Lock (COMPLIANCE Mode):

```yaml
Configuration:
  Mode: COMPLIANCE
  Retention: 7 years from creation date

Guarantees:
  - Cannot be deleted by anyone (even AWS root account)
  - Cannot be modified (append-only)
  - Cannot shorten retention period
  - Can only be deleted after 7 years expire

Legal Basis:
  - HIPAA: 6-year medical record retention
  - State: 7-year financial audit window
  - GDPR: Allows retention for legal compliance (Art. 17(3))
```
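
To confirm COMPLIANCE mode is actually in force on a given artifact, the retention can be inspected per object; a sketch with an illustrative key:

```bash
# Inspect the retention applied to a backup object. In COMPLIANCE mode,
# nothing can remove the object before RetainUntilDate.
aws s3api get-object-retention \
  --bucket restartix-backups-primary \
  --key daily/2026-02-15-core-full.pgdump.gz.enc

# Expected shape of the response:
# { "Retention": { "Mode": "COMPLIANCE", "RetainUntilDate": "2033-02-15T00:00:00Z" } }
```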

Ransomware Protection: Even if an attacker:

  • Compromises AWS credentials
  • Deletes production database
  • Deletes RDS automated snapshots and PITR retention
  • Attempts to delete S3 backups

Result: Backups survive. Object Lock prevents deletion.


Backup Security

Access Control

```yaml
Who Can Access Backups:
  Production Database (RDS):
    - Application Fargate task role (read/write via pgbouncer, RLS-scoped)
    - Migration ECS task role (DDL, bypasses pgbouncer with DATABASE_DIRECT_URL)
    - Database administrator IAM role (superadmin, used only via SSM Session Manager)

  Layer 1 Backups (RDS automated snapshots + PITR):
    - Same RDS-account-scoped IAM controls as the live DB
    - Manual snapshot creation requires the operations IAM role

  Layer 2 Backups (Our S3):
    - Automated backup job (write-only IAM role on the backup bucket)
    - Database administrator (read-only for restore)
    - Security team (read-only for audit)
    - Bucket has separate KMS context from RDS — compromised RDS key cannot
      decrypt Layer 2

  Layer 3 Backups (Cross-region replica):
    - Replication service account (write-only)
    - CTO + on-call lead (read-only for disaster recovery)

  Layer 4 Backups (Offline, optional):
    - Physical access: CTO + COO (dual-custody)

Principle: Minimum necessary access, separation of duties, separate trust
           domains across layers (compromised credentials at one layer cannot
           reach the next).
```

Encryption Keys

```yaml
Key Hierarchy:
  Application Data Encryption:
    - Purpose: Encrypt sensitive fields (phone, API keys)
    - Storage: AWS Secrets Manager
    - Rotation: Quarterly

  Backup Encryption:
    - Purpose: Encrypt backup files before S3 upload
    - Storage: Separate from application keys (AWS Secrets Manager)
    - Rotation: Annually
    - Why separate? If app keys compromised, backups remain safe

  S3 Server-Side Encryption:
    - Purpose: Encryption at rest in S3
    - Storage: AWS-managed keys (SSE-S3)
    - Rotation: Automatic (AWS handles)
```
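
A sketch of provisioning and rotating the backup key in Secrets Manager. The secret name matches this document; note that because 7-year-old artifacts are encrypted under whatever key version their metadata records, rotated-out key values must also be archived (see Key Backup below):

```bash
# Provision the backup key, separate from application column-encryption keys.
aws secretsmanager create-secret \
  --name BACKUP_ENCRYPTION_KEY \
  --secret-string "$(openssl rand -base64 32)"

# Annual rotation: push a new value; the prior one stays addressable via the
# AWSPREVIOUS stage. Older versions must be archived offline before they age out,
# since old backups remain encrypted under the key version in their metadata.
aws secretsmanager put-secret-value \
  --secret-id BACKUP_ENCRYPTION_KEY \
  --secret-string "$(openssl rand -base64 32)"
```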

Key Backup: All encryption keys backed up to:

  1. Password manager (1Password/Bitwarden) - shared vault, restricted access
  2. Printed copy in physical safe (disaster recovery)

State Audit Compliance

What Auditors Will Request

Based on typical state insurance audits:

| Request | How We Provide It | Source |
|---|---|---|
| "Prove services were delivered for Patient X in 2023" | Export exercise logs, appointments, signed forms | Layer 2/3 backup (historical) |
| "Show all payments received vs services delivered" | Cross-reference appointments with invoices | Audit log + backup |
| "Prove this data hasn't been tampered with" | SHA-256 checksums, immutable S3 Object Lock | Backup metadata |
| "How do you prevent data loss?" | This document + test reports | Documentation |
| "Show me a backup from 5 years ago" | Restore from Layer 2 (Glacier Deep Archive) | S3 lifecycle retrieval |
| "What if your cloud provider fails?" | Layer 3 cross-region backup | Alternative provider restore |
| "Prove patients consented to treatment" | Signed consent forms with timestamps | Forms backup (status='signed') |

Audit-Ready Documentation

Maintain in a physical binder (for in-person audits):

  1. This backup strategy document (printed)
  2. Monthly backup test reports (last 12 months)
  3. Quarterly DR drill reports (last 4 quarters)
  4. Backup retention policy (signed by CTO)
  5. Data processing agreement with AWS (DPA + signed BAA)
  6. Sub-processor list (Cloudflare, Clerk, Daily.co, Anthropic — see external-providers.md)
  7. Encryption key rotation logs (dates only, not keys)
  8. Incident response plan (see monitoring.md)

Operational Runbooks

Runbook 1: Restore from RDS PITR (Minor Issues)

When to Use: Accidental DELETE/UPDATE, recent data corruption, anything inside the 7-day retention window.

Steps:

  1. Identify exact timestamp of corruption (check audit_log for the offending action; the audit row carries created_at and the actor)
  2. Open the AWS console → RDS → the production cluster
  3. Action → "Restore to point in time"
  4. Select restore time (up to second precision within the retention window)
  5. Critical: restore into a NEW DB instance, not by overwriting the live one. New instance name pattern: restartix-prod-pitr-YYYYMMDDhhmm
  6. Wait for restore (typically 15–30 minutes for a db.t4g.medium-sized DB)
  7. Connect via SSM Session Manager port forwarding to the new instance and validate: row counts, the specific data that was lost, RLS policy presence (\d+ patients etc.)
  8. If correct, choose recovery path:
    • Option A (preferred for partial recovery): export the recovered rows from the new instance and INSERT … ON CONFLICT them back into the live primary (see the sketch after this list). No application downtime.
    • Option B (for catastrophic recovery): point the application at the new instance by updating Secrets Manager DATABASE_URL and forcing an ECS service restart. The old instance becomes an evidence artifact.
  9. After verifying, delete the temporary instance OR snapshot it for later evidence
  10. Document the incident in the audit-log via the operations IAM role (action database.pitr_restore, with timestamps and decision rationale)
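
A sketch of the Option A merge using pg_dump's conflict-tolerant insert mode (available in PostgreSQL 12+). The host names and table are illustrative assumptions:

```bash
# Copy the recovered rows from the PITR instance into the live primary.
# --on-conflict-do-nothing leaves rows that still exist on the primary untouched.
# Hosts and table name are illustrative.
pg_dump "postgres://admin@restartix-prod-pitr-202602151200:5432/restartix_platform" \
  --data-only \
  --table=public.appointments \
  --column-inserts \
  --on-conflict-do-nothing \
  | psql "postgres://admin@restartix-prod-primary:5432/restartix_platform"
```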

RTO: 30–60 minutes (most of which is RDS spin-up time, not user-facing). RPO: near-zero (PITR is per-second within the 7-day retention window, though the latest restorable time can lag the live database by a few minutes)


Runbook 2: Restore from Daily Backup (Database Corruption)

When to Use: RDS automated backups unavailable, major corruption, data older than the 7-day PITR window, AWS-account-level compromise.

Steps:

  1. Identify target restore date
  2. Download backup from S3:
    aws s3 cp s3://restartix-backups-primary/daily/YYYY-MM-DD-core-full.pgdump.gz.enc ./
  3. Verify checksum:
    sha256sum YYYY-MM-DD-core-full.pgdump.gz.enc
    # Compare with metadata file
  4. Decrypt:
    openssl enc -d -aes-256-cbc -pbkdf2 -in YYYY-MM-DD-core-full.pgdump.gz.enc -out backup.pgdump.gz -pass env:BACKUP_ENCRYPTION_KEY
  5. Decompress:
    gunzip backup.pgdump.gz
  6. Provision new PostgreSQL instance (any PostgreSQL 17 — fresh RDS in another region, RDS in another AWS account, self-hosted on Hetzner / a different cloud — vendor independence is the point of Layer 2)
  7. Restore:
    pg_restore -d restartix_platform -v backup.pgdump
  8. Verify:
    • Row counts per organization
    • Sample data spot-checks
    • Application can connect
  9. Switch application connection string to restored instance
  10. Monitor for issues (check logs, error rates)
  11. Document incident

RTO: 1–2 hours. RPO: < 24 hours


Runbook 3: Restore from Cross-Region Backup (Regional Disaster)

When to Use: AWS region failure, primary S3 bucket unavailable, account-scoped credential compromise.

Steps:

  1. Access cross-region backup bucket:
    aws s3 ls s3://restartix-backups-replica/weekly/
  2. Download most recent weekly backup
  3. Follow Runbook 2 steps 3-11 (same restore procedure)
  4. Provision instance in DIFFERENT region
  5. Update DNS / load balancer to point to new region

RTO: 2–4 hours. RPO: < 7 days (weekly backup)


Runbook 4: Restore from Offline Archive (Catastrophic Scenario)

When to Use: All cloud infrastructure compromised/unavailable

Steps:

  1. Retrieve offline backup media from physical safe (requires dual-custody)
  2. Connect encrypted drive to secure workstation (air-gapped)
  3. Decrypt and extract backup
  4. Provision PostgreSQL instance (on-premises or different cloud provider)
  5. Follow Runbook 2 steps 6-11
  6. Manually configure application deployment to new infrastructure

RTO: 8–24 hours. RPO: < 90 days (quarterly backup)


Monitoring and Alerting

Backup Health Metrics

| Metric | Alert Threshold | Severity | Action |
|---|---|---|---|
| Daily backup failed | 1 failure | Critical | Page on-call, investigate immediately |
| Backup size anomaly | ±50% from expected | High | Verify data integrity, check for corruption |
| Backup upload incomplete | Any incomplete | Critical | Retry upload, verify network |
| Checksum mismatch | Any mismatch | Critical | Re-run backup, investigate corruption |
| S3 bucket replication lag | > 24 hours | Medium | Check replication rules, AWS status |
| Monthly restore test failed | Any failure | High | Debug restore procedure, fix issues |
| Backup older than 25 hours | No new backup in 25h | High | Check backup ECS task logs, RDS connectivity |
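
A sketch of wiring the backup-too-old alert, assuming the backup task publishes a custom success metric (the Restartix/Backups namespace and BackupCompleted metric are assumed names, not existing ones):

```bash
# Alarm when no successful backup has been recorded for ~24 hours.
# CloudWatch caps period × evaluation-periods at one day, so the table's
# 25-hour threshold is approximated with 24 hourly periods here.
# The backup task would publish BackupCompleted=1 on success via put-metric-data.
aws cloudwatch put-metric-alarm \
  --alarm-name backup-too-old \
  --namespace Restartix/Backups \
  --metric-name BackupCompleted \
  --statistic Sum \
  --period 3600 \
  --evaluation-periods 24 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:sns:eu-central-1:123456789012:oncall
```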

Dashboards

Grafana / Datadog:

  • Backup job success rate (7-day trend)
  • Backup file sizes (detect growth anomalies)
  • Restore test results (monthly pass/fail)
  • S3 storage costs (budget monitoring)
  • Time to complete backup (performance trend)

Cost Summary

Backup-specific cost (excludes the live database, which is itemized in aws-infrastructure.md → Cost: production day 1):

| Layer | Provider | Monthly Cost | Annual Cost | Purpose |
|---|---|---|---|---|
| 1: RDS PITR + snapshots | AWS RDS | Included in RDS | Included | Fast PITR (7 days) |
| 2: Daily logical | AWS S3 | ~$160 | ~$1,920 | Primary long-term safety |
| 3: Cross-region (EU) | AWS S3 | ~$60 | ~$720 | Geographic redundancy |
| 4: Offline (optional) | External SSD | ~$17 | ~$200 | Audit compliance (if required) |
| Backup total | | ~$220–237 | ~$2,640–2,840 | Full DR posture |

Cost assumes a 500 GB database with ~1 GB daily growth and 90% pg_dump compression — the same scale model as the original target of "1000 orgs / 500 GB DB" used elsewhere in this document. Actual cost scales linearly with DB size.

Per-Organization Cost: ~$0.22–0.24/month for comprehensive backup protection (at the 1000-org reference scale).

ROI Calculation:

  • Cost of data loss: Inability to claim insurance reimbursements + fraud liability + legal costs = millions of euros
  • Cost of backup: ~$2,600–2,900/year in storage (per the table above), plus engineering time for tests and drills
  • ROI: Effectively infinite (prevents catastrophic loss)

Implementation Status

The backup architecture is implemented in waves: what closes with Foundation 1E.3 (AWS staging deployment), what closes before production launch, and what closes after launch.

Closes with 1E.3 (foundation gate, before any production data)

The 1E.3 scope validates that the substrate works — Layer 1 active, Layer 2 IaC ships and the runbook is exercised end-to-end at least once. Whether the daily Layer 2 cron fires daily in staging is a separate knob (see "Staging knobs" below); 1E.3 doesn't require the cron to keep running.

Layer 1 (RDS / Aurora PITR + snapshots):

  • [ ] RDS Multi-AZ in eu-central-1 with 7-day automated backup retention + PITR
  • [ ] Aurora Serverless v2 in staging with 1-day backup retention
  • [ ] CloudWatch alarms for BackupRetentionPeriodStorageUsed, replica lag (Multi-AZ), failed snapshots

Layer 2 (daily pg_dump to S3) — IaC + one validated end-to-end test:

  • [ ] S3 backup bucket (restartix-backups-primary-{env}) with versioning enabled
  • [ ] S3 Object Lock in COMPLIANCE mode, 7-year retention on every object
  • [ ] S3 lifecycle policies (Standard → Glacier Instant Retrieval at day 91 → Glacier Deep Archive at day 731, matching the Layer 2 spec above)
  • [ ] BACKUP_ENCRYPTION_KEY provisioned in Secrets Manager, separate from application column-encryption keys
  • [ ] Daily pg_dump ECS task definition + EventBridge Scheduler rule provisioned (production schedule on; staging schedule off by default — see knobs)
  • [ ] IAM role for the backup task (S3 write-only on the backup bucket, RDS read on the database)
  • [ ] CloudWatch alarms for backup-failure, backup-too-old, checksum-mismatch (wired up; signal threshold tuning happens after the cron starts firing in production)
  • [ ] One manual end-to-end run against staging passes: pg_dump → gzip → encrypt → S3 upload → checksum verify → metadata recorded → restore-from-this-artifact runbook (Runbook 2) restores cleanly to a temporary RDS instance

The Terraform module that provisions this is the same module production reuses. Production launch should not be the first terraform apply of this code.

Staging knobs (turn on as needed before production launch)

  • Daily Layer 2 cron in staging. Off by default after 1E.3 (no real data, no cost benefit, just generates noise). Enable when a) tuning the backup-failure / backup-too-old alarm thresholds, or b) running a production-launch dress rehearsal with migrated legacy data. EventBridge schedule is a Terraform variable — no code change needed to flip.

Closes before production launch (operational gate, separate from F11)

  • [ ] Production-launch dress rehearsal: backup runs against the migrated legacy data, restore is exercised, all alarms are calibrated against real signals
  • [ ] On-call understands Runbook 2 (restore-from-daily-backup) end-to-end

Closes after launch (within first quarter)

  • [ ] Cross-region replication to a second EU region (Layer 3, weekly cadence)
  • [ ] Monthly automated restore test (Layer 2 to ephemeral DB, validation queries, alerting)
  • [ ] Quarterly DR drill (rotating scenarios per the matrix above)
  • [ ] Audit-compliance binder populated with the first quarter's reports

Ongoing

  • Monthly automated restore tests
  • Quarterly DR drills (team exercise)
  • Annual audit preparation
  • Review backup strategy yearly; update for any infra changes


Appendix: Fraud Prevention Evidence Requirements

What Data Proves Services Were Delivered?

For state insurance audits, the following data constitutes proof of service:

Evidence TypeData SourceRetentionWhy It Matters
Appointment attendanceappointments.status = 'done'7 yearsProves patient attended session
Exercise/therapy logs(Future feature - telemetry service)7 yearsProves exercises were performed
Video call metadataappointments.daily_room_name + Daily.co logs7 yearsProves real-time interaction occurred
Specialist notesappointment_documents (reports)7 yearsMedical documentation of session
Patient consentforms.status = 'signed'7 yearsProves patient authorized treatment
Prescription issuanceappointment_documents (prescriptions)7 yearsProves medical care provided
Payment records(External billing system)7 yearsCross-reference with services

Critical: Without backups, you cannot produce this evidence. Insurance claims can be retroactively denied up to 7 years later.

Example Audit Query

Auditor requests: "Prove services delivered for Patient ID 12345 in July 2023"

Our Response (from backup):

  1. Restore July 2023 backup
  2. Query:
    ```sql
    SELECT
      a.started_at,
      a.ended_at,
      a.status,
      s.name AS specialist_name,
      (SELECT COUNT(*) FROM forms WHERE appointment_id = a.id AND status = 'signed') AS signed_forms,
      (SELECT COUNT(*) FROM appointment_documents WHERE appointment_id = a.id) AS documents_generated
    FROM appointments a
    JOIN specialists s ON a.specialist_id = s.id
    WHERE a.patient_id = 12345
      AND a.started_at BETWEEN '2023-07-01' AND '2023-07-31'
      AND a.status = 'done';
    ```
  3. Export signed consent forms (PDF)
  4. Export prescription/report documents (PDF)
  5. Provide to auditor with checksums (proof of authenticity)

Result: Audit passed, no fraud accusations, insurance reimbursements validated.


Legal Review Checklist

Before finalizing the backup strategy, confirm with legal counsel:

  1. Retention period: Is 7 years sufficient, or does your state require longer?
  2. Offline backup: Does state audit explicitly require physical/offline backups?
  3. Geographic requirements: Must backups be stored within EU? Or can cross-region be US?
  4. Data sovereignty: Are there restrictions on cloud provider jurisdiction?
  5. Encryption standards: Are AES-256 and current key management procedures compliant?
  6. Audit frequency: How often should we expect state audits? (Affects test schedule)
  7. Evidence format: Do auditors require specific export formats (PDF, CSV, etc.)?

Action: Schedule meeting with legal team to review this document and confirm compliance requirements.


Document Version History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-02-15 | Engineering Team | Initial backup strategy for state-funded insurance compliance |
| 2.0 | 2026-05-07 | Engineering Team | Reframed Layer 0 / Layer 1 from Neon to AWS RDS + Aurora Serverless v2; cross-region target moved to a second EU region for GDPR compliance; runbooks updated for RDS PITR; implementation timeline aligned with Foundation 1E.3 |

Next Steps: Implement during 1E.3 (Foundation gate); first DR drill within the quarter after launch.