# Backup And Restore Runbook
This runbook covers the packaged Postgres backup path for self-hosted control
planes. It applies when `controlPlane.backup.enabled=true` is set in the Helm
chart. Snowflake-native deployments use the warehouse durability and recovery
model instead.
## What Is Backed Up
The backup CronJob runs `pg_dump --format=custom` against
`AGENT_BOM_POSTGRES_URL` and uploads the resulting dump to S3-compatible
storage.
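Conceptually, the job body reduces to roughly the following. This is a hedged sketch, not the template verbatim; consult the Helm template below for the exact behavior. The object-key format mirrors the one used in the restore drill later in this runbook.

```bash
# Sketch of the CronJob's work under the packaged defaults.
STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
pg_dump --format=custom \
  --file="/tmp/agent-bom-${STAMP}.dump" \
  "$AGENT_BOM_POSTGRES_URL"
aws s3 cp "/tmp/agent-bom-${STAMP}.dump" \
  "s3://agent-bom-prod-backups/agent-bom/postgres/agent-bom-${STAMP}.dump"
```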
Packaged implementation:

- Helm template: `deploy/helm/agent-bom/templates/controlplane-backup-cronjob.yaml`
- Values: `deploy/helm/agent-bom/values.yaml`
- Restore script: `deploy/ops/restore-postgres-backup.sh`
- CI proof: `.github/workflows/backup-restore.yml`
## Production Prerequisites
Before enabling the CronJob:
- Create a dedicated backup bucket with public access blocked.
- Enable bucket versioning and retention appropriate for your RPO/RTO.
- Use SSE-KMS for regulated environments.
- Attach a least-privilege IRSA or workload-identity role to the backup service account.
- Confirm the `pg_dump` and `pg_restore` major versions match the Postgres major version (see the version check after this list).
- Store `AGENT_BOM_POSTGRES_URL` in your secret manager, not in inline chart values.
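A quick way to verify the version prerequisite, assuming the client tools are on your PATH and you have network access to the database:

```bash
# Client major versions must match the server's major version.
pg_dump --version
pg_restore --version
psql "$AGENT_BOM_POSTGRES_URL" -tAc "show server_version;"
```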
## Helm Values
Minimal shape:
```yaml
controlPlane:
  backup:
    enabled: true
    schedule: "0 3 * * *"
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/agent-bom-backup
    destination:
      bucket: agent-bom-prod-backups
      prefix: agent-bom/postgres
      bucketRegion: "<your-backup-bucket-region>"
    encryption:
      enabled: true
      mode: "aws:kms"
      kmsKeyId: alias/agent-bom-backups
```
Keep `bucketRegion` explicit. Do not rely on example regions for production.
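Applying the values might look like the following; the values file name is an assumption for illustration:

```bash
helm upgrade --install agent-bom deploy/helm/agent-bom \
  --namespace agent-bom \
  --values values-prod.yaml
```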
## Manual Backup Check
Confirm the latest backup object exists:
```bash
export AWS_REGION="<your-backup-bucket-region>"
aws s3 ls "s3://agent-bom-prod-backups/agent-bom/postgres/" \
  --region "$AWS_REGION" \
  --recursive \
  --human-readable \
  --summarize
```
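To assert freshness rather than eyeball the listing, pull the newest object's key and timestamp directly. This is a sketch using `aws s3api`; with the `0 3 * * *` schedule above, `LastModified` should be within the last day.

```bash
# Print the newest backup object under the configured prefix.
aws s3api list-objects-v2 \
  --bucket agent-bom-prod-backups \
  --prefix agent-bom/postgres/ \
  --region "$AWS_REGION" \
  --query 'sort_by(Contents, &LastModified)[-1].[Key, LastModified]' \
  --output text
```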
Check the CronJob:
```bash
kubectl -n agent-bom get cronjob agent-bom-control-plane-backup
kubectl -n agent-bom get jobs -l app.kubernetes.io/component=control-plane-backup
```
## Restore Drill
Restore into a staging database first. Do not restore directly over production until incident command has approved the data-loss window.
```bash
export AWS_REGION="<your-backup-bucket-region>"
export BACKUP_URI="s3://agent-bom-prod-backups/agent-bom/postgres/agent-bom-YYYYMMDDTHHMMSSZ.dump"
export RESTORE_POSTGRES_URL="postgresql://agent_bom:REDACTED@staging-postgres:5432/agent_bom"

deploy/ops/restore-postgres-backup.sh \
  "$BACKUP_URI" \
  "$RESTORE_POSTGRES_URL" \
  "$AWS_REGION"
```
Verify tenant-aware tables after restore:
psql "$RESTORE_POSTGRES_URL" -c "select count(*) from audit_log;"
psql "$RESTORE_POSTGRES_URL" -c "select team_id, count(*) from audit_log group by team_id order by team_id;"
Then run the control-plane smoke checks against the restored environment:
```bash
scripts/deploy/verify-eks-reference.sh \
  --cluster-name corp-ai \
  --region "$AWS_REGION" \
  --namespace agent-bom
```
## Production Restore Checklist
- Freeze scan schedules and external pushes before restore.
- Snapshot the current database if it is still reachable (see the sketch after this list).
- Restore to a new database or schema first when time allows.
- Verify row counts, tenant distribution, audit export, and auth key state.
- Point the API deployment at the restored database.
- Restart API, worker, gateway, and UI deployments.
- Re-run health, metrics, and signed compliance evidence checks.
- Record the backup URI, operator, approval, start time, end time, and data loss window in the incident record.
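For the snapshot step above, a plain `pg_dump` of the live database is usually enough. `PROD_POSTGRES_URL` is a placeholder for your production connection string, not a variable the chart sets:

```bash
# Capture the current state before restoring over it.
pg_dump --format=custom \
  --file="pre-restore-$(date -u +%Y%m%dT%H%M%SZ).dump" \
  "$PROD_POSTGRES_URL"
```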
## CI Guardrail
The backup workflow seeds synthetic tenant data, dumps it, wipes the database, restores through the production restore script, and verifies row counts plus tenant distribution. It runs on backup-script/template changes and weekly to catch upstream image drift.
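Conceptually, the round-trip reduces to something like the following. This is a hedged sketch of the idea only; the workflow's actual steps live in `.github/workflows/backup-restore.yml` and restore through `deploy/ops/restore-postgres-backup.sh`, and `audit_log` stands in here for the seeded tables.

```bash
# Dump, wipe, restore, and verify that the row count survives the trip.
before=$(psql "$AGENT_BOM_POSTGRES_URL" -tAc "select count(*) from audit_log;")
pg_dump --format=custom --file=/tmp/ci.dump "$AGENT_BOM_POSTGRES_URL"
# --clean --if-exists drops existing objects before recreating them.
pg_restore --clean --if-exists --dbname="$AGENT_BOM_POSTGRES_URL" /tmp/ci.dump
after=$(psql "$AGENT_BOM_POSTGRES_URL" -tAc "select count(*) from audit_log;")
[ "$before" = "$after" ] || { echo "row count mismatch: $before vs $after" >&2; exit 1; }
```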