Disaster recovery

The actual recovery layers (continuous archiving, base backups, nightly dumps) and the honest limits of a single-region product.

The layers

Recovery is not one mechanism. There are three independent layers, each protecting against a different failure:

  • Continuous transaction-log archiving + daily physical base backups
    Cadence: log segments shipped continuously (forced at least every 60 seconds); base backup daily at 04:00 UTC
    Retention: the 30 most recent base backups (roughly a 30-day point-in-time window)
    Protects against: "We need the database exactly as it was at 14:32", and total loss of the host
  • Nightly logical dumps of every project database
    Cadence: daily at 02:15 UTC
    Retention: 14 days
    Protects against: per-database corruption or deletion, independent of the archive chain
  • Your own backups, on demand or scheduled
    Cadence: you choose
    Retention: you choose (1–365 days)
    Protects against: "Snapshot before the risky thing", named and on your terms
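During an incident the first question is usually "how far back is the state I need?". A minimal Python sketch of that lookup, using only the retention figures listed above. The function and layer names are hypothetical, and the boundaries are approximate: a layer that keeps N days of copies may not hold one from exactly N days ago.

```python
from datetime import timedelta
from typing import Optional

# Figures from the backup layers above. Both limits are approximate.
PITR_WINDOW = timedelta(days=30)             # 30 most recent daily base backups
NIGHTLY_DUMP_RETENTION = timedelta(days=14)  # nightly logical dumps


def layers_covering(age: timedelta, own_retention_days: Optional[int] = None) -> list[str]:
    """Return the layers that may still reach a state `age` in the past.

    own_retention_days: retention configured for your own backups (1-365),
    or None if you keep none. (Hypothetical helper, for illustration only.)
    """
    layers = []
    if age <= PITR_WINDOW:
        layers.append("pitr")          # any exact timestamp in the window
    if age <= NIGHTLY_DUMP_RETENTION:
        layers.append("nightly_dump")  # state as of a 02:15 UTC dump
    if own_retention_days is not None and age <= timedelta(days=own_retention_days):
        layers.append("own_backup")    # state as of one of your backups
    return layers
```

Beyond roughly 30 days the list shrinks to your own backups or nothing, which is the reason to set their retention deliberately.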

All backup artifacts are written to object storage off the database host, so losing the machine does not lose the backups. Each logical backup is verified by actually restoring it into a throwaway database before it is reported as complete — a backup that does not restore is treated as a failure, not a success.
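The verify-by-restore rule can be sketched as a small wrapper. This is an illustration, not CapyDB's backup job: `dump`, `restore_into_scratch`, and `drop_scratch` are hypothetical hooks standing in for whatever writes the artifact to object storage and loads it into a throwaway database.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BackupResult:
    artifact: str
    status: str  # "complete" or "failed"
    detail: str = ""


def run_verified_backup(
    dump: Callable[[], str],
    restore_into_scratch: Callable[[str], None],
    drop_scratch: Callable[[], None],
) -> BackupResult:
    """Report a backup as complete only after it has restored successfully.

    dump: writes the artifact off-host and returns its location.
    restore_into_scratch: loads the artifact into a throwaway database and
        raises on any error.
    drop_scratch: removes the throwaway database.
    """
    artifact = dump()
    try:
        restore_into_scratch(artifact)
    except Exception as exc:
        # A backup that does not restore is a failure, not a success.
        return BackupResult(artifact, "failed", f"restore check failed: {exc}")
    finally:
        drop_scratch()  # runs on both paths, so no scratch database is left behind
    return BackupResult(artifact, "complete")
```

The point of the design is that "complete" is only reachable through a successful restore; there is no code path that reports success on the strength of the dump alone.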

How point-in-time recovery works

PITR is available in regions with continuous-archiving storage (the project's region info shows whether it applies). When you request a restore to a timestamp:

  1. A temporary recovery instance is materialized from the most recent base backup taken before your timestamp.
  2. The archived transaction log is replayed up to the requested moment.
  3. Your single database is extracted from the recovered instance and loaded into the target — a preview database by default, or production with the explicit overwrite confirmation.
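Steps 1 and 2 reduce to a small calculation: which base backup to start from, and how much log must be replayed on top of it. A minimal sketch, assuming you have the base-backup start times as datetimes, all in UTC; `plan_pitr` is a hypothetical helper, not part of any CapyDB API.

```python
from datetime import datetime, timedelta


def plan_pitr(base_backups: list[datetime], target: datetime) -> tuple[datetime, timedelta]:
    """Return (base backup to restore, span of transaction log to replay after it).

    base_backups: start times of the retained base backups (UTC).
    target: the requested recovery timestamp (UTC).
    """
    # Step 1: the most recent base backup taken before the target.
    candidates = [b for b in base_backups if b < target]
    if not candidates:
        # Older than every retained base backup: outside the PITR window.
        raise ValueError("target predates the oldest retained base backup")
    base = max(candidates)
    # Step 2: everything between that backup and the target must be replayed.
    return base, target - base
```

For a 14:32 UTC target on a day whose base backup ran at 04:00 UTC, this picks that morning's backup and replays 10 h 32 min of log. A target just before 04:00 starts from the previous day's backup, so with daily base backups the replay span stays under 24 hours.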

The practical consequences:

  • Granularity is per-database. A PITR restore lands one project's database, not the whole host.
  • The window is bounded by base-backup retention — about 30 days. Beyond that, your own retained backups are the recovery path.
  • Duration scales with data size. Recovery restores a base backup, replays up to a day of transaction log on top of it, then dumps and reloads your database. Small databases restore in minutes; large ones take as long as they take. Restore into a preview when you want to measure it before you need it.
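The "about 30 days" figure follows from the schedule: the oldest restorable moment is the oldest retained base backup. A sketch of that arithmetic, assuming every daily 04:00 UTC base backup succeeded and exactly the 30 most recent are kept; the constant and function names are illustrative.

```python
from datetime import datetime, time, timedelta, timezone

BASE_BACKUP_TIME = time(4, 0, tzinfo=timezone.utc)  # daily at 04:00 UTC
RETAINED_BASE_BACKUPS = 30


def pitr_window_start(now: datetime) -> datetime:
    """Approximate oldest restorable moment, given a timezone-aware `now`.

    Assumes every daily base backup ran on schedule (a simplification).
    """
    now = now.astimezone(timezone.utc)
    today_backup = datetime.combine(now.date(), BASE_BACKUP_TIME)
    # If today's 04:00 backup has not run yet, the newest one is yesterday's.
    newest = today_backup if now >= today_backup else today_backup - timedelta(days=1)
    # 30 retained backups span 29 days back from the newest one.
    return newest - timedelta(days=RETAINED_BASE_BACKUPS - 1)
```

Called at 12:00 UTC on 30 June, it returns 04:00 UTC on 1 June: just over 29 days of reachable history, which is why the window is described as roughly 30 days.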

Realistic RPO and RTO framing

We are not going to print SLA numbers we have not earned. What the mechanics support:

  • Data-loss window (RPO): within the PITR window, restores hit an exact requested timestamp. If the host is lost outright, the recoverable state is everything up to the last archived log segment — archiving is continuous and segments are forced out at least every 60 seconds under normal operation, so the exposure is on the order of the final minute of writes, plus whatever a real incident does to shipping that last segment.
  • Recovery time (RTO): no committed number. It is dominated by database size and the failure mode — a single-database restore is a routine job; rebuilding from a lost host is an operator-driven incident. There is no automatic failover.
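The host-loss exposure is a subtraction: the loss time minus the end of the last segment that actually reached object storage. A sketch, where each archived segment is modelled as a hypothetical `(covers_writes_until, uploaded_at)` pair; a segment still in flight when the host dies does not count.

```python
from datetime import datetime, timedelta


def data_loss_window(
    segments: list[tuple[datetime, datetime]], host_lost_at: datetime
) -> timedelta:
    """Writes at risk if the host is lost at `host_lost_at`.

    segments: (covers_writes_until, uploaded_at) for each archived log segment.
    Only segments whose upload finished before the loss are recoverable.
    """
    safe = [covers for covers, uploaded in segments if uploaded <= host_lost_at]
    if not safe:
        raise ValueError("no segment reached object storage before the loss")
    return host_lost_at - max(safe)
```

With segments forced out every minute and a few seconds of upload time, a loss at 12:02:03 while the 12:02:00 segment is still uploading leaves 63 seconds of writes exposed: the final minute plus the segment that did not finish shipping.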

Host failure

The provisioning supports a warm standby: a second host streaming replication from the primary over a private network, with replication health (recovery state, replay lag) sampled continuously by the control plane. Failover to a standby is a manual operator action — a human promotes the standby; nothing flips automatically. This is a deliberate trade: automatic failover that fires on a false positive does more damage in a product this size than a human taking minutes to confirm.
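On stock PostgreSQL, the standby state described here can be observed with built-in functions. The sketch below assumes PostgreSQL 12 or newer (for `pg_promote()`) and is not necessarily what CapyDB's control plane runs; the 60-second threshold and the `promotion_advice` helper are hypothetical. In keeping with the manual-failover policy, the helper only reports and never promotes.

```python
from dataclasses import dataclass
from typing import Optional

# Sampled on the standby. pg_last_xact_replay_timestamp() is NULL until the
# standby has replayed a transaction. now() minus it also grows while the
# primary is idle, so a large value is a prompt to look, not proof of lag.
STANDBY_HEALTH_SQL = """
SELECT pg_is_in_recovery() AS in_recovery,
       EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS replay_lag_s
"""

# Run by a human on the standby once the primary is confirmed gone.
PROMOTE_SQL = "SELECT pg_promote();"


@dataclass
class StandbySample:
    in_recovery: bool
    replay_lag_s: Optional[float]  # None when the query returned NULL


def promotion_advice(sample: StandbySample, max_lag_s: float = 60.0) -> str:
    """Summarize a sample for the operator. Deliberately never promotes."""
    if not sample.in_recovery:
        return "not a standby: already promoted or misconfigured"
    if sample.replay_lag_s is None:
        return "standby has not replayed any transactions yet"
    if sample.replay_lag_s > max_lag_s:
        return (f"standby is {sample.replay_lag_s:.0f}s behind: promoting now "
                "may lose writes it has not replayed")
    return "standby is caught up: a human may decide to promote it"
```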

Single region, stated plainly

A project lives in one region on dedicated EU infrastructure (see Regions). There is no cross-region replication and no multi-region failover today. If your availability requirements demand a database that survives the loss of an entire region without operator involvement, CapyDB is not that product yet, and we would rather you read that here than discover it during an incident.

What you should do

  • Keep scheduled backups on with a retention that matches how far back you ever realistically need to go.
  • Pin a restore point before risky work — a named rollback target beats reconstructing a timestamp.
  • Rehearse: restore a backup into a preview once, now, while nothing is on fire. You will learn your real restore duration and confirm the backups contain what you think they do.
  • For belt-and-braces independence, pg_dump over the direct connection works as it does on any Postgres, so your data is never locked in.
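A minimal sketch of the independent pg_dump path, plus a timer for the rehearsal. It uses only standard pg_dump and pg_restore options; the connection URIs are placeholders for your project's direct connection and a preview database, and the helper names are hypothetical.

```python
import subprocess
import time


def dump_command(conn_uri: str, out_path: str) -> list[str]:
    """pg_dump argv for a custom-format archive that pg_restore can load later."""
    # conn_uri is your direct connection string. Prefer ~/.pgpass or
    # PGPASSWORD to embedding a password, which would show up in `ps` output.
    return ["pg_dump", "--format=custom", f"--file={out_path}", f"--dbname={conn_uri}"]


def restore_command(dump_path: str, target_uri: str) -> list[str]:
    """pg_restore argv for loading the archive into a scratch or preview database."""
    # --no-owner skips ownership changes, since role names rarely match
    # between the source and the rehearsal target.
    return ["pg_restore", "--no-owner", f"--dbname={target_uri}", dump_path]


def timed_run(argv: list[str]) -> float:
    """Run a command, raise on a non-zero exit, return elapsed seconds."""
    start = time.monotonic()
    subprocess.run(argv, check=True)
    return time.monotonic() - start
```

For example, `timed_run(dump_command("postgresql://app@DIRECT_HOST:5432/app", "app.dump"))` takes the dump, and timing `restore_command("app.dump", ...)` against a preview database gives you a real restore duration before you need one.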