Physical Cluster Replication Technical Overview

Note:

Physical cluster replication is only supported in CockroachDB self-hosted clusters.

Physical cluster replication (PCR) automatically and continuously streams data from an active primary CockroachDB cluster to a passive standby cluster. Each cluster contains two types of virtual cluster: a system virtual cluster and an application virtual cluster:

  • The system virtual cluster manages the cluster's control plane and the replication of the cluster's data. Admins connect to the system virtual cluster to configure and manage the underlying CockroachDB cluster, set up PCR, create and manage a virtual cluster, and observe metrics and logs for the CockroachDB cluster and each virtual cluster.
  • Each application virtual cluster manages its own data plane. Users connect to a virtual cluster by default, rather than the system virtual cluster; to connect to the system virtual cluster, the connection string must be modified. Virtual clusters contain user data and run application workloads. When PCR is enabled, the non-system virtual cluster on both the primary and standby clusters is named main.

This separation of concerns means that the replication stream can operate without affecting work happening in a virtual cluster.

PCR stream start-up sequence

Starting a physical replication stream involves two jobs: one on the standby cluster and one on the primary cluster:

  • Standby consumer job: Communicates with the primary cluster via an ordinary SQL connection and is responsible for initiating the replication stream. The consumer job ingests updates from the primary cluster producer job.
  • Primary producer job: Protects data on the primary cluster and sends updates to the standby cluster.

The stream initialization proceeds as follows:

  1. The standby's consumer job connects via its system virtual cluster to the primary cluster and starts the primary cluster's physical stream producer job.
  2. The primary cluster chooses a timestamp at which to start the physical replication stream. Data on the primary is protected from garbage collection until it is replicated to the standby using a protected timestamp.
  3. The primary cluster returns the timestamp and a job ID for the replication job.
  4. The standby cluster retrieves a list of all nodes in the primary cluster. It uses this list to distribute work across all nodes in the standby cluster.
  5. The initial scan runs on the primary and backfills all data from the primary virtual cluster as of the starting timestamp of the replication stream.
  6. Once the initial scan is complete, the primary then begins streaming all changes from the point of the starting timestamp.
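Putting the sequence above into practice, the stream is started with a single statement from the standby cluster's system virtual cluster (a sketch; the user, host, and connection options are illustrative placeholders):

```sql
-- Run from the standby cluster's system virtual cluster.
-- The connection string points at the primary cluster's system virtual cluster.
CREATE VIRTUAL CLUSTER main
  FROM REPLICATION OF main
  ON 'postgresql://{user}@{primary_host}:26257/?options=-ccluster%3Dsystem&sslmode=verify-full';
```

This statement creates the standby's consumer job, which in turn starts the producer job on the primary.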

Diagram: two virtualized clusters, each containing a system virtual cluster and an application virtual cluster, with the directional replication stream between them.

Start-up sequence with read on standby

Note:

This feature is in preview and subject to change. To share feedback or report issues, contact Support.

New in v24.3: You can start a PCR stream with the READ VIRTUAL CLUSTER option, which allows you to perform reads on the standby's replicating virtual cluster. When this option is specified, the following additional steps occur during the PCR stream start-up sequence:

  1. The system virtual cluster on the standby also creates a readonly virtual cluster alongside the replicating virtual cluster. The readonly virtual cluster will be offline initially.
  2. After the initial scan of the primary completes, the standby's replicating virtual cluster has a complete snapshot of the latest data on the primary. The PCR job will then start the readonly virtual cluster.
  3. When the startup completes, the readonly virtual cluster is available to serve read queries. The queries read from historical data on the replicating virtual cluster. The historical time is determined by the replicated_time of the PCR job (the latest time at which the standby cluster has consistent data). The replicated_time moves forward as the PCR job continues to run.
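Assuming the same kind of illustrative connection string as in the basic start-up sequence, a PCR stream with read on standby is started by adding the option to the creation statement (a sketch):

```sql
-- Run from the standby cluster's system virtual cluster.
CREATE VIRTUAL CLUSTER main
  FROM REPLICATION OF main
  ON 'postgresql://{user}@{primary_host}:26257/?options=-ccluster%3Dsystem&sslmode=verify-full'
  WITH READ VIRTUAL CLUSTER;
```

The readonly virtual cluster created alongside the replicating virtual cluster serves the historical reads described above once startup completes.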

During the PCR stream

The replication happens at the byte level, which means that the job is unaware of databases, tables, row boundaries, and so on. However, when a failover to the standby cluster is initiated, the replication job ensures that the cluster is in a transactionally consistent state as of a certain point in time. Beyond the application data, the job will also replicate users, privileges, basic zone configuration, and schema changes.

During the job, rangefeeds periodically emit resolved timestamps, which mark the time at which the ingested data is known to be consistent. Resolved timestamps provide a guarantee that there are no new writes from before that timestamp. This allows the standby cluster to move the protected timestamp forward as the replicated timestamp advances. This information is sent to the primary cluster, which allows garbage collection to continue as the replication stream on the standby cluster advances.

Note:

If the primary cluster does not receive replicated time information from the standby after 24 hours, it cancels the replication job. This ensures that an inactive replication job will not prevent garbage collection. This window is configurable with the ALTER VIRTUAL CLUSTER virtual_cluster EXPIRATION WINDOW = duration syntax.
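For example, the expiration window could be lengthened on the primary's producer job like so (a sketch; the 48-hour value is illustrative):

```sql
-- Run from the primary cluster's system virtual cluster.
-- Extend the window before an inactive replication job is canceled.
ALTER VIRTUAL CLUSTER main EXPIRATION WINDOW = '48h';
```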

Failover and promotion process

The tracked replicated time and the advancing protected timestamp allow the replication stream to also track the retained time, a timestamp in the past indicating the lower bound to which the replication stream could fail over. Therefore, the failover window for a replication job falls between the retained time and the replicated time.

Diagram: timeline showing the failover window between the retained time and the replicated time.

Replication lag is the time between the most up-to-date replicated time and the actual time. While replication stays as current to the actual time as possible, this lag window is where there is potential for data loss.
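Both ends of the failover window can be inspected from the standby cluster's system virtual cluster (a sketch; the exact output columns may vary by version):

```sql
-- Run from the standby cluster's system virtual cluster.
-- The output includes columns such as replicated_time and retained_time.
SHOW VIRTUAL CLUSTER main WITH REPLICATION STATUS;
```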

For the failover process, the standby cluster waits until it has reached the specified failover time, which can be in the past (as early as the retained time), the LATEST timestamp, or in the future. Once that timestamp has been reached, the replication stream stops and any data in the standby cluster above the failover time is removed. Depending on how much data the standby needs to revert, this can affect the duration of RTO (recovery time objective).

Note:

When a PCR stream is started with a readonly virtual cluster, the job will delete the readonly virtual cluster automatically if a failover is initiated with a historical timestamp. If the failover is initiated with the most recent replicated time, the readonly virtual cluster will remain on the standby cluster.

After reverting any necessary data, the standby virtual cluster is promoted and becomes available to serve traffic, and the replication job ends.
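The failover and promotion steps described above map to two statements on the standby's system virtual cluster (a sketch; LATEST fails over to the most recent replicated time, and a historical timestamp could be specified instead):

```sql
-- Run from the standby cluster's system virtual cluster.
-- Stop the stream and revert to the most recent consistent timestamp.
ALTER VIRTUAL CLUSTER main COMPLETE REPLICATION TO LATEST;

-- Once the failover has finished, promote the virtual cluster to serve traffic.
ALTER VIRTUAL CLUSTER main START SERVICE SHARED;
```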

For details on failing back to the primary cluster following a failover, refer to Fail back to the primary cluster.

