Watch the webinar
CockroachDB stores all user data (tables, indexes, etc.) and almost all system data in a giant sorted map of key-value pairs. This keyspace is divided into "ranges", contiguous chunks of the keyspace, so that every key can always be found in a single range.
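To make that lookup concrete, here is a minimal Go sketch of finding the single range that contains a key. The Range type and findRange helper are illustrative stand-ins, not CockroachDB's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// Range is a toy stand-in for a contiguous chunk of the keyspace.
// Real ranges carry much more state; this keeps only the start key
// (inclusive) and end key (exclusive).
type Range struct {
	StartKey, EndKey string
}

// findRange returns the single range whose [StartKey, EndKey) span
// contains key. ranges must be sorted by StartKey and cover the key.
func findRange(ranges []Range, key string) Range {
	// Binary search for the first range whose end key is beyond key.
	i := sort.Search(len(ranges), func(i int) bool {
		return key < ranges[i].EndKey
	})
	return ranges[i]
}

func main() {
	ranges := []Range{
		{StartKey: "", EndKey: "carl"},
		{StartKey: "carl", EndKey: "mango"},
		{StartKey: "mango", EndKey: "\xff"},
	}
	fmt.Println(findRange(ranges, "banana")) // falls in the first range
	fmt.Println(findRange(ranges, "pear"))   // falls in the last range
}
```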
From a SQL perspective, a table and its secondary indexes initially map to a single range, where each key-value pair in the range represents a single row in the table (also called the primary index because the table is sorted by the primary key) or a single row in a secondary index. As soon as that range reaches 64 MiB in size, it splits into two ranges. This process continues for these new ranges as the table and its indexes continue growing.
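As a rough illustration (not CockroachDB's real key encoding), the sketch below shows how a row and a secondary-index entry might land in the same sorted keyspace, and how a split can be modeled as cutting one key span into two once it passes the 64 MiB threshold. The table ID, index IDs, and splitAt helper are hypothetical:

```go
package main

import "fmt"

// Toy key encodings: every table row and every secondary-index entry
// becomes a key-value pair in one sorted keyspace. The real encoding
// is binary and more compact.
func primaryIndexKey(tableID int, pk string) string {
	// Index ID 1 is the primary index; the row is stored under its primary key.
	return fmt.Sprintf("/Table/%d/1/%s", tableID, pk)
}

func secondaryIndexKey(tableID, indexID int, indexedValue, pk string) string {
	// A secondary-index entry points back to the row via the primary-key suffix.
	return fmt.Sprintf("/Table/%d/%d/%s/%s", tableID, indexID, indexedValue, pk)
}

// splitThreshold mirrors the 64 MiB split size described above.
const splitThreshold = 64 << 20

// splitAt models a range split: one contiguous span becomes two
// contiguous spans that meet at splitKey.
func splitAt(startKey, endKey, splitKey string) (left, right [2]string) {
	return [2]string{startKey, splitKey}, [2]string{splitKey, endKey}
}

func main() {
	fmt.Println(primaryIndexKey(53, "bob@example.com"))
	fmt.Println(secondaryIndexKey(53, 2, "Bob", "bob@example.com"))
	fmt.Println("split threshold in bytes:", splitThreshold)

	left, right := splitAt("/Table/53", "/Table/54", "/Table/53/1/m")
	fmt.Println("after split:", left, right)
}
```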
Raft is a consensus protocol: an algorithm that makes sure your data is safely stored on multiple machines, and that those machines agree on the current state even if some of them are temporarily disconnected.
Raft organizes all nodes that contain a replica of a range into a group, unsurprisingly called a Raft group. Each replica in a Raft group is either a "leader" or a "follower". The leader, which is elected by Raft and long-lived, coordinates all writes to the Raft group. It heartbeats followers periodically and keeps their logs replicated. In the absence of heartbeats, followers become candidates after randomized election timeouts and proceed to hold new leader elections.
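The Go sketch below models just that heartbeat/election-timeout loop; real Raft also tracks terms, votes, and log state, and the timings here are made up for illustration:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// A toy follower loop: reset a randomized election timer whenever a
// heartbeat arrives; if the timer fires first, become a candidate and
// start an election.
func runFollower(heartbeats <-chan struct{}) {
	for {
		// Randomized timeout so followers don't all stand for election at once.
		timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
		select {
		case <-heartbeats:
			// Leader is alive; stay a follower and rearm the timer.
		case <-time.After(timeout):
			fmt.Println("no heartbeat: becoming candidate and requesting votes")
			return
		}
	}
}

func main() {
	heartbeats := make(chan struct{})

	// A healthy leader heartbeating every 50 ms for a short while.
	go func() {
		for i := 0; i < 5; i++ {
			heartbeats <- struct{}{}
			time.Sleep(50 * time.Millisecond)
		}
		// The leader "disconnects": heartbeats stop and the follower times out.
	}()

	runFollower(heartbeats)
}
```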
Once a node receives a BatchRequest for a range it contains, it converts those KV operations into Raft commands. Those commands are proposed to the Raft group leader (which is why it is ideal for the leaseholder and the Raft leader to be one and the same) and written to the Raft log.
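Here is a rough Go sketch of that hand-off, using toy types (KVOp, RaftCommand, Leader) rather than CockroachDB's real ones: the batch's KV operations become a command that is appended to the leader's log as a proposal.

```go
package main

import "fmt"

// KVOp is a single key-value operation carried by a batch.
type KVOp struct {
	Key, Value string
}

// RaftCommand bundles the batch's operations into one log entry.
type RaftCommand struct {
	Ops []KVOp
}

// Leader holds the (toy) Raft log for a single range.
type Leader struct {
	log []RaftCommand
}

// Propose appends the command to the leader's Raft log; real Raft would
// then replicate the entry to followers before it can commit.
func (l *Leader) Propose(cmd RaftCommand) int {
	l.log = append(l.log, cmd)
	return len(l.log) - 1 // log index of the new entry
}

func main() {
	batch := []KVOp{
		{Key: "/Table/53/1/bob@example.com", Value: "Bob"},
		{Key: "/Table/53/2/Bob/bob@example.com", Value: ""},
	}
	leader := &Leader{}
	idx := leader.Propose(RaftCommand{Ops: batch})
	fmt.Println("proposed at log index", idx)
}
```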
For a great overview of Raft, we recommend The Secret Lives of Data.
For each range, one of the replicas is the "leader" for write requests. Via the Raft consensus protocol, this replica ensures that a majority of replicas (the leader and enough followers) agree, based on their Raft logs, before committing the write. The Raft leader is almost always the same replica as the leaseholder. This colocation has a tremendous impact on write latency, which we demo in the webinar.
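The commit rule itself is just majority arithmetic. A small sketch, with hypothetical quorum and canCommit helpers:

```go
package main

import "fmt"

// quorum is the number of replicas that must have an entry in their Raft
// log before the leader may commit it: a strict majority of the group.
func quorum(replicas int) int {
	return replicas/2 + 1
}

// canCommit reports whether an entry acknowledged by `acks` replicas
// (counting the leader itself) has reached a majority and can commit.
func canCommit(acks, replicas int) bool {
	return acks >= quorum(replicas)
}

func main() {
	// With 3 replicas, the leader plus one follower is enough.
	fmt.Println(quorum(3), canCommit(2, 3)) // 2 true
	// With 5 replicas, the write can tolerate two slow or disconnected nodes.
	fmt.Println(quorum(5), canCommit(2, 5)) // 3 false
	fmt.Println(quorum(5), canCommit(3, 5)) // 3 true
}
```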
When your clients are in a single geographic region, choosing a topology is straightforward.
When your clients are in multiple geographic regions, it is important to deploy your cluster across regions properly and then carefully choose the right topology for each of your tables. Not doing so can result in unexpected latency and reduced resiliency. There are a number of different multi-region topologies to choose from, depending on your table's requirements. The options are listed below; details for each are available in our docs and are discussed in the webinar.
Additional Resources