As we’ve written about previously, geographically distributed databases like CockroachDB offer a number of benefits including reliability, security, and cost-effective deployments. We believe you shouldn’t have to sacrifice these upsides to realize impressive throughput and low latencies. That’s why we created geo-partitioning.
This blog post defines two new features, geo-partitioning and archival-partitioning, as well as explains when you might want to leverage these features. We previously provided a sneak-peak walkthrough of geo-partitioning that can be found here.
Geo-partitioning allows you to keep user data close to the user, which reduces the distance that the data needs to travel, thereby reducing latency and improving user experience. To geo-partition a table, you should define location-based partitions while creating a table, create location-specific zone configurations, and apply the zone configurations to the corresponding partitions.
Archival-partitioning allows you to store infrequently-accessed data on slower and cheaper storage. To archival-partition a table, you should define frequency-based partitions while creating a table, creating frequency-specific zone configurations with appropriate storage devices constraints, and applying the zone configurations to the corresponding partitions.
For a more detailed description of both geo-partitioning and archival-partitioning please consult our documentation.
Companies increasingly insist on low latencies to minimize delays which cause a direct decline in revenue. For example, 100ms of latency costs Amazon 1% in sales and an extra 500 ms in search page generation time dropped Google’s traffic by 20%.
With geo-partitioning, developers can use what they know about a user’s location to keep data close to them while taking advantage of CockroachDB’s durability guarantees of consistency in the face of machine (or even datacenter) failure. Keeping data close to users reduces the distance data needs to travel, thereby reducing latency and improving end-user experience.
Governments continue to introduce new and increasingly tough regulations that put significant financial and organizational pressure on companies. Regulations like the EU GDPR and China’s Cyber Security Law impose restrictions on where data can reside, which undermine some of the core operational and economic benefits of the cloud.
Geo-partitioning restricts the location of data to specific regions (i.e., data-domicile), such as EU countries. Regulations, in some cases (e.g., China), directly require data domiciling. In other cases (e.g., GDPR) restrictions on data location aim to provide data privacy as a fundamental right and allowing companies greater control for managing user data.
Keeping all corporate data in the same storage class is costly and inefficient. This is especially pernicious when the benefits of a solid state drives are not needed for data that’s infrequently accessed or only being preserved for compliance purposes.
By using archival-partitioning data based on storage time, developers can use CockroachDB to move certain types of data to colder, slower, and cheaper storage.
At this point, you may be wondering, hasn’t some form of partitioning always existed in other databases? How is CRDB different?
We built CockroachDB from the ground up as a distributed database. This means that some of the partitioning use cases you may be familiar with in other DBMSs are solved by superior features within CockroachDB. CockroachDB keeps your data in 64MB ranges that the system constantly keeps balanced in response to machine downtime and changes in load. Instead of manually sharding your data into partitions when it's too big or too high traffic to fit on one machine, we transparently (and automatically) shard all data (while allowing you to retain control over where it lives to maintain optimum performance). In the same vein, a simple replication zone metadata change moves data from one machine to another (such as when moving data from high-end storage tier to low-cost storage tier).
Other databases often use partitioning for performance tuning. "Partition pruning" will avoid sending requests to machines containing data that couldn't be in a partition (SELECT … FROM table WHERE date > '2018-01-01'). Our SQL planner does this for every query, whether the tables are partitioned or not, only contacting the relevant 64MB ranges.
We only contact the shards that are relevant for each query, addressing the smallest amount of potentially relevant data. Other databases use partitioning to exclude ranges of data that are not germane to the query in question; if a partition is a-b and the query >c, it will avoid the a-b.
Partitioning can also be used to complete rolling upgrades or schema changes. By contrast, CockroachDB doesn’t need to use partitioning for this feature because we already have online schema changes. Online schema changes allow DBAs to run database changes without interrupting mission critical workloads.
Finally, some databases use partitioning for bulk loading to avoid interrupting business as usual. CockroachDB doesn’t need to use partitioning for bulk import because CRDB’s architecture and automatic replication make it easy to bulk import data using the Import
statement.
To get a better understanding of how geo-partitioning works, you should check out the "Geo Replication" section of this blog post (though the whole thing is worth a read).
We don’t yet support using partitioning to import data to or bulk delete data from an existing table. We view these as logical enhancements of our current work and we plan to support them in the future.
We also aspire to provide a set of features aimed at making regulatory compliance even easier for our users to follow. This includes allowing you to query only from partitioned tables based upon the regulatory conditions faced by their business.
We built CockroachDB as a high-performance distributed database. We hope that by using features like geo-partitioning and archival-partitioning we will continue to “make data easy” for you.
To learn more and try it out yourself, click here.
Illustration by Lea Heinrich
\* *This blog from 2018 does not represent the most recent strategies for pinning data to locations at the row …
Read more
The Baidu DBA team has been an active part of our community for quite some time. Starting in 2015 as open source …
Read more```
Geographically distributed databases like CockroachDB offer a number of benefits including reliability, …
Read more