What is the difference between PostgreSQL clustering and partitioning?

Answer

PostgreSQL provides two distinct features to manage and optimize data storage: clustering and partitioning. Each serves different purposes and is used based on specific performance or maintenance needs.

Clustering

In PostgreSQL, clustering refers to the CLUSTER operation, which physically rewrites a table so that its rows are stored in the order of a chosen index. Rows that are adjacent in the index end up physically close together on disk, which can speed up queries that fetch ranges of rows in index order because fewer pages have to be read.

This form of clustering is a one-time operation. PostgreSQL does not maintain the clustered order as data is modified (inserts, updates, deletions), so periodic reclustering may be necessary depending on the workload. Keep in mind that CLUSTER takes an ACCESS EXCLUSIVE lock on the table, which blocks reads and writes while it runs.
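One way to judge whether a table has drifted far enough to be worth reclustering is the correlation statistic that the planner keeps per column. The query below is a minimal sketch: my_column is a hypothetical name standing in for the leading column of the clustering index, and the public schema is assumed.

```sql
-- Refresh planner statistics so the correlation value is current
ANALYZE my_table;

-- correlation ranges from -1 to +1; values near either end mean the
-- physical row order closely follows the column's sort order
SELECT attname, correlation
FROM pg_stats
WHERE schemaname = 'public'        -- assumed schema
  AND tablename  = 'my_table'
  AND attname    = 'my_column';    -- hypothetical indexed column
```

A value that has moved toward 0 since the last CLUSTER suggests the physical order has degraded.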

Here’s how you would cluster a table using an index:

-- Assuming you have an index called my_index on my_table
CLUSTER my_table USING my_index;

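You can also record which index a table should be clustered on, so that a later CLUSTER command run without naming an index reuses it. This only stores the preference; it does not recluster the table automatically. Running CLUSTER ... USING also records the index it was given.

-- Set the default index for clustering
ALTER TABLE my_table CLUSTER ON my_index;

-- Clear the default index again, if it is no longer wanted
ALTER TABLE my_table SET WITHOUT CLUSTER;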
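To see which index, if any, is currently recorded as a table's clustering index, the system catalogs can be queried. The following is a sketch that reuses the my_table name from the examples above and then reclusters on the recorded index.

```sql
-- Find the index marked as the clustering index for my_table
SELECT c.relname AS clustering_index
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'my_table'::regclass
  AND i.indisclustered;

-- Recluster using the recorded index (errors if none is recorded)
CLUSTER my_table;

-- Update planner statistics after the table has been rewritten
ANALYZE my_table;
```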

Partitioning

Partitioning, on the other hand, divides a large table into smaller, more manageable pieces, while still allowing the data to be queried as a single table. Queries that filter on the partition key can skip partitions that cannot contain matching rows, and maintenance becomes cheaper: for example, old data can be removed by dropping a whole partition instead of deleting rows one by one. Partitioning is especially useful for very large tables that contain historical data where queries usually focus on a subset, such as the most recent period.

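PostgreSQL's declarative partitioning supports three methods, of which range and list are the most commonly used:

Range Partitioning: Dividing the data into partitions based on ranges of values. Commonly used with dates (e.g., data from each year in a separate partition).
List Partitioning: Dividing the data into partitions where each partition holds an explicit list of key values (e.g., one partition per region).
Hash Partitioning: Dividing the data by a modulus and remainder applied to a hash of the partition key, which spreads rows evenly when no natural range or list exists (available since PostgreSQL 11).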
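As a sketch of the list method, the example below partitions a table by region. The customer table, its columns, and the region codes are all hypothetical and only serve to illustrate the syntax.

```sql
-- Hypothetical table partitioned by an explicit list of region codes
CREATE TABLE customer (
    customer_id int  NOT NULL,
    region      text NOT NULL,
    name        text
) PARTITION BY LIST (region);

CREATE TABLE customer_emea PARTITION OF customer
    FOR VALUES IN ('EU', 'UK', 'MEA');
CREATE TABLE customer_amer PARTITION OF customer
    FOR VALUES IN ('US', 'CA', 'LATAM');

-- Catch-all for region values not listed above (PostgreSQL 11+)
CREATE TABLE customer_other PARTITION OF customer DEFAULT;
```

Without the default partition, inserting a row whose region matches no partition would fail with an error.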

Here's a simple example of range partitioning:

CREATE TABLE measurement (
    city_id int not null,
    logdate date not null,
    peaktemp int,
    unitsales int
) PARTITION BY RANGE (logdate);

-- Creating a partition for each year
CREATE TABLE measurement_y2020 PARTITION OF measurement
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
CREATE TABLE measurement_y2021 PARTITION OF measurement
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
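With the measurement table and its two yearly partitions in place, the following sketch shows the two benefits described above: partition pruning for queries that filter on logdate, and cheap removal of old data. The date range in the query is chosen only for illustration.

```sql
-- Inspect the plan: with pruning, only measurement_y2021 should be scanned
EXPLAIN
SELECT city_id, peaktemp
FROM measurement
WHERE logdate >= DATE '2021-03-01'
  AND logdate <  DATE '2021-04-01';

-- Retire a full year of data without a bulk DELETE:
-- detach the partition from the parent, then drop it
ALTER TABLE measurement DETACH PARTITION measurement_y2020;
DROP TABLE measurement_y2020;
```

Detaching first leaves measurement_y2020 as a standalone table, so it can be archived or inspected before the DROP if needed.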

Conclusion

Clustering and partitioning in PostgreSQL serve different optimization purposes: clustering optimizes physical row order based on an index to boost read performance, whereas partitioning splits a table into parts based on logical rules, improving manageability and query performance over subsets of data. The choice between them depends on specific use cases and the nature of the data handled.
