In SQL Server you can use partitioning to split a table into multiple segments based on the data contained within it. This feature can be used to bring improvements to performance and provide more flexibility with maintenance for those tables which are partitioned. The impact of these will be key for large data sets where resource intensive operations may start restricting your ability to utilise or maintain that volume of data.
In this post I’m looking to cover the basics of what partitioning is, what we need to consider beforehand, and what benefits we’ll be able to see as a result. In future posts we’ll look at how to implement partitioning on a table and providing some examples of the use cases we discuss here.
What is table partitioning
As we’ve already touched on, partitioning is the splitting of a table into multiple segments. You could think of this as having multiple identical table structures each storing a particular range of data. This is all logically separated behind the scenes and what we see when browsing the database is a single entity to query, it can be very transparent.
Splitting the data is achieved by selecting a field within our data as the ‘partitioning column’. We then split the data table based on the values within this column and all values between each boundary are contained together.
The example we had above where data is partitioned based on quarterly sales may be applied to a DATE field and we could choose to have it split on 1st January, 1st April, 1st July, etc. each year. Each quarter of data would be in its own space but they all still collectively make up the same whole table.
By splitting the data in this way we can potentially target queries, operations or maintenance at a specific partition without needing to touch any other part of the same table. We’ll look at those benefits shortly.
How do we partition data
When we come to plan partitioning we need to start with the selection of the column to split our data – the partitioning column. We can use almost any data type we would use as a key for an index as this column, including computed persisted fields if we have them.
That said, with the partitioning we’re trying to chunk the data into manageable sizes and which won’t forever grow uncontrollably like a non-partitioned table can. This means that we usually end up with a date – or a numerical representation of one – as our partitioning column.
Once we’ve decided on our partitioning column we the need to decide what boundaries we want to set for our partitions. In the case of a date being used, this could be aligned to weeks, months, quarters, years, or any date range which makes sense within a given organisation. One of the key drivers for this should be looking at the data volumes which may be in each partition. If there are only 10,000 records in a partition that would likely be too granular. Conversely if a partition contained 100 million records that may be too large for a smaller environment.
These are the two initial choices we need to make when considering partitioning. The right result for both of these will vary based on your own data set and environment. With these in hand we can look at the practical steps to implement this which we’ll cover in a future post.
What are the benefits
Now that we’ve started to think about how we’re going to carve up our data let’s consider where we can see the benefits if we proceeded with the partitioning.
The first area of benefit we can see is when querying the data and through automatic partition elimination. If a value we’ve searched for is specific enough to only reside in one (or a small range) of partitions then the database engine could use the metadata it has for the partitions to completely eliminate need to check some partitions.
The engine knows that those partitions can’t contain any data since there are fixed boundaries we’ve defined on the partitions. This in turn leads to less IO retrieving those partitions and lower memory usage as the objects don’t need to be in memory.
Next up we have the benefits to maintenance routines. With the tables being split into multiple pieces this provides an opportunity to perform maintenance on just a portion of the data by targeting a specific partition or range of them.
For example if you had sales data which is partitioned by quarter you may choose to rebuild the partition covering the current quarter as part of a maintenance routine. Similarly you could choose to apply compression to partitions containing data over 12 months old to reduce storage for infrequently accessed data.
Another aspect of partitioning worth considering is that it will allow you to put each partition into a separate file group to aid with maintenance. You can then separate your data for higher throughput (parallel IO) or to tier your storage, such as older data on slower disks. Through the use of separate file group you you can also perform partial backups or restores of the data in a table if its partitions are spread over multiple file groups.
All of that said, one of my favorite features of partitioning is the ability to switch partitions in or our of a set of data. This can improve the performance in areas such as loading or archiving data where you may need to shift a chunk of data into or out of a table for a specific time period. This is fun to see in action.
Wrap up
Here we’ve had a brief introduction to partitioning of our tables. We’ve covered what it is and means for our data, looked at what we need to do in preparation by deciding what to partition and how, and then covered a few ways in which it may help us to consider where we may want to use the feature.
Next time we’ll follow up with an example of implementing this on a table and seeing how they differ (or not) compared to using a regular non-partitioned table.