
T-SQL Tuesday #170 – Learning the Hard Way

This month’s invitation from Reitse asks us to talk about learnings from abandoned or failed projects. This one sits somewhere between the two. It was an opportunity to learn about scalability. Let’s set the scene.

A point of sale system was being rolled out across hundreds of physical locations, with transaction data collected each night to be batch processed into a warehouse for the usual types of analysis. Our internal integration preference was SSIS, and a solution was deployed in preparation.

Roll-out of the new system started with a handful of locations, which steadily increased as confidence grew. On the back of this, the data hitting our solution was increasing too. With a trickle of data early on there were no issues, as expected. A small volume of data from a small number of stores. The process flew. We left it doing its thing.

Fast forward a couple of months and we’d surpassed 100 locations. With a larger sample, calculations were done to extrapolate execution time at full deployment, and estimates were upwards of a full day. The original estimate was to have it wrapped up within the hour. How could it have changed so drastically?

To cut the story short, we were being sent months of transactions from every site, every day. How come? It turned out the solution provider couldn’t identify new or changed records, so we had to take it all. That was out of our control, but we needed to deal with it.

The solution simply wasn’t scalable enough to handle this change.

Throughput had been estimated against expected volumes. Nobody had factored in receiving months of data every day, or having to identify records which hadn’t even changed – daily. We hadn’t considered what would happen if we doubled the size of our organisation. We hadn’t extrapolated the impact of doubling transaction throughput during peak periods.
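
To give a sense of the kind of work this forced on us, one way to filter out resent records that haven’t actually changed is a row-hash comparison during the load. A minimal sketch, assuming hypothetical staging and warehouse tables and SQL Server 2017+ for CONCAT_WS:

-- Minimal sketch with hypothetical table and column names.
-- Hash the incoming row and compare it against the stored hash so only
-- new or genuinely changed transactions continue into the warehouse load.
WITH incoming AS (
    SELECT  TransactionID,
            StoreID,
            TransactionDate,
            Amount,
            HASHBYTES('SHA2_256',
                      CONCAT_WS('|', StoreID, TransactionDate, Amount)) AS RowHash
    FROM    stg.SalesTransaction
)
SELECT  i.TransactionID, i.StoreID, i.TransactionDate, i.Amount, i.RowHash
FROM    incoming AS i
LEFT JOIN dw.SalesTransaction AS t
       ON t.TransactionID = i.TransactionID
WHERE   t.TransactionID IS NULL        -- brand new transaction
   OR   t.RowHash <> i.RowHash;        -- existing transaction that changed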

There was no plan for scalability. That was on us.

What followed was months of refactoring and tuning the solution to get ahead of the deployment. Once complete, everything was reined back down to a couple of hours. In the process the vast majority of the solution had been rewritten – at least once.

So enough about a miserable few months of my life, what have we learned here?

Do not leave scalability to an afterthought.


Some solutions clearly need to be more scalable than others. An on-premises HR system for a small business isn’t going to see an influx of adoption. What if you operate a SaaS HR solution across many small businesses though? Sales or promotions could very quickly draw in new and unexpected business. That’s what we need to consider.

What options do we have to scale our solutions? A few come to mind for scaling upwards and outwards which are worth considering with SQL Server:

First up we have vertical scalability as the brute-force approach: throwing more resources at the issue. This is usually the more straightforward option, particularly if you’re using a cloud environment. It can be a good consideration for existing solutions where refactoring may be complex or time prohibitive. In terms of designing a new solution there’s little complexity to consider with this approach.
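
To illustrate how little friction this can involve in the cloud, Azure SQL Database (as one example) lets you change the service objective with a single statement. The database name and target tier below are placeholders:

-- Scale an Azure SQL Database up to a larger service objective.
-- The database name and tier are placeholders for illustration only.
ALTER DATABASE [SalesWarehouse]
MODIFY (SERVICE_OBJECTIVE = 'P2');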

Another approach would be to scale the workload horizontally so it runs in parallel. This could be running multi-threaded on a single instance to better utilise resources, or running in parallel across a number of instances to spread the load. The solution will need to be aware of and manage load and concurrency to work effectively. This is easier to build into new developments; established solutions may take significantly more effort to refactor depending on their existing architecture.
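
One common way to manage that concurrency in SQL Server is a work-queue table that parallel workers claim batches from, using UPDLOCK and READPAST hints so two threads never pick up the same work. A rough sketch with hypothetical object names:

-- Hypothetical work-queue pattern: each worker claims one pending batch.
-- UPDLOCK holds the claimed row while READPAST lets other workers skip it
-- rather than block, so several threads can drain the queue in parallel.
DECLARE @BatchID int;

UPDATE TOP (1) q
SET     q.Status    = 'Processing',
        q.ClaimedAt = SYSUTCDATETIME(),
        @BatchID    = q.BatchID
FROM    dbo.LoadQueue AS q WITH (UPDLOCK, READPAST, ROWLOCK)
WHERE   q.Status = 'Pending';

IF @BatchID IS NOT NULL
    EXEC dbo.LoadStoreBatch @BatchID = @BatchID;   -- hypothetical loader procedure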

We could also consider improving the efficiency of access, such as with read-only replicas, or by using a data cache to reduce or avoid recurring requests. Removing that extra workload from an environment can provide headroom for more critical processes. As with vertical scalability, the way these are implemented may be separate to the solution we’re building, so they could be bolted on at a later point.
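
The replica side of this can often live outside the solution entirely, for instance pointing reporting connections at a readable secondary with ApplicationIntent=ReadOnly in the connection string. The cache side can be as simple as a pre-aggregated summary table refreshed out of hours; a rough sketch with hypothetical names:

-- Hypothetical pre-aggregated cache table, refreshed on a schedule so
-- dashboards hit this small summary rather than re-scanning raw transactions.
TRUNCATE TABLE dbo.DailySalesSummary;

INSERT INTO dbo.DailySalesSummary (StoreID, SalesDate, TransactionCount, TotalAmount)
SELECT  StoreID,
        CAST(TransactionDate AS date),
        COUNT(*),
        SUM(Amount)
FROM    dw.SalesTransaction
GROUP BY StoreID, CAST(TransactionDate AS date);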

A final option to consider is sharding: splitting data up into different sets which can reside in different locations, for example by application feature or geographic location. Having the data split across different environments allows queries to either be isolated to a single environment or distributed across several. An application accessing this data will need to be aware of where each shard resides, so this approach is best suited to new developments which can be designed appropriately.
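
On the database side, that shard-awareness can start with something as small as a shard map the application consults before connecting. A minimal sketch, with hypothetical server and database names:

-- Hypothetical shard map: the application looks up a store's shard here
-- before connecting, so each store's data lives on exactly one server.
CREATE TABLE dbo.ShardMap
(
    StoreID       int            NOT NULL PRIMARY KEY,
    ShardServer   nvarchar(128)  NOT NULL,
    ShardDatabase nvarchar(128)  NOT NULL
);

INSERT INTO dbo.ShardMap (StoreID, ShardServer, ShardDatabase)
VALUES (1001, N'sql-uk-01', N'SalesUK'),   -- e.g. UK stores on one server
       (2001, N'sql-us-01', N'SalesUS');   -- US stores on another

-- Which environment holds store 2001's data?
SELECT  ShardServer, ShardDatabase
FROM    dbo.ShardMap
WHERE   StoreID = 2001;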


We’re going a little long here so let’s wrap things up.

The solution in our example was horizontal scaling through multi-threading. Running on-premises with licensing restrictions and a relatively narrow application stack, it was the sensible choice. We could push through more throughput by processing multiple data sets in parallel to better utilise the hardware.

There are no right or wrong, better or worse options here. They all work and have their benefits and limitations. Some may have more or less cost, complexity, or time commitment associated with them. The takeaway here is to be aware of them and to design new solutions so that if – and when – they need to scale, you’ll understand how that can be done.

So, what’s your scalability plan? How will your next project scale when its usage explodes?
