Categories
SQL

Calculating String Lengths

We store a wide variety of data within our platforms, some of which will inevitably be string based. Along with that comes a need to perform manipulation of those strings. When slicing, combining, or manipulating strings one element which we need to consider is the length of the string.

Here I wanted to have a look at two functions for retrieving the length of a string and show considerations when using those approaches, as well as how they compare to each other.

LEN function

The go-to when grabbing the length of a string is using the LEN function. This will typically provide us with the length of a string which we have, for example:

Simple LEN function example

That’s it, done. Right?

Well, not exactly. You see there’s this thing with LEN which I forget about and comes back to bite me now and then. Let’s try the same test again but with a couple of spaces added either side:

LEN function with test string having spaces either side

I’ve added 2 spaces at the start and another 2 at the end. How come the length has only changed from 11 to 13 and not 15? There’s a very specific point in the documentation which covers it:

Returns the number of characters of the specified string expression, excluding trailing spaces.

I don’t know why this only excludes trailing spaces and not leading ones, that feels a little odd. But it does, so what option could we use to avoid that trap?

DATALENGTH function

Another function we have to check the length of a string is the DATALENGTH function which you may have seen me use in last week’s post for recursive functions since we were dealing with trailing spaces.

So how does this handle leading and trailing spaces compared to the LEN function:

Using DATALENGTH function to calculate string length with spaces

That’s much better for us. The DATELENGTH function doesn’t trim any spaces from our string so we can get an accurate length. Well, sort of.

If we try to do the same again but this time we use a Unicode string we’ll get something a little different:

Using DATALENGTH function on Unicode string returning number of bytes

Again the online documentation tells us everything we need to know about this function:

This function returns the number of bytes used to represent any expression.

This is the reason I don’t see much call for the DATALENGTH function as it’s actually telling us the length of the string in terms of the number of bytes it’s taking to store the value. We can also see that when we’re using a fixed length string:

Using DATALENGTH function on fixed length string returning total number of characters

What about LEN function

Looping back to the LEN function briefly here I thought it was worth showing how that function operates when we’re using Unicode and fixed length data types:

Using LEN function on Unicode and fixed length data types

We still have the challenge of the function trimming the trailing spaces from the strings as we saw earlier, however we have results more consistent with what we’re expecting for a string length as opposed to how the DATALENGTH function treats these strings.

So is there an answer to get the correct length for any and all strings?

Comparison

Unfortunately there really isn’t a one size fits all option here. We’ve got 2 very different functions which can give us exactly what we’re after under specific conditions.

We have the LEN option which gives us consistent and broadly expected lengths for our strings, however we need to be mindful not to get tripped up with trialing spaces being removed.

The DATALENGTH function doesn’t trim any characters from our string however it returns for us the number of bytes needed to store the string. This means we can get accurate lengths for our VARCHAR string, however if we’re using Unicode or fixed length strings the function isn’t appropriate for a character length.

There are options to get the functions working in our favour – for example you could replace spaces with another character before using the LEN function but that will get a little bit of string manipulation very bloated very fast.

My personal preference with strings is to stick with the LEN function under most conditions, with a But. That But is that I like to TRIM strings before storing them in the database to avoid any ambiguity in the data. It keeps the data cleaner and more consistent so unless there’s a really good reason not to, I prefer TRIMmed.

TL;DR

As we’ve got a conclusion above I thought I’d summarise the points made here to wrap up with:

  • Length functions are commonly used in SQL Server, we have a couple of options
  • LEN function can catch you out by trimming trailing spaces
  • DATALENGTH function will give an accurate VARCHAR length with trailing spaces
  • DATALENGTH provides byte counts for CHAR, NCHAR and NVARCHAR types
  • LEN function provides accurate results for CHAR, NCHAR and NVARCHAR, excluding trailing spaces
  • Could try to manipulate functions such as REPLACE(@MyString, ' ', '_') but makes code very bloated
  • Try to avoid data with leading / trailing spaces such as using TRIM before storing them

Leave a comment