Categories
SQL

Using Regular Expressions in SQL Server

After being around SQL Server for the last 15 years I was surprised recently to see an example of Regular Expression (RegEx) syntax being used within a SQL query and knew it was something I wanted to try out for myself.

What are Regular Expressions?

If you haven’t head of them before, Regular Expressions are used to define a string pattern which can be used for searching text or in some tools as a method of fine and replace. For example in SQL we might search a list of names using LIKE 'And%' so we could find Andy or Andrew, however it may also contain Andrea as these all start with the same string ‘And’. Using Regular Expressions we can get much more fine grained control, so we might use a search string below:

^And(y|rew)$

By using this we’re asserting that at the start of the string we have the letters And as with our LIKE statement, however using the expression we can ensure the end of the string is either y or the sequence rew so we’d only return the two names which we’re looking for.

Lets try them out

Firstly we’ll create a table with a few names and see what we can do in SQL Server. You’ll see in here I’ve included examples with new line characters and spaces as these can be particularly useful areas to pick out with regular expressions:

CREATE TABLE #Regex (
	Username VARCHAR(100)
);

INSERT INTO #Regex (Username)
VALUES ('Andy'), ('Andrew'), ('Andrea'),
	('Andelyn'), ('Dave'), ('Billy'), ('NewLine
	'), (' PreSpace'), ('Mid Space');

In this list we have 4 names which start with ‘And’ so lets see if we can filter those out to the ones we’re after. As a starting point lets try out with the example from above and see how we get on:

SELECT *
FROM #Regex
WHERE Username LIKE 'And%';

SELECT *
FROM #Regex
WHERE Username LIKE 'And(y|rew)';

Well, that’s a little disappointing, we get no results returned from the second query at all.

What’s up with that?

So here’s the thing, if we take the LIKE operator and look at the documentation for it, it provides some details of what we can use within the pattern portion of the wildcard string. It turns out that its only a very small portion of the RegEx syntax which is supported. We can really only define a generic string of characters or specify a particular character.

Adjusting our query slightly we can make sure that the 4th character is either a Y or an R, and then allow any other sequence after. This would look something like:

SELECT *
FROM #Regex
WHERE Username LIKE 'And[yr]%';

This example still pulls up 3 of the 4 names so it isn’t what we’re looking for. Really there isn’t a prettier way of filtering out Andy and Andrew without explicitly putting both of those names into separate Where clauses for this query.

Lets take another look at what we might be able to achieve with the existing functionality. Another element of the regular expressions which is still usable in this way is to check for the length of a string which has characters in specific positions. Similar to the previous example we’ll look for any name which begins with ‘And’ and is exactly 6 characters long. This can be achieved with the following:

/* Using more typical SQL syntax */
SELECT *
FROM #Regex
WHERE Username LIKE 'And%'
	AND LEN(Username) = 6;

/* Now with Regex */
SELECT *
FROM #Regex
WHERE Username LIKE 'And___';

Although we can target only individual characters, we can use the underscore here to specify any single character so it fixes the length of the search and we can roll the two clauses into one, not a bad use case.

When we set up our example table earlier I’d mentioned about regular expressions being particularly useful for filtering characters we wanted to exclude from results. As we can’t define multiple character sequences for matching this isn’t completely possible within the built in functionality however we can still use it to highlight at least some of the erroneous data. Below is an example where we make sure that the start and end of the text begins with a character – note that we’re using a case insensitive database for this so we only need to specify lower case characters – with anything else in between:

SELECT *
FROM #Regex
WHERE Username NOT LIKE '[a-z]%[a-z]'

In this instance we can’t control what’s in the middle of the string so the ‘Mid Space’ name wouldn’t get picked up, however the name which had a preceding space as well as the one which ended in a new line character would get returned by this filter. This could be useful when importing user entered data sources for example where some fields may have been free-typed.

Wrap Up

All together this was a bit of a disappointing dive into the Regular Expression functionality available out of the box in SQL Server. It clearly isn’t designed to be anywhere near as encompassing as full blown support but nevertheless there’s still some use cases where it could come in handy.

Personally I’d love to see full support for regular expressions in the query engine – sure, they’d likely be equally if not less SARGable than existing functionality but on the other hand they’re very specific in the patterns they’re matching which can make their performance much better than workarounds such as splitting strings and comparing individual characters and positions.

One reply on “Using Regular Expressions in SQL Server”

Leave a comment