My Regular Expressions Love-Hate Relationship

Category : Technical Goodies

I still remember we first met when I was in University where we got off on a rocky start without any positive feelings towards each other due to the lack of understanding and the breakdown in communication. I was literally pulling my hairs out with throbbing headaches while re maintained its stance and stared coldly back at me. The relationship was time-bound and I’m glad it didn’t last as I toss it behind me shortly after.

Lo and behold, re paid me a visit after I graduated into the workplace and we were each given a 2nd shot at each other. I’ve since grown to love re albeit with occasional teeth gritting moments just like today when I engaged re to participate in yet another data issue which was eventually resolved elegantly.

Curious about re? Let me formally introduce or, in some cases, reacquaint you with re whose full name is Regular Expressions. Being in the data management business, I’ve got to admit that Regular Expression is extremely helpful in more ways than one, from simple perl subroutines to complex data cleansing accomplished with a single line of code versus countless sql replace statements. Let me elaborate:

Goal
Concatenate multiple user ids in a single cell with a semicolon delimiter
E.g. 123;456;789

Challenge
Data source contains multiple user ids with various descriptive tags assigned to each id
E.g.  123 (IT Prof ID);456 (Gamer ID),(Generic ID) 789;

Without Regular Expression

DECLARE@String char(50)
SET @String = '123 (IT Prof ID);456 (Gamer ID),(Generic ID) 789;'

SELECT @String OriginalString, @String String INTO #Temp

UPDATE #Temp SET String = REPLACE(String,'(IT Prof ID)','')
UPDATE #Temp SET String = REPLACE(String,'(Gamer ID)','')
UPDATE #Temp SET String = REPLACE(String,'(Generic ID)','')

UPDATE #Temp SET String = REPLACE(String,',',';')
UPDATE #Temp SET String = REPLACE(String,'; ',';')
UPDATE #Temp SET String = REPLACE(String,' ;',';')
UPDATE #Temp SET String = LTRIM(RTRIM(String))
UPDATE #Temp SET String = CASE WHEN RIGHT(String,1) = ';' THEN LEFT(String,LEN(String)-1) ELSE String END

SELECT OriginalString, String /*123;456;789*/
FROM #Temp

DROP TABLE #Temp

With Regular Expression

DECLARE@String char(50)
SET @String = '123 (IT Prof ID);456 (Gamer ID),(Generic ID) 789;'

SELECT @String OriginalString, @String String INTO #Temp

SELECT OriginalString, RegularExpressionString = RegexReplace(';$','',RegexReplace(' *;+ *',';',RegexReplace('\(.+?\)|,',';',OriginalString,1,1),1,1),1,1) FROM #Temp

DROP TABLE #Temp

Notes
Using the usual REPLACE syntax may not capture all potential descriptive tags i.e. for each newly discovered tag, a new REPLACE statement have to be added to the code. Whereas the Regular Expression syntax will capture everything enclosed in ().

Regular Expression Syntax
1. ‘\(.+?\)|,’,’;’ Replace everything enclosed within () with ;
2. ‘ *;+ *’,’;’    Replace any number of ; (e.g. ;;, ;;;, ;;;;, etc) with a single ; and remove any prefix & trailing blankspace
3. ‘;$’,”           Remove ; from the end of the string if exists

Additional notes
Another useful Regular Expression is one which removes html tags:
RegexReplace(‘<(?:[^>””]*|([””]).*?\1)*>|&nbsp;|^[ \t\n\r]+|[ \t\n\r]+$’,, OriginalString,1,1)

Reference

Here’s the code behind the RegexReplace function

CREATE FUNCTION [dbo].[RegExReplace] 
(@Input [nvarchar](512), @Pattern [nvarchar](127), @Replacement [nvarchar](512))
RETURNS [nvarchar](512) WITH EXECUTE AS CALLER
AS 
EXTERNAL NAME [RegularExpressionFunctions].[RegularExpressions].[RegExReplace]
GO

And if you’ll like to start using Regular Expressions i.e. the RegexReplace function or more applications of this wonderful concept – http://www.simple-talk.com/sql/t-sql-programming/tsql-regular-expression-workbench/

Hope this helps any data analyst or business intelligence analyst out there who’s facing challenges with their data.

On a seperate note, if you prefer to flip though the pages or have a handy book at your side: