Regular Expressions: tagging expressions

I re-discovered something very cool while using RegEx in SSMS today: tagging expressions and referencing them in the same line!

I’ll take a minute to define regular expressions, explain tagged expressions, and then show you how to use it in the same line.

But first, XKCD:

An XKCD comic strip where the narrator imagines heroically swooping in to perform RegEx, in order to save a life.

Regular Expressions defined

“A regular expressionregex or regexp [is]… a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for “find” or “find and replace” operations on strings, or for input validation.” – Wikipedia

In SQL Server Management Studio, we use RegEx to find, or find-and-replace, patterns of text in the code. For example, to find a line that ONLY has the characters “GO” on it, we might search for:

^GO$

The carat (^) means beginning of line, the $ means end of line, and GO is a string literal.

What are tagged expressions?

You can tag a group of expressions and later treat them as one group. So for example, if we wanted to prefix all CREATE TABLE statements in a script with a comment marker (———— tablename ———— ) on the previous line, we could do something like this:

FIND: CREATE TABLE {.+}\(
REPLACE WITH:  ------------ \1 ------------ \nCREATE TABLE \1 (

This changes the code CREATE TABLE dbo.TableB ( … to:

------------ dbo.TableB ------------
CREATE TABLE dbo.TableB ( ...

The brackets {} in the FIND expression say, “treat anything in this grouping as a single entity”. (The .+ says “one or more characters”, and the backslash in front of the parenthesis marks it as a literal character, instead of a RegEx expression).

The \1 in the REPLACE WITH expression refers back to that tagged expression ({.+}). It’s the first (and as it turns out, only) expression we tagged, so we refer to it with \1. (You can think of \1 as an automatic variable.)

If we had tagged two expressions, we’d refer to the second expression with \2. (Note that \n in the REPLACE WITH expression says to insert a line break.)

What’s this about using the tagged expression in the same line?

I had an ordered list of server names that I needed to deduplicate. Of course I could insert them to a table and dedupe using T-SQL, but it’s never a bad idea to learn an alternate method.

I paste the list into SSMS:

SQL1
SQL1
SQL1
SQL3
SQL3
SQL3
...

I then create a Regular Expression to find two consecutive lines that contain the exact same characters (server names), and replace that with a single line containing those characters (server name):

FIND: ^{.+}\n\1$
REPLACE WITH: \1

Let’s break down the FIND statement:

  • ^ means beginning of line
  • {} tags the expression contained within
  • .+ means one or more characters
  • \n means line break
  • and \1 refers to that tagged expression we LITERALLY JUST DEFINED

So in English this would read “find a group of characters that take up an entire line, followed by the exact same group of characters that take up another entire line”. Or to be more succinct (and less accurate), “Find a server name on one line, followed by that same server name”.

THAT’S REALLY COOL. I’m used to referencing a tagged expression in the REPLACE WITH field, but it’s pretty neat to tag an expression and then use it right away.

When I run this find-and-replace a handful of times, the list is deduplicated:

SQL1
SQL3
...

Perhaps I am easily pleased, but that’s not a bad thing. I heartily approve of this RegEx feature.

Bonus: A discussion on limitations

First things first: this particular example only works on an ordered list. If the list of server names were not ordered, the regular expression would not find all of the duplicates, and I’d be better off doing my “dedupe using a temp table and TSQL” method.

Second, all of the duplicate entries must be identical. That’s not a big deal in this example, but let’s say we had a list of duplicated table names? There’s more than one accurate way to represent a table name. For example:

  • dbo.Tbl1
  • dbo.[Tbl1]
  •  dbo.[Tbl1]
  • [dbo].Tbl1
  • [dbo].[Tbl1]
  • Tbl1
  • [Tbl1]

Those are all absolutely duplicates, but RegEx does NOT know that! Because ‘dbo.Tbl1’ <> ‘Tbl1’, of course. And if there are tabs or spaces before or after the table name (or server name), we have the same problem! ‘Svr1 ‘ <> ‘Svr1’. The problem is even worse if you’re on a case-sensitive database. And, this particular problem extends to our (implied) TSQL solution, too!

Third, in the last section I said, “to be more succinct (and less accurate), ‘Find a server name on one line, followed by that same server name'”. It’s less accurate to say it this way, because RegEx doesn’t know we’re talking about server names. This particular find-and-replace code just wants to grab a pair of identical lines of text, and turn them into a single line.

So yes, there are absolutely limits on this…but that’s true of any solution, to any problem. We use the tools we have on hand to get the job done, and RegEx is one beautiful jackhammer of a text manipulation tool!