Help Centre

Regex

About the Regex operation

The Regex operation allows you to find data matching a certain pattern which can then be extracted or changed. For example:

Patterns are defined using a language known as 'Regular Expressions', commonly called Regexes.

Regex tutorial

Simpler ways to achieve the same result
We're always trying to make it easier to configure a pipeline. Since we made the Regex video tutorial above, we've introduced simpler ways to do some of the examples shown.
  • Substitute ‘Rd’ for ‘Road’
    You can now use the Substitute function in Calculate
  • Change a four-digit zip code to a five-digit zip code
    In Calculate, you can now use the expression:
    RIGHT(CONCATENATE("00000", 'Column-name'), 5)
  • Change a date format
    Use the Date Cleanse and Date Output operations

Workspace

Column

Select the source column to search.

Regex

Enter the search term or regex pattern.

Output column

Either name a new output column, or select the source column to overwrite it. If no column name is entered, an unnamed column will be created.

Configure output

To remove regex matches, select ‘Remove matched value’.
To replace regex matches select ‘Advanced’ and enter a text string or a result pattern
(e.g. $1, $2, $3).

Search patterns

Enter the search pattern in the Regex field. This could be a specific sequence of text or numbers, such as the word ‘Rd’, though this is usually better dealt with using the Substitute function in Calculate.

The Regex operation is best used to search for patterns, using the Regular Expressions syntax. For example:

See the Syntax reference section for details and examples of the most commonly used Regex syntax.

Testing regexes
Rapidly develop your syntax by testing on a regex utility website, before using it in a pipeline.
The site regex101.com allows you to enter a regex, then test it against a few sample values.
The site Java Regular Expressions Tester also allows you to test a regex, including the output group replacement functionality.

Result patterns

Once a pattern has been matched, it can be either be removed, replaced, modified or repeated. The result can overwrite the source column or can be written to a new column. The $ symbol is used to reference groups within the regex pattern.

Examples

1. Extract any numbers at the start of a string

We want to find the building number from a column of addresses and extract it to a new column. We can do this by writing a regex to identify any sequence of numbers at the start of an address.

  1. Select the source column
    Address
    .
  2. Enter a regex pattern:
    ^(\d+).+

    ^
    anchors to the start of the string.
    (\d+)
    looks for a sequence of digits of any length.
    .+
    indicates that the digits should be followed by any number of characters.

  3. Output to a new column called
    Building No.
  4. Select ‘Advanced’ and enter the result pattern
    $1
    to write the matched patterns to the new column.


2. Remove non-alphanumeric characters

We want to strip out non-alphanumeric characters, such as punctuation or symbols, to help standardise a list of company names. This regex looks any characters that are not words, whitespaces or digits and removes them.

  1. Select the source column
    Column 1
    .
  2. Enter a regex pattern:
    [^\w\s\d]

    [ ]
    matches characters not contained within the brackets
    \w
    matches any word character
    \s
    matches any whitespace character
    \d
    matches any digit
  3. Select the output column
    Column 1
    to overwrite the original column.
  4. Select ‘Remove matched value’.

3. Combine initials

In a list of company names, where initials are separated by a space, we want to remove the space. This regex looks for a single non-whitespace character, followed by one or more spaces. It then writes only the matched non-whitespace characters followed by the rest of the string.

  1. Select the source column
    Column 1
    .
  2. Enter a regex pattern:
    \b(\S)\s+(?=\S\b)

    \b
    asserts position at a word boundary
    (\S)
    matches any single non-whitespace character and defines the capturing group
    \s+
    matches any whitespace character, one or more times
    (?=\S\b)
    Positive lookahead for a non-whitespace character at a word boundary. This marks the start of the rest of the string.

  3. Select the output column
    Column 1
    to overwrite the original column.
  4. Select ‘Advanced’ and enter the result pattern
    $1
    to write the matched non-whitespace characters, followed by rest of the string, to the original column.

Syntax reference

Quantemplate uses the Java version of regex. For full syntax details see the Java Regex Documentation.

Character
Function
Example
|
Or

Matches either the expression before or the expression after
|
gray|grey
matches either “gray” or “grey”
( )
Grouping
gray|grey
and
gr(a|e)y
both match “gray” or “grey”

[ ]
Matches a single character that
is
contained within the brackets
[abc]
matches “a”, “b”, or “c”

gr[ae]y
matches “gray” or “grey”, but not “graey”, “graay”, etc.

[a-z]
specifies a range which matches any lowercase letter
from “a” to “z”

[abcx-z]
matches “a”, “b”, “c”, “x”, “y”, or “z”

[hc]at
matches “hat" and “cat”

[a.c]
matches only “a”, “.”, or “c”

[^ ]
Matches a single character that
is not
contained within the brackets
[^abc]
matches any character other than “a”, “b”, or “c”

[^a-z]
matches any character that is not a lowercase letter
from “a” to “z”

[^b]at
matches all strings matched by .at except “bat”

q[^x]
matches “qu” in “question”. It does not match “Iraq” since there is no character after the q for the negated character class to match

{n}
Preceding item is matched exactly n times
a{3}
matches only “aaa”

[1-9][0-9]{3}
matches a number between 1000 and 9999

{min,}
Preceding item is matched min or more times
a{3,}
matches “aaa”, “aaaa”, “aaaaa”, “aaaaaa”, etc.

a{3,5}
matches “aaa”, “aaaa” and “aaaaa”

{min,max}
Preceding item is matched at least min times, but not more than max times
a{3,5}
matches only “aaa”, “aaaa” and “aaaaa”.

[1-9][0-9]{2,4}
matches a number between 100 and 99999

.
Matches any character
a.c
matches “abc”, “acc”, “adc”, etc.

gr.y
matches “gray”, “grey”, “gr%y”

.at
matches any three-character string ending with “at”, including “hat”, “cat”, and “bat”

?
Matches the preceding character
zero or one
times
ab?c
matches “ac”, “abc”

colou?r
matches “color” and “colour”

[hc]?at
matches “hat”, “cat”, and “at”

*
Matches the preceding character
zero or more
times
ab*c
matches “ac”, “abc”, “abbc”, “abbbc”, etc.

s.*
matches “s” followed by zero or more characters, for example: “s” and “saw” and “seed”

[hc]*at
matches “hat”, “cat”, “hhat”, “chat”, “hcat”, “cchchat”, “at”, etc.

+
Matches the preceding character
one or more
times
ab+c
matches “abc”, “abbc”, “abbbc”, but not “ac”

[hc]+at
matches “hat”, “cat”, “hhat”, “chat”, “hcat”, “cchchat”, etc. but not “at”

^
Anchors the
starting
position in a string
^[hc]at
matches “hat” and “cat”, but only at the beginning of the string

$
Anchors the
ending
position in a string
[hc]at$
matches “hat” and “cat”, but only at the end of the string

\
Escaped

Treats the escaped character as a literal rather than a regex command
\[.\]
matches any single character surrounded by “[” and “]” since the brackets are escaped, for example: “[a]” and “[b]”

1\+1=2
matches “1+1=2”



Character classes

Character
Function
\d
Any digit
\D
Not a digit
\w
Any word character (Alphahnumeric characters, plus the underscore “_”)
\W
Not a word character
\s
Whitespace character (tab, carriage return, space, etc.)
\S
Not a whitespace
\A
First character in a string
\z
Last character in a string
\u
Uppercase letters
\l
Lowercase letters
\p
Visible characters and the space character
\b
Word boundaries
\B
Non-word boundaries


Characters

Character
Function
x
The character x
\\
The backslash character
\t
The tab character
\r
The carriage-return character
\e
The escape character


Resources

Regex101.com
Easy-to-use syntax testing environment for regular expressions.

Java Regex Documentation
Full documentation of the Java Regex version used by Quantemplate.

Java Regular Expressions Tester
Another environment to test the validity and output of a regular expression, including output group replacement functionality.