The Regex operation allows you to find data matching a certain pattern which can then be extracted or changed. For example:
Change all instances of
Rd
to
Road
.
Extract the house number from a column of addresses.
Change a 4-digit ZIP code to a 5-digit ZIP code.
Patterns are defined using the language known as 'Regular Expressions', commonly called Regexes.
Watch the Regular Expressions tutorial to learn more.
Workspace
Column
Select the source column to search.
Regex
Enter the search term or regex pattern.
Output column
Either name a new output column, or select the source column to overwrite it. If no column name is entered, an unnamed column will be created.
Configure output
To remove the regex matches, select ‘Remove matched value’.
To replace the regex matches select ‘Advanced’ and enter a text string or a result pattern (e.g. $1, $2, $3).
Search patterns
Enter the search pattern in the Regex field. This could be a specific sequence of text or numbers, such as the word “Rd”. Alternatively, to search for patterns, use the Regular Expressions syntax. For example:
.
matches any character
\d
matches any digit
\w
matches any word character
( )
matches groupings of characters
See the Syntax reference section for details and examples of the most commonly used Regex syntax.
Result patterns
Once a pattern has been matched, it can be either be removed, replaced, modified or repeated. The result can overwrite the source column or can be written to a new column. The $ symbol is used to reference groups in the regex pattern.
Entering a simple text string, e.g.
Road
will overwrite all matches with the word “Road”.
$1
will write the first matched pattern. If no groups are defined, the whole match is written.
abc$1
will append “abc” to the start of the matched pattern.
If the pattern contains multiple groups,
$1
references the first group,
$2
the second group, etc.
For example, to change the date
01-Jan-15
to
01-Jan-2015
write a regex expression defining the year characters as one group and the other characters as another group:
(\d\d-\w\w\w-)(\d\d)
Now enter the results pattern:
$120$2
$1 writes the first group.
The digits
20
are added.
$2 writes the second group.
The result is
01-Jan-2015
Examples
1. Find and replace text
We want to remove the the word
Lime
from all rows in the the
Address
column.
1.
Select the source column
Address
.
2.
Enter the regex pattern to match: in this case we can just write the text
Lime
.
3.
Output to a new column called
Building No.
.
4.
Select ‘Remove matched value’ to create a new column from the contents of the
Address
column, minus the word
Lime
.
2. Change specific characters within a string
We want to change all instances of
Rd
to
Road
in the
Street
column.
1.
Select the source column
Street
.
2.
Enter the regex pattern to match: in this case we can just write the text
Rd
.
3.
We want to overwrite the data in the source column, so we select the output column to be
Street
, the same as the source.
4.
Select ‘Advanced’ to replace the matched patterns. Enter the replacement text
Road
.
3. Extract any numbers at the start of a string
We want to find the building number from a column of addresses and extract it to a new column. We can do this by writing a regex to identify any sequence of numbers at the start of an address.
1.
Select the source column
Address
.
2.
Enter a regex pattern:
^(\d+).+
^
anchors to the start of the string.
(\d+)
looks for a sequence of digits of any length.
.+
indicates that the digits should be followed by any number of characters.
3.
Output to a new column called
Building No.
.
4.
Select ‘Advanced’ and enter the result pattern
$1
to write the matched patterns to the new column.
4. Add a 0 to the start of 4 digit numbers
We have a column US ZIP Codes, which should all be 5 digits. Some are only four digits, so we want to find them and add a 0 to the start of each one.
1.
Select the source column
ZIP Code
.
2.
Enter a regex pattern:
^(\d{4})$
^
anchors to the start of the string.
(\d{4})
looks for a sequence of 4 digits.
$
anchors to the end of the string.
Both start an end anchors are required to ensure that the number is exactly 4 digits. Without them the regex would look for any 4 digits in a string.
3.
We want to overwrite the data in the source column, so we select the output column to be
ZIP Code
, the same as the source.
4.
Select ‘Advanced’ and enter the result pattern
0$1
to overwrite
0
followed by the matched pattern.
Syntax reference
Quantemplate uses the Java version of regex. For full syntax details see the Java Regex Documentation.
Character
Function
Example
|
Or
Matches either the expression before or the expression after
|
gray|grey
matches either “gray” or “grey”
( )
Grouping
gray|grey
and
gr(a|e)y
both match “gray” or “grey”
[ ]
Matches a single character that
is
contained within the brackets
[abc]
matches “a”, “b”, or “c”
gr[ae]y
matches “gray” or “grey”, but not “graey”, “graay”, etc.
[a-z]
specifies a range which matches any lowercase letter from “a” to “z”
[abcx-z]
matches “a”, “b”, “c”, “x”, “y”, or “z”
[hc]at
[hc]at matches “hat" and “cat”
[a.c]
matches only “a”, “.”, or “c”
[^ ]
Matches a single character that
is not
contained within the brackets
[^abc]
matches any character other than “a”, “b”, or “c”
[^a-z]
matches any character that is not a lowercase letter from “a” to “z”
[^b]at
matches all strings matched by .at except “bat”
q[^x]
matches “qu” in “question”. It does not match “Iraq” since there is no character after the q for the negated character class to match
{n}
Preceding item is matched exactly n times
a{3}
matches only “aaa”
[1-9][0-9]{3}
matches a number between 1000 and 9999
{min,}
Preceding item is matched min or more times
a{3,}
matches “aaa”, “aaaa”, “aaaaa”, “aaaaaa”, etc.
a{3,5}
matches “aaa”, “aaaa” and “aaaaa”
{min,max}
Preceding item is matched at least min times, but not more than max times
a{3,5}
matches only “aaa”, “aaaa” and “aaaaa”.
[1-9][0-9]{2,4}
matches a number between 100 and 99999
.
Matches any character
a.c
matches “abc”, “acc”, “adc”, etc.
gr.y
matches “gray”, “grey”, “gr%y”
.at
matches any three-character string ending with “at”, including “hat”, “cat”, and “bat”
?
Matches the preceding character
zero or one
times
ab?c
matches “ac”, “abc”
colou?r
matches “color” and “colour”
[hc]?at
matches “hat”, “cat”, and “at”
*
Matches the preceding character
zero or more
times
ab*c
matches “ac”, “abc”, “abbc”, “abbbc”, etc.
s.*
matches “s” followed by zero or more characters, for example: “s” and “saw” and “seed”
[hc]*at
matches “hat”, “cat”, “hhat”, “chat”, “hcat”, “cchchat”, “at”, etc.
+
Matches the preceding character
one or more
times
ab+c
matches “abc”, “abbc”, “abbbc”, but not “ac”
[hc]+at
matches “hat”, “cat”, “hhat”, “chat”, “hcat”, “cchchat”, etc. but not “at”
^
Anchors the
starting
position in a string
^[hc]at
matches “hat” and “cat”, but only at the beginning of the string
$
Anchors the
ending
position in a string
[hc]at$
matches “hat” and “cat”, but only at the end of the string
\
Escaped
Treats the escaped character as a literal rather than a regex command
\[.\]
matches any single character surrounded by “[” and “]” since the brackets are escaped, for example: “[a]” and “[b]”
1\+1=2
matches “1+1=2”
Character classes
Character
Function
\d
Any digit
\D
Not a digit
\w
Any word character (Alphahnumeric characters, plus the underscore “_”)
\W
Not a word character
\s
Whitespace character (tab, carriage return, space, etc.)