Chapter 9. Regex (Regular Expression)

Table of Contents

Regex Basics
More Regex
Regex Tips and Tricks
Unsupported regex features

Rax supports regular expressions. There are many families of regular expressions; Rax supports a good subset of Perl Compatible Regular Expressions (PCRE2, see www.pcre.org).

By convention, a regex is expressed as a raw string, using single quotes. This is convenient because the backslash character, \, has special but different functions in regular strings and in regexes. That is not to say that a regex cannot be expressed as a regular string: every raw string has a regular string equivalent. If a regex contains single quotes, a raw string might actually be less desirable, since each single quote inside it has to be doubled. Compare these two equivalent regular expressions: "'[a-zA-Z']'" and '''[a-zA-Z'']'''.
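The point about backslashes can be illustrated outside Rax. The sketch below uses Python, whose raw strings are written with an r prefix rather than Rax's single quotes, so the quoting syntax is an assumption that does not carry over; only the doubling of backslashes in a regular string is the point. The variable names are made up for the illustration.

```python
import re

# Python marks a raw string with an r prefix; Rax uses single quotes instead.
raw_pattern = r'www\.pcre\.org'        # backslashes reach the regex engine as written
regular_pattern = 'www\\.pcre\\.org'   # in a regular string every backslash is doubled

# Both spellings produce exactly the same pattern text.
same_text = raw_pattern == regular_pattern

# The escaped dots only match literal dots.
found = re.search(raw_pattern, 'see www.pcre.org') is not None
not_found = re.search(raw_pattern, 'see wwwXpcreYorg') is None
```

The raw form is shorter and closer to how the regex is read, which is why it is the conventional choice.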

Regex Basics

To match a character, say 'A', use that character. The string "aap" matches the regex (string) 'aap'. To make regular expressions useful, some characters, called meta characters, have a special meaning. They allow quantifying, alternatives, anchoring, grouping, escaping and more.

Quantifying.  The (meta) character '*' means match zero or more of the item to the left. The string "aap" matches the regex 'a*p', but so would "ap", "aaaaap", and even "XXpXX". The latter matches because in Rax a regex only has to match a part of the string, and zero times the letter 'a' plus one letter 'p' can be found (i.e., matched) in the string "XXpXX". Other ways to quantify an item are '+' for match one or more and '?' for match zero or one.
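The quantifier examples can be checked with a small sketch. It uses Python's re module as a stand-in, not Rax itself; re.search is assumed to mirror Rax's partial matching, and the helper name matches is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches any part of the text."""
    return re.search(pattern, text) is not None

# '*': zero or more 'a', then 'p'. Even "XXpXX" matches (zero 'a' plus one 'p').
star = [matches(r'a*p', s) for s in ['aap', 'ap', 'aaaaap', 'XXpXX']]

# '+': at least one 'a' is required directly before the 'p'.
plus = [matches(r'a+p', s) for s in ['aap', 'ap', 'XXpXX']]

# '?': zero or one 'a'. re.fullmatch requires the whole string to match,
# which makes the limit of one 'a' visible.
optional = [re.fullmatch(r'a?p', s) is not None for s in ['p', 'ap', 'aap']]
```

With partial matching, 'a?p' would still be found inside "aap", which is why the last line uses a whole-string match.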

Alternatives.  The meta character '|' means match either everything on the left or everything on the right. The string "aap" matches the regex 'aap|noot|mies' and so will the strings "noot" and "my name is 'mies'". Another way to specify alternatives is by using what is traditionally called a character class. A character class starts with a '[' and ends with a closing ']'. For example, '[abcdefg]' would match one character in the "class" 'a' through 'g'. Ranges are so common that only the beginning and the end of a range have to be specified if a '-' is inserted in between, so the previous regex can also be expressed as '[a-g]'. There are many other ways to specify alternatives. The most common one is the meta character '.', which stands for any character. For example, any string at least three characters long would match the regex '...'.
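A short sketch of alternation, character classes and the dot, again using Python's re module as an assumed stand-in for Rax. One known difference: in Python the dot does not match a newline by default, so the test strings avoid newlines. The helper name matches is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches any part of the text."""
    return re.search(pattern, text) is not None

# '|' tries each alternative in turn.
alternation = [matches(r'aap|noot|mies', s)
               for s in ['aap', 'noot', "my name is 'mies'", 'wim']]

# A listed class and the equivalent range pick out the same characters.
explicit = re.findall(r'[abcdefg]', 'big cat')
ranged = re.findall(r'[a-g]', 'big cat')

# '...' needs three characters in a row.
three_dots = [matches(r'...', s) for s in ['ab', 'abc', 'abcd']]
```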

Anchoring.  There are many ways to anchor a (sub)regex. The most common are the anchor meta characters '^', which matches the beginning of the string, and '$', which matches the end of the string. So the string "aap" matches the regex '^aap$' but the string "aap!" does not.
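The effect of each anchor can be seen by applying them one at a time. This is a sketch in Python's re module, assumed here to behave like Rax for these simple cases; the helper name matches is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches any part of the text."""
    return re.search(pattern, text) is not None

unanchored = matches(r'aap', 'aap!')     # found somewhere in the string
start_only = matches(r'^aap', 'aap!')    # the string does start with "aap"
end_only = matches(r'aap$', 'aap!')      # but it ends with "!", not "aap"
both_exact = matches(r'^aap$', 'aap')    # the whole string is "aap"
both_extra = matches(r'^aap$', 'aap!')   # the trailing "!" breaks the '$' anchor
```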

Grouping.  All meta characters have a default scope. They either work on single neighboring characters or on the entire neighboring expression. To deviate from the default, grouping can be used. By placing a (sub)regex in parentheses, the scope of the enclosed meta characters can be limited or the scope of neighboring meta characters can be expanded. For example, the string "aapaapaap" matches the regex '^(aap)+$' but the string "aapaa" does not.
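The difference that the parentheses make is easiest to see by comparing the grouped regex with its ungrouped form. The sketch uses Python's re module as an assumed stand-in for Rax; the helper name matches is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches any part of the text."""
    return re.search(pattern, text) is not None

# Without a group, '+' applies only to the single 'p' on its left.
ungrouped = [matches(r'^aap+$', s) for s in ['aap', 'aappp', 'aapaap']]

# With a group, '+' repeats the whole sub-regex "aap".
grouped = [matches(r'^(aap)+$', s) for s in ['aap', 'aapaapaap', 'aapaa', 'aappp']]
```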

Escaping.  The meta characters are uncommon in ordinary strings, but when one is needed literally it can be what is traditionally called "escaped" to turn it back into a literal character. The meta character '\' is used for that. It works on all meta characters, including itself. Just as escaping a meta character turns it into a literal character, escaping certain literal characters turns them into special or instructional characters. The special characters are mostly unprintable characters, and the escaping works just like in regular strings. The expression "\n" stands for newline and "\t" means a tab in regex strings and regular strings alike.
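Both directions of escaping can be sketched as follows, using Python's re module as an assumed stand-in for Rax. Python raw strings (r prefix) play the role of Rax's single-quoted raw strings here, and the helper name matches is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches any part of the text."""
    return re.search(pattern, text) is not None

# Escaping a meta character makes it literal: '\*' matches a real asterisk.
literal_star = [matches(r'a\*p', s) for s in ['a*p', 'aap']]

# The backslash escapes itself; 'C:\\temp' is the text C:\temp.
literal_backslash = matches(r'\\', 'C:\\temp')

# Escaping a literal character makes it special: '\n' is a newline, '\t' a tab,
# whether the escape is resolved by the regex engine (raw string) or by the
# regular string itself.
newline_raw = matches(r'\n', 'one\ntwo')
newline_regular = matches('\n', 'one\ntwo')
tab_raw = matches(r'\t', 'one\ttwo')
```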

Combining these meta characters allows for sophisticated matching of patterns. For example, most valid email addresses will match '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$' while very few random strings do.
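The email regex can be tried against a few strings. This is a sketch in Python's re module, used as an assumed stand-in for Rax; the sample addresses and the names EMAIL and looks_like_email are made up for the illustration.

```python
import re

# The regex from the text, written as a Python raw string.
EMAIL = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'

def looks_like_email(text):
    """Return True if the whole text fits the email pattern."""
    return re.search(EMAIL, text) is not None

samples = [
    'user.name+tag@example.co.uk',   # local part, '@', domain with dots
    'not an email',                  # spaces and no '@'
    'missing-at.example.com',        # no '@'
    'user@example',                  # no dot after the '@'
    'a@b.-',                         # malformed, yet it fits the pattern
]
results = [looks_like_email(s) for s in samples]
```

The last sample shows that the pattern is a plausibility check rather than a full validator: it accepts some strings that are not valid addresses.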