Most text editing applications include Find and Replace features, although these can only replace a specific text string in the current document or selection with alternative text. For example, you could find every instance of the word cat and replace it with dog. Most software has a Match Entire Word option, which prevents, for example, catwalk being changed to dogwalk.
Some applications also accommodate a wildcard character, such as an * (asterisk). If you then search for c*t you’ll be able, for example, to replace both cot and cat with dog, should that be required. Unfortunately, things can get confused if the wildcard character itself is contained in the text.
Other programs support GREP (Global Regular Expression Print), which allows far more powerful manipulation of text. For example, if you have a database of names as tab-delimited text you can change their form from Mr. Fred James Bloggs to BLOGGS, FJ. A regular expression contains a search pattern of characters that’s compared with the target text, usually an entire document or the currently selected text. An appropriate replacement pattern is then applied.
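As a rough illustration of the name example above, here is a minimal sketch using Python's re module. The pattern, the list of titles and the function name reformat_name are assumptions made for this example rather than anything taken from a particular GREP application.

```python
import re

# Hypothetical pattern: an optional title, one or more forenames, then a surname.
NAME = re.compile(r"^(?:(?:Mr|Mrs|Ms|Dr)\.?\s+)?((?:\w+\s+)+)(\w+)$")

def reformat_name(name: str) -> str:
    """Turn 'Mr. Fred James Bloggs' into 'BLOGGS, FJ'."""
    match = NAME.match(name.strip())
    if match is None:
        return name  # leave unrecognised names untouched
    forenames = match.group(1).split()
    surname = match.group(2)
    initials = "".join(word[0].upper() for word in forenames)
    return f"{surname.upper()}, {initials}"

print(reformat_name("Mr. Fred James Bloggs"))
```

The search pattern does the matching, while the function plays the part of the replacement pattern.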
A regular expression is defined using combinations of the following:-
Literal characters each match the corresponding character in the text itself. All ASCII letters and numbers are in this category. So if you include the letter a in your regular expression it matches any occurrence of a in the text.
Metacharacters each have special properties, allowing various kinds of characters to be identified. Typically, they consist of punctuation characters and special character strings. However, this means that you must use the escape character, usually \ (backslash), in front of any such character that is to be recognised literally. So, if you want to find cat? in the text you must use cat\? as the search pattern.
In some applications, \ characters, as well as " (double quote marks), must themselves be preceded by an extra \ character. This means that cat\?, for example, must be entered as cat\\?.
Shortcuts have a special meaning and consist of a single character preceded by a \ (backslash).
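The escaping rule for cat? can be tried out in Python, used here purely as an example host language. The sample text is invented for the sketch.

```python
import re

text = "Is that a cat? Yes, a cat."

# Unescaped, ? is a quantifier: 'cat?' means 'ca' followed by an optional 't'.
loose = re.findall(r"cat?", text)

# Escaped, \? stands for a literal question mark.
literal = re.findall(r"cat\?", text)

# re.escape() adds the backslash for you.
escaped = re.escape("cat?")

# In an ordinary (non-raw) Python string each backslash must itself be doubled,
# which mirrors the cat\\? form mentioned above.
doubled = "cat\\?"

print(loose, literal, escaped, doubled == r"cat\?")
```

Raw strings (the r prefix) avoid the doubled backslash, which is why they are normally used for patterns in Python.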
The following sections consider the various types of metacharacters and shortcuts, taking the form of anchors, character classes, quantifiers, alternation and groups. In the examples shown below, single metacharacters are shown in red while special strings and the matching text are shown in blue. The author is indebted to the Regex documentation, which forms the basis of these notes.
An anchor ensures that a match only occurs at a specific position in the target text.
The \A anchor matches the notional ‘empty string’ that exists at the beginning of the target text, thereby anchoring a search pattern to the start of the material. Here are two examples, where P is the search pattern, T is the target text and R is the result, the latter showing the actual matching text in blue:-
P: \AThis
T: This old man
R: This old man
P: \A old man
T: This old man
R: «No match found»
The \Z anchor matches the notional ‘empty string’ that exists at the end of the target text, thereby anchoring a pattern to the end of the material.
P: man\Z
T: This old man
R: This old man
P: This\Z
T: This old man
R: «No match found»
The ^ (caret) metacharacter matches the notional ‘empty string’ at the start of the text or after a new line character, thereby anchoring a pattern to the beginning of a line. The \r shown here denotes the new line character, corresponding to a CR (carriage return) in the Classic Mac OS.
P: ^old
T: This\rold man
R: This\rold man
The $ (dollar) metacharacter matches the notional ‘empty string’ at the end of the text or before a new line character, thereby anchoring a pattern to the end of a line.
P: old$
T: This\rold\rman
R: This\rold\rman
This metacharacter can be used with the caret to identify a matching line, as in:-
P: ^my matching line of text$
T: my matching line of text\rline of text\r
R: my matching line of text\rline of text\r
The \< anchor matches the ‘empty string’ at the start of a word, anchoring a pattern to this point, although it isn’t recognised by all GREP applications. The vertical bar shown here indicates the anchor point, although this character doesn’t actually appear in the result.
P: \<rat
T: The brat was not a rat!
R: The brat was not a |rat!
The \> anchor matches the ‘empty string’ at the end of a word, anchoring a pattern to this point, although it isn’t recognised by all GREP applications.
P: rat\>
T: Rattle, rattle went the rat!
R: Rattle, rattle went the rat|!
The \b anchor matches the ‘empty string’ at the beginning or end of a word, anchoring a pattern to this point.
P: \brat\b
T: Rattle-de-dum went the rat
R: Rattle-de-dum went the |rat|
Such boundaries exist between a \w and \W character (see below) in either direction.
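The anchors can be explored with a short Python sketch. Note the assumptions: Python's re module treats \n rather than \r as the new line character, only applies ^ and $ to individual lines when the re.MULTILINE flag is given, and doesn't recognise the \< and \> forms, so \b is used for word boundaries.

```python
import re

text = "This\nold man"

# \A anchors the pattern to the very start of the text.
at_start = re.search(r"\AThis", text)

# Without re.MULTILINE, ^ only matches at the start of the whole text.
no_flag = re.search(r"^old", text)

# With re.MULTILINE, ^ also matches just after a new line character.
line_start = re.search(r"^old", text, re.MULTILINE)

# \b matches at a word boundary, so 'brat' is ignored and only 'rat' is found.
words = re.findall(r"\brat\b", "The brat was not a rat!")

print(at_start is not None, no_flag, line_start.start(), words)
```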
The \B anchor matches the ‘empty string’ within a word, anchoring a pattern to a position that’s not a word boundary.
P: rat\B
T: rat's like rattles!
R: rat's like rat|tles!
P: \Brat
T: you dirty, dirty rat
R: «No match found»
These elements let you create patterns for matching particular classes of characters.
The . (dot), often known as a wildcard metacharacter, matches any single character, apart from the NUL control code character, the latter also identified as \0 in regular expressions. In many implementations it also excludes the new line character unless a ‘single line’ option is set.
P: c.t
T: cathedral
R: cathedral
A character class in square brackets, known as a list or bracket expression, consists of a list of one or more metacharacters or literal characters contained within square brackets. Note that the effect of metacharacters used in such a list is different to those outside of character classes. For example, a . (dot) used inside a list represents a literal dot, not the match-all character described above.
To include a ] (square bracket) in a list you should either enter it as the first item, as in []b] or precede it with a \ (backslash), as in [b\]].
The following items can be found in a list:-
Here are two examples of how literal characters can be used:-
| Pattern | Matches |
|---|---|
| [abc] | a, b or c |
| Defen[sc]e | Defense or Defence |
The - (dash) metacharacter is placed between two other characters to indicate a range of characters in the normal ASCII sequence. However, the dash reverts to a literal character when it’s the first or last character in the list, or when it’s preceded by a \ (backslash).
Here are three examples:-
| Pattern | Matches |
|---|---|
| [a-z] | Any lowercase letter from a to z |
| [0-9] | Any number from 0 to 9 |
| <h1>[a-zA-Z0-9 ]+</h1> | Level 1 HTML heading |
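The list and range patterns in the tables above can be checked with a few lines of Python. The sample strings are invented for this sketch.

```python
import re

# A list of literal characters: either spelling is accepted.
spelling = re.findall(r"Defen[sc]e", "Defense and Defence, but not Defenze")

# A range: any single digit from 0 to 9.
digits = re.findall(r"[0-9]", "Room 42b")

# Several ranges plus a space, repeated, between the heading tags.
heading = re.search(r"<h1>[a-zA-Z0-9 ]+</h1>", "<body><h1>Chapter 1</h1></body>")

print(spelling, digits, heading.group())
```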
When used as the first element in the list, the ^ (caret) matches any character not in the list. In other words, it negates the rest of the list.
Here are two examples:-
| Pattern | Matches |
|---|---|
| [^a-z] | Any character other than a lowercase letter |
| <!--[^>]+--> | HTML comments |
In the second of these, the pattern in the list matches any character up until a >.
As usual, the \ (backslash) acts as an escape character, allowing characters to be represented literally. For example, [,\-\]] can be used to match a single character that’s a comma, a dash or a closing square bracket.
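Negated lists and escaped characters inside a list behave as follows in Python, which is used here only as a convenient test bed. The sample strings are invented.

```python
import re

# A negated list: every character that is not a lowercase letter.
non_lower = re.findall(r"[^a-z]", "abC-9")

# Escaped dash and bracket inside a list: each match is ONE character.
punctuation = re.findall(r"[,\-\]]", "a,b-c]d[e")

# A closing bracket placed first in the list is also taken literally.
bracket_first = re.findall(r"[]b]", "a]b")

print(non_lower, punctuation, bracket_first)
```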
POSIX stands for Portable Operating System Interface, which allows standard terminology to be used across different operating systems. These classes, which must be enclosed in double brackets, are identified by unique strings of characters, as shown below.
| POSIX Class | Matches |
|---|---|
| [[:alnum:]] | Letters, including diacritical characters, and digits |
| [[:alpha:]] | Letters, including diacritical characters |
| [[:ascii:]] | All characters (ASCII codes 0 to 127) • |
| [[:blank:]] | Space or tab (ASCII code 32 or 09) |
| [[:cntrl:]] | Control codes (ASCII codes 0 to 31, also code 127) |
| [[:digit:]] | Any numerical digit from 0 to 9 |
| [[:graph:]] | Printable characters, except space (ASCII codes 33 to 126) |
| [[:lower:]] | Lowercase letters, including diacritical characters |
| [[:print:]] | Printable characters (ASCII codes 32 to 126) |
| [[:punct:]] | Non-control and non-alphanumeric characters used for punctuation |
| [[:space:]] | Space, HT (tab), CR (carriage return), LF (line feed), VT or FF |
| [[:upper:]] | Uppercase letters, including diacritical characters |
| [[:word:]] | 'Word' alphanumeric characters, including some non-ASCII • |
| [[:xdigit:]] | Hexadecimal digits 0-9, a-f or A-F |
The enhanced version of POSIX used with the Perl language also allows negation of classes, as in [[:^digit:]], which detects a non-numerical character.
Character class shortcuts use normal characters preceded by a \ (backslash), as shown here:-
| Shortcut | Matches |
|---|---|
| \d | Any digit, as in [0-9] pattern • |
| \D | Any non-digit character, as in [^0-9] pattern |
| \s | Whitespace: space, HT (tab), CR (carriage return), LF (line feed), VT or FF |
| \S | Non-whitespace character (opposite of \s) |
| \w | Word character; similar to [a-zA-Z0-9_] but including diacriticals |
| \W | Non-word character; similar to [^a-zA-Z0-9_] but excluding diacriticals |
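The shortcuts in this table can be tried in Python, as sketched below. By default Python's \w includes accented letters, much as the table describes, while the re.ASCII flag restricts it to [a-zA-Z0-9_]. The sample text is invented.

```python
import re

text = "Flat 12, Caf\u00e9_Rouge\t2004"

digits = re.findall(r"\d+", text)                 # runs of digits
words = re.findall(r"\w+", text)                  # word characters, accents included
ascii_words = re.findall(r"\w+", text, re.ASCII)  # accents now break a word
parts = re.split(r"\s+", text)                    # split on any whitespace

print(digits, words, ascii_words, parts)
```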
The following character shortcuts are used by Frontier and other applications:-
| Shortcut | Matches |
|---|---|
| \xNN | ASCII character or code identified by hex value of NN |
| \^C | Control code, where C is character @, A to Z, [ to _ * |
| \b | BS (Backspace); same as \x08 * |
| \e | ESC (Escape); same as \x1B * |
| \f | FF (Form Feed); same as \x0C |
| \n | New line as LF (line feed); same as \x0A • |
| \r | CR (carriage return); same as \x0D • |
| \s | SP (Space); same as \x20 * |
| \t | HT (Horizontal Tab); normal tab; same as \x09 • |
| \\ | \ (Backslash) |
• Frontier constant
* Non-standard and not universally used
A quantifier or repeat sub-pattern describes how many times a preceding match should occur.
The * (asterisk) metacharacter accepts any number of preceding matches or none at all, as shown here:-
P: x*
T: xxxxxxxxx
R: xxxxxxxxx
P: ca*t
T: The cat made a noise like "caat followed by ct"
R: The cat made a noise like "caat followed by ct"
The next example shows how valid HTML tags that contain spaces can be detected:-
P: < *h1 *>
T: <h1>My heading</h1>some text< h1 >2nd heading</h1>
R: <h1>My heading</h1>some text< h1 >2nd heading</h1>
Remember, this operator always matches something, even the nominal ‘empty string’ that occurs at the beginning of the target text, as shown here:-
P: x*
T: This old man
R: |This old man
although you should note that the result is at the boundary of the text but isn’t the text itself.
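The asterisk examples, including the empty match at the start of the text, can be reproduced with this Python sketch:-

```python
import re

# Zero or more 'a' characters between 'c' and 't'.
cats = re.findall(r"ca*t", 'The cat made a noise like "caat followed by ct"')

# Optional spaces inside the opening tag.
tags = re.findall(r"< *h1 *>", "<h1>My heading</h1>some text< h1 >2nd heading</h1>")

# x* succeeds even when there is no x: it matches the empty string at position 0.
empty = re.search(r"x*", "This old man")

print(cats, tags, repr(empty.group()), empty.span())
```

The span of (0, 0) shows that the match sits at the boundary of the text and contains no characters.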
The + (plus) metacharacter accepts one or more preceding matches, as shown below:-
P: ca+
T: car and caaar and cr
R: car and caaar and cr
When trying to match at least one occurrence, this is more effective than the asterisk operator.
The ? (question mark) metacharacter is similar to the asterisk metacharacter but only accepts one preceding match or none at all.
P: ca?
T: car and caaar and cr
R: car and caaar and cr
The { } (braces) quantifier lets you set how many times a preceding match is allowed to occur. The following forms are used:-
| Pattern | Matches |
|---|---|
| {COUNT} | COUNT occurrences of preceding expression |
| {MIN,} | MIN or more occurrences of preceding expression |
| {MIN,MAX} | Between MIN and MAX occurrences of preceding expression |
where COUNT, MIN and MAX must be replaced by integer numbers. Here’s an example:-
P: ab{1,3}
T: a, ab, abb, abbb, abbbb
R: a, ab, abb, abbb, abbbb
Unfortunately, all the quantifiers described above are greedy. In other words, they find the longest match. Consider the following HTML example:-
P: <.+>
T: xxx<b>My bold text</b>yyy
R: xxx<b>My bold text</b>yyy
which, instead of matching the individual HTML tags, finds all the text in the outer < and > brackets. This is caused by the quantifier, which takes the search to the end of the line, then works backwards until it hits the last > bracket. You can get round this problem by using the pattern shown here:-
P: <[^>]+>
T: xxx<b>My bold text</b>yyy
R: xxx<b>My bold text</b>yyy
or by using a non-greedy quantifier, as provided in applications such as BBEdit 6.5:-
P: <.+?>
T: xxx<b>My bold text</b>yyy
R: xxx<b>My bold text</b>yyy
As you can see, the quantifier is changed into this form by adding a question mark.
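The three patterns can be compared side by side in Python, whose re module also supports the non-greedy form:-

```python
import re

text = "xxx<b>My bold text</b>yyy"

greedy = re.findall(r"<.+>", text)      # runs on to the last > in the line
negated = re.findall(r"<[^>]+>", text)  # stops at the first > after each <
lazy = re.findall(r"<.+?>", text)       # non-greedy: shortest possible match

print(greedy, negated, lazy)
```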
Alternation allows the use of different expressions in a list, any of which could be a match. The list of expressions is delimited by a vertical bar, as shown in the following example:-
P: first|1st|winner
T: the eventual winner was 1st in the first race
R: the eventual winner was 1st in the first race
Although regular expressions are greedy, causing them to obtain the longest possible first match, the alternation metacharacter has the lowest priority of any operator. This is demonstrated by the following example, which doesn’t give the expected result:-
P: this and|or that
T: you can say this and that or this or that
R: you can say this and that or this or that
To fix this problem you can use grouping brackets (see below), as shown here:-
P: this (and|or) that
T: you can say this and that or this or that
R: you can say this and that or this or that
Enclosing a pattern within curved brackets creates a group, also known as sub-pattern or sub-expression. As mentioned above, this limits the scope of the alternation metacharacter. It also lets you identify elements in a search pattern or apply a quantifier to an entire group, as in this case:-
P: [0-9]+(\.[0-9]*)?
T: Most cars have 4 seats, but an average family contains 4.2 people
R: Most cars have 4 seats, but an average family contains 4.2 people
where the ‘match one or none’ operator is used to ensure that both numbers are detected successfully, even though the first value doesn’t contain a decimal fraction.
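The effect of grouping on alternation, and of a quantifier applied to a whole group, can be seen in this Python sketch. finditer is used instead of findall so that the full matching text is reported rather than just the contents of the group.

```python
import re

text = "you can say this and that or this or that"

# Without brackets the alternatives are 'this and' and 'or that'.
ungrouped = [m.group() for m in re.finditer(r"this and|or that", text)]

# With brackets only 'and' and 'or' are alternatives.
grouped = [m.group() for m in re.finditer(r"this (and|or) that", text)]

# The optional group lets both whole numbers and decimals match.
sentence = "Most cars have 4 seats, but an average family contains 4.2 people"
numbers = [m.group() for m in re.finditer(r"[0-9]+(\.[0-9]*)?", sentence)]

print(ungrouped, grouped, numbers)
```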
A back-reference refers to earlier groups or sub-patterns. It consists of a \ (backslash) followed by an integer number, the latter obtained by counting the previous sub-patterns from left to right. The following example detects doubled words:-
P: (\w+)\s+\1
T: The big black black dog
R: The big black black dog
In this instance, the sub-pattern that captures the letters of a word is referred to as \1. The doubled word can then be replaced in a replacement string (see below) by simply entering \1.
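The doubled-word search and its replacement can be sketched in Python, where \1 has the same meaning in both the search pattern and the replacement string:-

```python
import re

text = "The big black black dog"

# \1 must repeat whatever the first group captured.
doubled = re.search(r"(\w+)\s+\1", text)

# Replacing the whole match by \1 keeps just one copy of the word.
fixed = re.sub(r"(\w+)\s+\1", r"\1", text)

print(doubled.group(), "->", fixed)
```

In practice you might add \b at each end of the pattern so that a pair such as 'the then' isn't treated as a doubled word, although that goes beyond the example above.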
To exclude a group from the back-reference count, add ?: immediately after the opening bracket of the group.
Some applications use a ^ (caret) prefix instead of \ for back-references.
Applications often use Perl-style pattern extensions, some of which are described below, and all of which begin with a curved open bracket followed by a question mark.
As mentioned above, back-references are counted from left to right. However, any groups that begin with ?: are excluded from the count, allowing you to use convenient groupings with logical numbering. Technically speaking, this is known as cluster-only parentheses or a non-capturing group.
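The difference in numbering can be seen by comparing the groups reported by Python for two otherwise identical patterns. The sample text is invented for this sketch.

```python
import re

text = "Dear Mrs Bloggs"

# Both sets of brackets capture, so the surname is group 2.
capturing = re.search(r"(Mr|Mrs) (\w+)", text)

# The title group begins with ?: so the surname becomes group 1.
clustered = re.search(r"(?:Mr|Mrs) (\w+)", text)

print(capturing.groups(), clustered.groups())
```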
A group that begins with ?# contains comment text and can’t be enclosed in other groups, as shown here:-
P: ger(?#this is a comment)anium
T: the geranium in a pot
R: the geranium in a pot
Here’s an example where the matching of text is forced to be case-insensitive:-
P: (?i)cat
T: cat CAT
R: cat CAT
The search can be made sensitive to case by inserting a ‘negating’ hyphen:-
P: (?-i)cat
T: cat CAT
R: cat CAT
Perl also accommodates other search modifier letters that can be used in combination with i. Note that a colon can also be added after such letters if you don’t want the group included in back-references.
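Here is how the case modifier behaves in Python, offered as a sketch rather than a description of any particular GREP application. One assumption to note: Python's documentation gives the negating hyphen only in the scoped form with a colon, as in (?-i:...), so that form is used below.

```python
import re

# (?i) at the start makes the whole pattern insensitive to case.
insensitive = re.findall(r"(?i)cat", "cat CAT")

# Without it the search is sensitive to case.
sensitive = re.findall(r"cat", "cat CAT")

# (?-i:...) switches case sensitivity back on for part of the pattern.
mixed = re.findall(r"(?i)c(?-i:at)", "cat Cat CAT")

print(insensitive, sensitive, mixed)
```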
In common with all the other positional assertions that follow, the look-ahead extension, which begins with ?=, doesn’t include the look-ahead or look-behind material in the result. In this example, the look-ahead function detects the semicolon after a word, giving a result of cat but without including the actual semicolon:-
P: \w+(?=;)
T: mouse cat; dog
R: mouse cat; dog
Such a result, in this case lacking the semicolon, can also be used in a back-reference.
The negative look-ahead extension, which begins with ?!, looks for something that doesn’t exist after a pattern, as in:-
P: one(?!two)
T: one onetwo two
R: one onetwo two
Once again, the look-ahead element isn’t included in the result.
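Both look-ahead forms can be tested in Python. The start positions are reported for the second pattern to show which of the two occurrences of one was accepted.

```python
import re

# Positive look-ahead: a word followed by a semicolon, semicolon not included.
before_semicolon = re.findall(r"\w+(?=;)", "mouse cat; dog")

# Negative look-ahead: 'one' that is NOT followed by 'two'.
positions = [m.start() for m in re.finditer(r"one(?!two)", "one onetwo two")]

print(before_semicolon, positions)
```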
The look-behind extension, which begins with ?<=, checks for text immediately before a pattern. This example looks for one immediately before two and returns the latter:-
P: (?<=one)two
T: one onetwo two
R: one onetwo two
Strings matched by a look-behind assertion must have a fixed length, which means that patterns creating results of varying lengths can’t be used within the brackets.
The negative look-behind extension, which begins with ?<!, checks that text doesn’t appear immediately before a pattern. This example looks for two that isn’t preceded by one:-
P: (?<!one)two
T: one onetwo two
R: one onetwo two
Once again, the matched strings must have a fixed length.
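The two look-behind forms, and the fixed-length restriction, can be checked with this Python sketch. Python's re module enforces the same restriction, raising an error when the pattern is compiled.

```python
import re

text = "one onetwo two"

# 'two' immediately preceded by 'one'.
behind = [m.start() for m in re.finditer(r"(?<=one)two", text)]

# 'two' NOT immediately preceded by 'one'.
not_behind = [m.start() for m in re.finditer(r"(?<!one)two", text)]

# A look-behind of varying length is rejected.
try:
    re.compile(r"(?<=o+)two")
    variable_allowed = True
except re.error:
    variable_allowed = False

print(behind, not_behind, variable_allowed)
```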
Conditional expressions take the following forms:-
IF-THEN: (?(condition)YES-pattern)
IF-THEN-ELSE: (?(condition)YES-pattern|NO-pattern)
Here’s an example of how the IF-THEN-ELSE conditional can be used:-
P: \d+(?(?<=[13579]) is odd| is even)
T: 123 is odd, 28 is even, 13 is even
R: 123 is odd, 28 is even, 13 is even
Under normal circumstances, searching for a pattern such as \d+xx in a string such as 123yy involves multiple searches. Having matched the three numbers, the search engine finds that the letters don’t correspond so tries again, looking for the letters after the first two numbers. It repeats this until the match finally fails. Normally, this isn’t a problem. However, where large amounts of text are being processed, this extra processing can be time-consuming. To prevent such repeated searching you can use the once-only form, which for the above example is (?>\d+)xx.
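The once-only behaviour can be approximated in Python as sketched below. This is an emulation, not the (?>...) syntax itself, which Python's re module only accepts from version 3.11 onwards. It relies on the fact that a look-ahead, once matched, isn't re-entered: the look-ahead captures the whole run of digits and \1 then consumes it.

```python
import re

# Ordinary greedy match: \d+ gives back one digit so the final \d can match.
backtracking = re.search(r"\d+\d", "123")

# Emulated once-only group: the captured run of digits can't be shortened,
# so nothing is left for the final \d and the search fails.
once_only = re.search(r"(?=(\d+))\1\d", "123")

# The pattern from the text, written in the emulated form.
found = re.search(r"(?=(\d+))\1xx", "123xx")
missing = re.search(r"(?=(\d+))\1xx", "123yy")

print(backtracking.group(), once_only, found.group(), missing)
```

The result is the same as for \d+xx, but the engine doesn't retry shorter runs of digits from each starting position.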
Regex Documentation, Script Meridian, 1998
BBEdit 6.5 Help file, Bare Bones Software Inc, 2002
©Ray White 2004.