MS Word Regex

MS Word has a handy regex-like (wildcard) search and replace. The most common options are listed below.

Basic Regex used in Word

? and *

The two most basic wildcards are ? and *. They are similar in use.
? represents a single character and * represents any number of characters. On their own, these have limited use.
s?t
will find sat, set, sit, sot and any other combination of three characters beginning with ‘s’ and ending with ‘t’. It will also find that combination of letters within a word, thus it would locate the relevant part (‘set’) of inset etc.
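
For readers more familiar with conventional regular expressions, the behaviour described above can be approximated with Python's re module. This is only an analogy, assuming that Word's ? corresponds to a dot and Word's * to a lazy "any run of characters"; it is not how Word itself is implemented.

```python
import re

# Assumed analogy: Word's s?t behaves like the regex s.t
single = re.compile(r"s.t")
words = ["sat", "set", "sit", "sot", "inset", "soot"]

# Map each word to the part that matched (None if nothing matched).
found = {}
for w in words:
    m = single.search(w)
    found[w] = m.group(0) if m else None

# Assumed analogy: Word's s*t behaves like the lazy regex s.*?t
any_run = re.compile(r"s.*?t")
span = any_run.search("sailing boat").group(0)

print(found)
print(span)
```

The word "inset" matches on its "set" part, while "soot" does not match because two characters sit between the s and the t.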

@

@ is used to find one or more occurrences of the previous character, e.g. lo@t will find lot or loot; ful@ will find ful or full etc.
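
As a rough comparison, and assuming Word's @ plays the role of the + quantifier in conventional regular expressions, the two examples can be tried in Python:

```python
import re

# Assumed analogy: lo@t in Word is lo+t as a regex, ful@ is ful+
lot = re.compile(r"lo+t")
ful = re.compile(r"ful+")

# Only words with at least one "o" between l and t match in full.
lot_hits = [w for w in ["lt", "lot", "loot"] if lot.fullmatch(w)]

# The matched part of each word: "ful" in wilful, "full" in full.
ful_hits = [ful.search(w).group(0) for w in ["wilful", "full"]]

print(lot_hits)
print(ful_hits)
```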

< >

With any of the above (or any other combination of wildcards and characters), you can use the brackets < and > to mark the start and end of a word respectively. Thus, using ‘*’ as an example:
<s*t> would find ‘secret’ and ‘serpent’, but not ‘sailing boats’ and ‘sign over documents’. Though again, given the use of ‘*’, beware as it will find the block of text from a word starting with ‘s’ to the end of the next word in the document ending with ‘t’, e.g. ‘sailing boat’, which may not be what you had in mind.
The < > brackets can be used in pairs as above or individually as appropriate, e.g. ful@> will find ‘full’ and the appropriate part of ‘wilful’ but not ‘wilfully’.
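
The word-boundary behaviour can be sketched in Python, assuming that Word's < and > are comparable to the \b word boundary of conventional regular expressions (an approximation, since \b does not distinguish start from end of word):

```python
import re

# Assumed analogy: <s*t> in Word is \bs.*?t\b as a regex
whole = re.compile(r"\bs.*?t\b")
samples = ["secret", "serpent", "sailing boats", "sailing boat"]

results = {}
for s in samples:
    m = whole.search(s)
    results[s] = m.group(0) if m else None

# Assumed analogy: ful@> in Word is ful+\b as a regex
ending = re.compile(r"ful+\b")
ends = {w: bool(ending.search(w)) for w in ["full", "wilful", "wilfully"]}

print(results)
print(ends)
```

Note that "sailing boat" matches as one block running across two words, which is the trap described above.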

[]

Square brackets are always used in pairs and are used to identify specific characters or ranges of characters. e.g.:
[abc] will find any of the letters a, b or c; [F] will find upper case ‘F’;
[A-Z] will find any upper case letter; [0-9] will find any single digit; [13579] will find any odd digit; [0-9A-Za-z] will find any digit or letter.
The characters can be any character or series of characters, including space. Characters are processed in order – lowest first. If you are uncertain which character is lower than another check with ‘Insert > Symbol’.

[!]

[!] is very similar to [ ] except that it finds any character not listed in the brackets, so [!o] would find every character except “o”.
You can use ranges of characters in exactly the same way as with [ ], thus [!A-Z] will find everything except upper case letters.

You can paste any (Unicode) character into your search expressions (unfortunately *not* characters from decorative (Symbol) fonts). For example, copying the first and last characters of the Greek or Cyrillic subsets into a search gives:
[;-ώ] would match any Greek character α β γ δ ε …
<[Ё-ґ]@> matches any Cyrillic word: Вы можете помочь мне? (“Can you help me please?”)
You can type in Unicode characters with the Alt-key (make sure NumLock is on, then hold down the Alt-key and type the numbers on the numeric keyboard). Since all characters from decorative fonts (Symbol-, Wingdings-fonts …) are kept in a special code page from &HF000 to &HF0FF, you can search for them with [Alt61472-Alt61695].
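
If you need the decimal number to type after the Alt key, one way to look it up is to ask Python for the character's code point. The snippet below is a small sketch for the Cyrillic range used above, and it also converts the two Alt numbers quoted for decorative fonts into the hexadecimal notation of the code page.

```python
# Decimal codes to type with Alt on the numeric keypad (NumLock on).
cyrillic_first = "\u0401"  # Cyrillic capital letter Io
cyrillic_last = "\u0491"   # Cyrillic small letter Ghe with upturn
alt_codes = {ch: ord(ch) for ch in (cyrillic_first, cyrillic_last)}

# The Alt numbers quoted above, converted to hexadecimal.
symbol_range_hex = (hex(61472), hex(61695))

print(alt_codes)
print(symbol_range_hex)
```

The second result shows that the quoted Alt range runs from F020 to F0FF, i.e. it starts 32 positions into the F000 block.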

{}

Curly brackets are used for counting occurrences of the previous character or expression.
{n} finds exactly ‘n’ occurrences of the previous character or expression.
{n,} finds at least ‘n’ occurrences.
{n,m} finds from ‘n’ to ‘m’ occurrences.
Note: The above examples employ a comma as a list separator {n,m} – for languages that use alternative list separators, substitute the local separator character (often a semi-colon {n;m}) as appropriate.
Counting can be used with individual characters or, more usefully, with sets of characters (e.g. [deno]{4} will match done, node, eden) or bracketed groups: (ts, ){3} will match ts, ts, ts, .
(Unfortunately, there is no wildcard to search for “zero or more occurrences” in Word wildcard searches; [!^13]{0,} does not work).
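
The counting syntax happens to look the same in conventional regular expressions, so the examples above can be checked in Python. This is a comparison only; the last line shows the one difference mentioned, namely that Python accepts {0,} where Word does not.

```python
import re

# [deno]{4}: exactly four characters, each one of d, e, n, o
exactly_four = re.compile(r"[deno]{4}")
four_hits = [w for w in ["done", "node", "eden", "dune"] if exactly_four.fullmatch(w)]

# (ts, ){3}: the bracketed group repeated exactly three times
group_hit = re.search(r"(ts, ){3}", "ts, ts, ts, ts, ").group(0)

# Unlike Word, Python accepts {0,} ("zero or more"), here matching an empty string.
zero_ok = re.fullmatch(r"a{0,}", "") is not None

print(four_hits)
print(repr(group_hit))
print(zero_ok)
```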

( )

Round brackets have no effect on the search pattern, but are used to divide the pattern into logical sequences, where you wish to re-assemble those sequences in a different order during the replace – or to replace only part of that sequence. They must be used in pairs.
In the replacement string, a backslash followed by a number indicates which pair of brackets, numbered from the left, is required, e.g.
(John) (Smith) replaced by \2 \1 (note the spaces in the search and replace strings) will produce Smith John, or replaced by \2 alone will give Smith.
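
The same numbered-group idea exists in Python's re.sub, which makes it easy to try the example outside Word. This is a sketch of the equivalent behaviour, not of Word's own engine.

```python
import re

# Two bracketed groups; \1 is John, \2 is Smith.
pattern = r"(John) (Smith)"

swapped = re.sub(pattern, r"\2 \1", "John Smith")  # reorder both groups
surname = re.sub(pattern, r"\2", "John Smith")     # keep only the second group

print(swapped)
print(surname)
```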

\

If you wish to search for a character that has a special meaning in wildcard searches (the obvious example being ‘?’), you can do so by putting a backslash in front of it: [\?] will find the question mark character ‘?’.
If you wish to find the backslash itself then you need to precede that with a backslash: [\\].
The following characters have a special meaning in wildcard searches: [ ] { } < > ( ) - @ ? ! * \
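
When search strings are built by a script, the escaping rule can be automated. The helper below is hypothetical (the name escape_for_word_wildcards is made up for this sketch) and assumes that a backslash is accepted in front of every character in the list above; test the result in Word before relying on it.

```python
# Characters listed above as special in Word wildcard searches.
WORD_SPECIALS = set("[]{}<>()-@?!*\\")


def escape_for_word_wildcards(text: str) -> str:
    """Put a backslash in front of every character Word treats as special."""
    return "".join("\\" + ch if ch in WORD_SPECIALS else ch for ch in text)


escaped = escape_for_word_wildcards("Why? (a-b)")
print(escaped)
```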

NOTES

You may wish to identify a character string by means of a paragraph mark. The normal search string for this would be ^p.
^p DOES NOT WORK in wildcard search strings! It must however be used in replacement strings, but when searching, you must look for the substitute code ^13.
Wildcard searches will also not find footnote/endnote marks – substitute ^2.
A-z would be expected to match all the letters between A and z, i.e. both upper case and lower case letters, which it does, but it matches all the characters from ASCII 65 to ASCII 122, and that block also includes the characters [ \ ] ^ _ `. Use A-Za-z instead.
The question mark ? is used to find individual characters. If used with curly brackets to define a range of characters, e.g.
#?{1,3}#
 it will behave as an asterisk and find all the characters between the hash symbols.
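
The A-z note can be verified by listing the ASCII characters that sit between upper case Z and lower case a. Python's character classes follow the same ASCII ordering, so the snippet also shows the difference between the two ranges on a sample string.

```python
import re

# Characters with ASCII codes 91 to 96, i.e. between "Z" (90) and "a" (97).
between = "".join(chr(c) for c in range(ord("Z") + 1, ord("a")))

sample = "a_Z[9"
loose = re.findall(r"[A-z]", sample)      # also picks up _ and [
strict = re.findall(r"[A-Za-z]", sample)  # letters only

print(between)
print(loose)
print(strict)
```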

Word Specific Control Codes used with the search/replace tool

Code Notes
^1 In-line picture
^2 Auto referenced footnotes
^5 Annotation mark
^9 Tab
^11 New line
^12 Page or Section break
^13 Paragraph break / ‘carriage’ return
^14 Column break
^19 Opening field brace (when field braces are visible)
^21 Closing field brace (when field braces are visible)
? Question mark
^? Any single character (not valid in the Replace box)
^- Optional hyphen
^~ Non-breaking hyphen
^^ Caret character
^# Any digit
^$ Any letter
^& Contents of ‘Find What’ box (Replace box only)
^+ Em dash (not valid in the Replace box)
^= En dash (not valid in the Replace box)
^u8195 Em Space Unicode character value search (not valid in the Replace box)
^u8194 En Space Unicode character value search (not valid in the Replace box)
^a Comment (not valid in the replace box) (Word 97-2000 only)
^b Section break (not valid in the replace box)
^c Replace with Clipboard contents (Replace box only)
^d Field
^e Endnote Mark (not valid in the Replace box)
^f Footnote Mark (not valid in the Replace box)
^g Graphic (In Line Graphics Only). In Word 2007 a forward slash / also appears to find in-line graphics. This appears to be an unintentional bug.
^l New line
^m Manual Page Break
^n Column break
^t Tab
^p Paragraph Mark
^s Non-breaking space
^w White space (space, non-breaking space, tab); not valid in the Replace box
^nnn Where “n” is an ASCII character number; ^32 and ^032 will both represent a space character, but ^147 will represent ô and ^0147 will represent “
^0nnn See above (Produces ASCII on Macintosh).
^unnnn Unicode character search where “n” is a decimal number corresponding to the Unicode character value.
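
The difference between ^147 and ^0147 is consistent with the number 147 being read in two different character sets. The sketch below assumes the three-digit form uses the DOS code page 437 and the four-digit form uses Windows-1252; that assumption is inferred from the two characters quoted above, not taken from Word documentation. It also shows how to build a ^u code from a character.

```python
# Byte value 147 decoded in two code pages (assumed, see note above).
oem = bytes([147]).decode("cp437")    # o with circumflex
ansi = bytes([147]).decode("cp1252")  # left double quotation mark

# ^unnnn takes the decimal Unicode value, e.g. for an em space (U+2003).
em_space = "\u2003"
unicode_code = "^u" + str(ord(em_space))

print(oem, ansi, unicode_code)
```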

Examples

Example 1 – Transpose first name and surname

There are many occasions when you are presented with blocks of text or numbers etc., where the order of the text is not what you might require in the final document. Swapping the placement of forename and surname as above is one such example – and don’t forget you can add to the replacement, even when using bracketed replacements e.g. you may wish John Smith to appear as Smith, John or, more likely, you may have a column of names in a table, where you wish to exchange all the surnames with all the forenames.

John Smith to become Smith John

You could do them one at a time, but by replacing the names with wildcards, you can do the lot in one pass.
Let’s then break up the names into logical sequences that can only represent the names.
At its simplest, we have here two words – John and Smith. They can be represented by <*>[space]<*> – where [space] is a single press of the spacebar.
Add the round brackets  (<*>)[space](<*>) and replace with \2[space]\1
Run the search on the column of names and all are swapped. Run it again and they are swapped back.
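
The same swap can be tried on a list of names in Python. This is a sketch of the equivalent regular expression, assuming each name is exactly two words; \w+ stands in for Word's <*>. Running the swap twice restores the original order, as described above.

```python
import re

# Two words separated by one space, each captured as a group.
NAME = re.compile(r"\b(\w+) (\w+)\b")
names = ["John Smith", "Jane Doe"]

swapped = [NAME.sub(r"\2 \1", n) for n in names]
restored = [NAME.sub(r"\2 \1", n) for n in swapped]

# Text can be added to the replacement, e.g. a comma after the surname.
with_comma = [NAME.sub(r"\2, \1", n) for n in names]

print(swapped)
print(restored)
print(with_comma)
```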

Example 2 – Transposing dates

This could be the changing of UK format dates to US format dates – or vice versa.
7th August 2001 to August 7th, 2001
To show how most of the wildcards can be combined in one search to convert any UK date in the above format to its US equivalent, the following search pattern will do the trick:

[0-9]{1,2}[dhnrst]{2} <[AFJMNSOD]*> [0-9]{4}

Breaking it down: [0-9] looks for any single digit, but the day can have one or two digits, so we use the count function to allow both: [0-9]{1,2}

Next bit is the ordinal ‘th’ – Ordinals will be ‘st’, ‘nd’, ‘rd’ or ‘th’ so identify those letters specifically: [dhnrst]

There will always be two letters, so restrict the count to 2: [dhnrst]{2}

Next comes the space. You can insert a space [space]

The month always begins with one of the following capital letters – AFJMNSOD. We don’t know how many letters this month has so we can use the blanket ‘*‘ to represent the rest. And we are only interested in that word so we will tie it down with <> brackets: <[AFJMNSOD]*>

There’s another space [space] followed by the year. The years here have four numbers so [0-9]{4}

Finally add the round brackets to provide a logical breakup of the sequence:
([0-9]{1,2}[dhnrst]{2})[space](<[AFJMNSOD]*>)[space]([0-9]{4})
and replace with \2[space]\1,[space]\3
to re-order the sequence.
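
The finished pattern translates almost directly into a conventional regular expression. The Python version below is a sketch under the assumption that month names consist of a capital letter followed by lower case letters; uk_to_us is a name made up for this example.

```python
import re

# Day with ordinal, month starting with one of AFJMNSOD, four-digit year.
UK_DATE = re.compile(r"\b([0-9]{1,2}[dhnrst]{2}) ([AFJMNSOD][a-z]*) ([0-9]{4})\b")


def uk_to_us(text: str) -> str:
    """Reorder '7th August 2001' style dates to 'August 7th, 2001'."""
    return UK_DATE.sub(r"\2 \1, \3", text)


converted = uk_to_us("Signed on 7th August 2001 and 21st December 1999.")
print(converted)
```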

Example 3 – Adding or removing the period in salutations (Mr or Mr.)

Assume you are parsing addresses and wish to separate the honorific from the name. American usage puts a full stop (period) at the end (“Mr.”, “Mrs.”, “Dr.”) while British usage often omits the full stop.

([DM][rs]{1,2})( )

will find Mr, Mrs and Dr without the stop, and

\1.\2

will put one in. Or vice versa:

([DM][rs]{1,2}).

will find Mr., Mrs. and Dr. with the stop, and

\1

will take it out.
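
Both directions can be tried in Python as a sketch. One difference to keep in mind: in a conventional regular expression the full stop is a special character and has to be escaped when searching, whereas the Word patterns above use it literally.

```python
import re

# Honorific followed by a space: add the full stop.
ADD_STOP = re.compile(r"\b([DM][rs]{1,2})( )")
# Honorific followed by a literal full stop: remove it.
REMOVE_STOP = re.compile(r"\b([DM][rs]{1,2})\.")

british = "Mr Smith and Mrs Jones saw Dr Who"
american = ADD_STOP.sub(r"\1.\2", british)
back = REMOVE_STOP.sub(r"\1", american)

print(american)
print(back)
```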

Example 4: Duplicate paragraphs (and rows)

(*^13)\1\1 will match any sequence of three identical paragraphs.
If you replace:
(*^13)\1
with
\1
it will delete all consecutive duplicate paragraphs in the document. Repeat until nothing is found, to delete all duplicate paragraphs in the document (as long as you have sorted the text first).
To delete duplicate rows in a table (provided you have no merged cells), you can convert the table to text (Table + Convert to Text, using a tab delimiter); delete the duplicate paragraphs using the above method, then convert the text back to a table.
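
The "repeat until nothing is found" procedure can be sketched in Python on plain text. The sketch assumes paragraphs end in a newline character rather than Word's ^13, and drop_consecutive_duplicates is a name invented for the example.

```python
import re

# A whole line (including its newline) followed by an identical line.
DUPLICATE = re.compile(r"^(.*\n)\1", flags=re.MULTILINE)


def drop_consecutive_duplicates(text: str) -> str:
    """Replace each doubled paragraph by a single copy until none remain."""
    while True:
        reduced = DUPLICATE.sub(r"\1", text)
        if reduced == text:  # nothing found, so we are done
            return reduced
        text = reduced


cleaned = drop_consecutive_duplicates("a\na\na\nb\nb\nc\n")
print(repr(cleaned))
```

A single pass turns three identical paragraphs into two, which is why the replacement has to be repeated.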

Example 5: Tags

\<([!\<>]@>)*\</\1
will match any well-formed XML element including start-tag and end-tag such as:

<p>some text </p>

or

<customer-name>John Smith</customer-name>

Alternatively, to delete every tag, enter the following string in the Find What box:
\<*\>
Leave the Replace With box empty.
Tick the “Use Wildcards” box.
Click Replace All.
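
For comparison, both ideas (matching a whole element by reusing the tag name, and stripping every tag) can be written as conventional regular expressions. The Python sketch below is an equivalent under the assumption of simple tags without attributes; it is not a translation of the exact Word strings.

```python
import re

# Start-tag, any content, and an end-tag that repeats the captured name.
ELEMENT = re.compile(r"<([^<>\s/]+)>.*?</\1>", flags=re.DOTALL)
# Any single tag: everything from < to the next >.
TAG = re.compile(r"<[^<>]*>")

samples = ["<p>some text </p>", "<customer-name>John Smith</customer-name>"]

matched = [ELEMENT.fullmatch(s) is not None for s in samples]
mismatched = ELEMENT.fullmatch("<p>some text</q>") is not None
stripped = [TAG.sub("", s) for s in samples]

print(matched, mismatched)
print(stripped)
```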


Example 6: Formatting

By building up appropriate patterns, you can search for almost any combination of characters.
Of course you can also restrict your searches by specifying some style or formatting, or add formatting for the replacement text. See “Finding and replacing non-printing characters (such as paragraph marks), other special characters, and text formatting” for more on this.
A nice trick if you want to apply formatting to a part (but not all) of the search text is to put in “tags” in a first replacement.
In a find/replace, you can only change the formatting of the whole find-text; so you would need to do two find-replaces to get the job done. In the first, you would “tag” the text that has to be formatted; in the second, you format them (and remove the tags).

Find what:
(something)(something else)(another string)

Replace with:
\1$$\2##\3

and then remove the tags and apply the formatting in a second replace:

Find what:
$$(*)##

Replace with:
\1 ((bold))
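
The two-pass tagging idea can be illustrated in Python. Since plain text has no formatting, the sketch uses <b> markup as a stand-in for applying bold in Word; the sample text and patterns are made up for the illustration.

```python
import re

text = "Chapter 12: Introduction"

# Pass 1: wrap the middle group in temporary $$ and ## tags.
tagged = re.sub(r"(Chapter )([0-9]+)(:)", r"\1$$\2##\3", text)

# Pass 2: find the tagged part, drop the tags and "format" what was inside.
formatted = re.sub(r"\$\$(.*?)##", r"<b>\1</b>", tagged)

print(tagged)
print(formatted)
```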

Imagine, for instance, that you’ve got a text file, and headings are marked up by having 3 empty paragraphs before them and an empty paragraph after.

Find what:
^13{4}([!^13]@^13)^13

Replace with:
^p<H1>\1

Then find what:
\<H1\>(*) 

Replace with: 
\1 ((style “Heading 1”))

This will remove the empty paragraphs and format the headings.
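
The first of these two replacements can be tried on plain text in Python. The sketch assumes newline characters in place of Word's ^13 paragraph marks; it collapses the empty paragraphs and marks the heading, and then simply lists the marked headings instead of applying a style.

```python
import re

# One paragraph end plus three empty paragraphs, the heading, one empty paragraph.
text = "Intro text\n\n\n\nHeading\n\nBody\n"

marked = re.sub(r"\n{4}([^\n]+\n)\n", r"\n<H1>\1", text)
headings = re.findall(r"<H1>(.*)", marked)

print(repr(marked))
print(headings)
```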

BEWARE:

Sometimes Word will get confused if it encounters “escaped” brackets \( or \). For example, “(\\)” will match *any* character, not only a backslash.
Workaround: use “([\\])” instead.

([a-z]\() throws an error – it should find e.g. “a(”.
Workaround: Use ([a-z][\(]) instead.

Not a bug but still annoying: You have to escape any special character even if you type its code; so ^92 will have the same problems as typing the backslash.

The construction {0,} (find zero or more of the preceding item) is refused as incorrect syntax. This concept is available in Unix regular expression matching, so it’s a curious omission.

You don’t always have to “escape” the special characters, if the context makes it clear that the special meaning isn’t wanted. [abc-] matches “-”, and [)(] matches “)” or “(”. This may sometimes make your searches behave differently from what you expected.

More Examples – Greek

More Examples – Cyrillic

Sources