The Strange and Peculiar World of Emacs Regular Expressions

<2023-05-20 Sat>

The aim of this video is to provide you with clarity and some signposts that will help you traverse the tricky terrain of regular expressions in Emacs. Regular expressions are a difficult topic in themselves, and they are harder still because Emacs handles them differently from other text editors. One needs to understand "the Emacs way" to use them effectively. Emacs regexps are frustrating at one level and powerful at another. The goal of this video is to build a complex regular expression and use it for a search and replace in an Emacs buffer.

A note on escapes

When writing regular expressions in Emacs Lisp code, every backslash must be doubled (\\), because the Lisp reader consumes one backslash before the regular expression engine sees the string. However, when entering regular expressions INTERACTIVELY, such as with the commands isearch-forward-regexp, query-replace-regexp, replace-regexp, re-builder (once reb-re-syntax is set to 'string), or any other interactive command in Emacs, you only need a single backslash. This is because you are not dealing with the Lisp reader in these cases but passing the string directly to the regular expression engine.
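The difference can be seen in a minimal sketch, assuming you evaluate it as Emacs Lisp (for example in the *scratch* buffer). The variable name my-dot-regexp is made up for illustration.

```elisp
;; In Lisp source the backslash is doubled: the Lisp reader turns "\\."
;; into the two-character regexp \. (a literal dot).
(defvar my-dot-regexp "\\."
  "Hypothetical example variable: regexp matching a literal dot.")

;; The regexp engine only ever sees two characters.
(length my-dot-regexp)                 ; 2

;; The escaped dot finds the literal dot at index 1.
(string-match my-dot-regexp "a.b")     ; 1

;; An unescaped dot matches any character, so it matches at index 0.
(string-match "." "a.b")               ; 0

;; Typed interactively (e.g. after C-M-s) the same regexp is simply:  \.
```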

Two types of characters

In regular expression syntax there are two types of characters: literal (ordinary) and special. An ordinary character matches that same character and nothing else: "l" matches "l" and nothing else. Special characters have a special meaning. The special characters are:

$^.*+?[\

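The character ] is special if it ends a character alternative.

The character - is special inside a character alternative.

A backslash changes the meaning of the character that follows it: it makes a special character literal (\. matches a dot) and it turns certain ordinary characters into special constructs (\| \( \) \{ \}). Such a character is said to be "escaped".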

The special characters

. matches any single character except a newline

e.g. a.d will find "and", "a d", "a,d", etc.

* matches the immediately preceding character (or expression) zero or more times

e.g. o* finds

"o"

"oo"

"ooo"

"ooooooooooooo"

Another example: ca*t would match "ct", "cat", "caat", etc.

Note: Because * allows zero repetitions, it can also match the empty string. You cannot see an empty match on the screen the way you see matched characters or words, but it is still a valid match, which is why ca*t finds "ct".

+ is similar to * except that it matches the immediately preceding character one or more times. e.g. o+ finds
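A short sketch of how * behaves, written as Emacs Lisp forms you could evaluate one at a time. It uses only the built-in string-match, string-match-p and match-string; the sample strings are invented.

```elisp
;; ca*t: "c", then zero or more "a", then "t".
(string-match-p "ca*t" "ct")       ; 0   (zero a's)
(string-match-p "ca*t" "caaat")    ; 0   (three a's)
(string-match-p "ca*t" "cot")      ; nil ("o" is not "a")

;; o* may match nothing at all, so it succeeds at index 0 even in a
;; string that contains no "o"; the matched text is the empty string.
(string-match "o*" "xyz")          ; 0
(match-string 0 "xyz")             ; ""
```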

"o"

"oo"

"ooo"

"ooooooooooooo"

? matches the preceding character (or expression) once or not at all.

Hence ca?r will find "car" and "cr" and nothing else.

*?, +?, ?? are non-greedy alternatives to *, + and ?. By themselves, *, + and ? are greedy: they match as much as they can. With a following ? they match as little as possible. (We are always less greedy with someone watching!)
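To make the greedy/non-greedy difference concrete, here is a small sketch. The helper my-first-match is hypothetical (it is not part of Emacs) and simply returns the text of the first match.

```elisp
(defun my-first-match (regexp text)
  "Return the first match of REGEXP in TEXT, or nil.
Hypothetical helper, for illustration only."
  (when (string-match regexp text)
    (match-string 0 text)))

;; Greedy: .* runs on to the LAST ">" in the text.
(my-first-match "<.*>" "<b>bold</b>")     ; "<b>bold</b>"

;; Non-greedy: .*? stops at the FIRST ">".
(my-first-match "<.*?>" "<b>bold</b>")    ; "<b>"

;; The same applies to + and +?.
(my-first-match "o+" "oooh")              ; "ooo"
(my-first-match "o+?" "oooh")             ; "o"
```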

{} denotes the number of repetitions of the immediately preceding regular expression.

e.g. [0-9]{4} is meant to match 4 digits, but in Emacs one needs to add escapes:

[0-9]\{4\}

e.g. 1969
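A quick check of the escaped and unescaped forms, as a sketch in Emacs Lisp. Remember that inside a Lisp string each backslash is doubled, so \{4\} is written \\{4\\}; the sample strings are invented.

```elisp
;; Escaped braces: exactly four digits in a row.
(string-match "[0-9]\\{4\\}" "year 1969")       ; 5
(match-string 0 "year 1969")                    ; "1969"

;; Unescaped braces are ordinary characters: this regexp means
;; "one digit followed by the literal text {4}".
(string-match-p "[0-9]{4}" "year 1969")         ; nil
(string-match-p "[0-9]{4}" "x9{4}")             ; 1
```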

^ When used as the first character in a character set, ^ has the special meaning of any character except those in the set.

[^0-9]

^ Outside a character set, ^ matches only at the beginning of a line

e.g. ^NOODLE finds NOODLE only when it starts a line

$ matches only at the end of a line
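The two meanings of ^ and the meaning of $ can be checked with a few forms. This is a sketch using the built-in string-match-p on invented sample strings; in a string, a "line" starts at the beginning of the string or after a newline.

```elisp
;; [^0-9]: the first character that is NOT a digit.
(string-match-p "[^0-9]" "2023x")            ; 4

;; ^ anchors the match to the start of a line ...
(string-match-p "^NOODLE" "NOODLE soup")     ; 0
(string-match-p "^NOODLE" "a NOODLE")        ; nil
;; ... including a line that begins after a newline.
(string-match-p "^NOODLE" "soup\nNOODLE")    ; 5

;; $ anchors the match to the end of a line.
(string-match-p "soup$" "NOODLE soup")       ; 7
(string-match-p "soup$" "soup NOODLE")       ; nil
```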

Character sets

[] denotes a character set. A character set is a set of characters enclosed between square brackets.

Examples:

[ad] single a's or d's in a buffer
[a-z] all lower case letters
[A-Z] all upper case letters
[a-zA-Z] ALL letters
[0-9] The digits 0-9
[a-zA-Z0-9] All letters and digits.

Note: In Emacs one cannot use \d for a digit (as in Python and Perl). One must use [0-9] or [[:digit:]].
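A few of these sets in action, as a sketch in Emacs Lisp. Searches in Emacs normally ignore case (case-fold-search), so the first form binds that variable to nil to make [A-Z] strictly upper case; the sample strings are invented.

```elisp
;; With case folding switched off, [A-Z] finds only the capital L.
(let ((case-fold-search nil))
  (string-match-p "[A-Z]" "emacs Lisp"))     ; 6

;; One or more digits.
(string-match-p "[0-9]+" "abc 42")           ; 4

;; \d is not a digit class in Emacs: it simply matches the letter d.
(string-match-p "\\d" "abc 42")              ; nil
(string-match-p "\\d" "radio")               ; 2
```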

Knowing which character to escape

The following characters must be escaped in Emacs to get their special meaning; unescaped, they are treated literally:

| denotes alternatives and must be escaped, e.g. John\|Sally

() must be escaped to serve as a capturing group, i.e. \( and \)

{} must be escaped to denote repetition, i.e. \{ and \}

[] do not need to be escaped.

/ does not need to be escaped.

\ does have to be escaped: to match a literal backslash, type \\
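The rules above can be tried out as follows. This is a sketch in Emacs Lisp, so every backslash is doubled once more for the Lisp reader (\| becomes \\| and a literal backslash, \\, becomes \\\\); the sample strings are invented.

```elisp
;; \| is alternation; a bare | is just a vertical bar.
(string-match-p "John\\|Sally" "Hello Sally")    ; 6
(string-match-p "John|Sally" "Hello Sally")      ; nil
(string-match-p "John|Sally" "John|Sally")       ; 0

;; \( \) form a group; bare parentheses are literal characters.
(string-match-p "\\(ab\\)+" "ababab")            ; 0
(string-match-p "(ab)+" "ababab")                ; nil

;; A literal backslash: regexp \\ , which is "\\\\" in Lisp source.
;; The Lisp string "C:\\temp" holds the text C:\temp.
(string-match-p "\\\\" "C:\\temp")               ; 2
```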

Match Constructs

Emacs supports match constructs: special patterns that match particular positions or kinds of text. This is what makes Emacs regexes especially powerful.

NB: In match constructs \ is not considered an escape but is part of the construct.

\< and \> match the beginning and end of a word, e.g. \<world\> finds world but not worldly

\_< and \_> match the beginning and end of a symbol

\scode (syntax code) Matches characters belonging to a syntax class.

\scode is extremely useful. For example, to find all whitespace characters one would type \s- (where s stands for syntax and - stands for any whitespace character)

Other examples of syntax codes include:

- Word constituents \sw
- Symbol constituents \s_ (used most often in programming)
- Open and close parentheses \s( and \s) will find pairs (but not pears!)
- Punctuation class \s.
- String characters \s" i.e. the quote characters that mark the start and end of a string, e.g. the double quotes around "The man ran from Fran to his Van amid the clang."
- Comment boundaries \s< and \s> e.g. the ;; that starts a comment in a .emacs file

Since Emacs can search for syntax classes it is helpful to know what class a given character belongs to. One can find this out by putting the cursor on any character and invoking C-u C-x =

\Scode (with the capital letter "S") matches characters not belonging to the syntax class identified. For example, \S- will search for all characters that are NOT whitespace.
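The word-boundary and syntax-class constructs above can be sketched in Emacs Lisp as below. Because these constructs depend on the syntax table in effect, the sketch evaluates everything under Emacs's standard syntax table; the function name my-syntax-demo and the sample strings are made up for illustration.

```elisp
(defun my-syntax-demo ()
  "Return the results of a few syntax-aware matches.
Hypothetical demo function, for illustration only."
  ;; Use the standard syntax table so the results do not depend on
  ;; the major mode of the current buffer.
  (with-syntax-table (standard-syntax-table)
    (list
     (string-match-p "\\<world\\>" "hello world")    ; 6   a whole word
     (string-match-p "\\<world\\>" "worldly goods")  ; nil no word end after "world"
     (string-match-p "\\s-" "foo bar")               ; 3   first whitespace
     (string-match-p "\\S-" "   foo")                ; 3   first non-whitespace
     (string-match-p "\\s." "hello, world")          ; 5   the comma (punctuation)
     (string-match-p "\\sw+" "  hello"))))           ; 2   first word constituent

(my-syntax-demo)   ; (6 nil 3 3 5 2)
```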

Capturing Groups or Back-references

A capturing group marks out text that you will later wish to reference.

It involves putting characters into groups:

foobarnitwit

e.g. (foo)(bar)(nitwit), which in Emacs must be typed with escapes: \(foo\)\(bar\)\(nitwit\)

One then refers back to these groups using syntax like: \1 \2 \3

So, one marks the text using parentheses and references it later with a backslash and a digit (1-9). In the replacement text one can also use \? which will prompt for input text, and \& which stands for the whole matched text.

Note that in back-references \ is considered an escape character.

Note: Characters that are put into groups can be modified or left as is.
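As a sketch of back-references outside the interactive commands, the built-in function replace-regexp-in-string accepts the same \1, \2, \3 and \& in its replacement text (doubled here for the Lisp reader). \? is only meaningful in the interactive replace commands, so it is not shown.

```elisp
;; Reorder the three groups: \3 \2 \1 refer back to them.
(replace-regexp-in-string "\\(foo\\)\\(bar\\)\\(nitwit\\)"
                          "\\3-\\2-\\1"
                          "foobarnitwit")        ; "nitwit-bar-foo"

;; \& stands for the whole match, whether or not groups are used.
(replace-regexp-in-string "bar" "[\\&]" "foobarnitwit")
                                                 ; "foo[bar]nitwit"
```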

How to find/replace using regular expressions in Emacs

To find a string using regex use:

C-M-s isearch-forward-regexp

C-M-r isearch-backward-regexp

To find and replace use either:

query-replace-regexp, bound to C-M-% (asks for confirmation at each match)

replace-regexp (replaces every match without asking)
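The commands above are meant for interactive use. If you want the same effect from Lisp code, one common approach is a loop over re-search-forward and replace-match. The following is only a sketch under that assumption; the helper name my-replace-all and the sample text are hypothetical.

```elisp
(defun my-replace-all (regexp replacement)
  "Replace every match of REGEXP after point with REPLACEMENT.
Hypothetical helper, for illustration only."
  (while (re-search-forward regexp nil t)
    ;; replace-match understands \1, \2 ... and \& in REPLACEMENT.
    (replace-match replacement)))

;; Try it in a temporary buffer so no real file is touched.
(with-temp-buffer
  (insert "cat cot cut")
  (goto-char (point-min))
  (my-replace-all "c\\(.\\)t" "b\\1g")
  (buffer-string))                     ; "bag bog bug"
```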

To test a regular expression:

Use re-builder

Set the variable reb-re-syntax to 'string, so that backslashes do not have to be doubled:

(setq reb-re-syntax 'string)

Example text to convert:

1966/08/20
2023/02/12

Desired result:

<1966-08-20>
<2023-02-12>

How to type complex expressions

First we type the string out without the escapes:

"([0-9]{4})/([0-9]{2})/([0-9]{2})"

Then we add the escapes:

"\([0-9]\{4\}\)/\([0-9]\{2\}\)/\([0-9]\{2\}\)"

Replacement string:

"<\1-\2-\3>"
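The same date conversion can be sketched in Emacs Lisp, where every backslash of the regexp and of the replacement string is doubled for the Lisp reader. The names my-date-regexp and my-convert-dates are hypothetical and exist only for this example.

```elisp
(defvar my-date-regexp
  "\\([0-9]\\{4\\}\\)/\\([0-9]\\{2\\}\\)/\\([0-9]\\{2\\}\\)"
  "Hypothetical example: regexp for dates such as 1966/08/20.
Group 1 is the year, group 2 the month, group 3 the day.")

(defun my-convert-dates (text)
  "Return TEXT with YYYY/MM/DD dates rewritten as <YYYY-MM-DD>.
Hypothetical helper, for illustration only."
  (replace-regexp-in-string my-date-regexp "<\\1-\\2-\\3>" text))

(my-convert-dates "1966/08/20\n2023/02/12")
;; "<1966-08-20>\n<2023-02-12>"
```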

Character Classes

[:digit:] a digit, same as [0-9]

[:alpha:] a letter (an alphabetic character)

[:alnum:] a letter or a digit (an alphanumeric character)

[:upper:] a letter in uppercase

[:lower:] a letter in lowercase

[:graph:] a visible character

[:print:] a visible character plus the space character

[:space:] a whitespace character, as defined by the syntax table, but typically [ \t\r\n\v\f], which includes the newline character

[:blank:] a space or tab character

[:xdigit:] a hexadecimal digit

[:cntrl:] a control character

[:ascii:] an ASCII character

[:nonascii:] any non-ASCII character

Note: When you want to use character classes, you have to use them within ANOTHER set of square brackets, e.g. [[:digit:]]. These additional brackets define the character set.
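The need for the second pair of brackets is easy to verify. This is a sketch in Emacs Lisp on invented sample strings: with single brackets, [:digit:] is read as an ordinary character set containing the characters : d i g t.

```elisp
;; Correct: the class sits inside a character set.
(string-match-p "[[:digit:]]" "abc7")        ; 3

;; Single brackets: just the set of characters : d i g t
(string-match-p "[:digit:]" "abc7")          ; nil
(string-match-p "[:digit:]" "a:b")           ; 1

;; Classes can be mixed with other members of the set.
(string-match "[[:alpha:]_][[:alnum:]_]*" "  my_var2 = 1")   ; 2
(match-string 0 "  my_var2 = 1")                             ; "my_var2"
```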

Some further anomalies of Emacs regex

In an interactive search, a space character stands for one or more whitespace characters (tabs are whitespace characters). You can use M-s SPC while searching or replacing (e.g. after invoking C-M-s) to toggle between this behaviour and treating spaces as literal spaces. Or put the following in your .emacs to override this behaviour.

(setq search-whitespace-regexp nil)

You can enter a newline character using C-q C-j

Special examples

To strip Bible text of all references:

"[:[:digit:]]+ "
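As a sketch of what this regexp does, here it is applied to an invented sample line with replace-regexp-in-string and an empty replacement. The set [:[:digit:]] contains the colon plus every digit, so a run such as 1:2 followed by a space is removed.

```elisp
(replace-regexp-in-string "[:[:digit:]]+ " ""
                          "1:1 In the beginning 1:2 And the earth")
;; "In the beginning And the earth"
```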
