Java: Regex Language Intro

Regular expressions are a programming language for describing patterns in strings. At the syntax level, it's important to understand which characters are metacharacters (have a special meaning) and which are literal characters (stand for themselves). At the semantic level, several basic concepts are important: character classes, quantifiers, boundaries, grouping, and alternation. These fundamental regex elements apply to nearly all implementations, and will solve most of your regex needs.

Metacharacters

The characters that have special meaning are called metacharacters. A preceding backslash ("\") turns a metacharacter into a literal character. The set of metacharacters inside a character class, i.e. between [ and ], is different.

Char  Meaning
\     Turns metacharacters into literal characters, and some literal characters into metacharacters. Because this is also the Java escape character in string literals, it must be doubled there.
[     Starts a character class definition.
(     Starts a group.
{     Starts a repetition count, written {min,max} with no spaces.
^     Matches the boundary at the beginning. Class negation when immediately after [.
$     Matches the boundary at the end.
.     Matches any single character (a line terminator only in DOTALL mode).
?     Preceding element must match zero or one time.
*     Preceding element must match zero or more times.
+     Preceding element must match one or more times.
|     Either the preceding or the following element must match.
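
The doubled backslash is easiest to see in a small experiment. The following is a minimal sketch, not a definitive recipe; the class name EscapeDemo and the helper matchesAll are hypothetical names chosen for illustration, while Pattern.matches and Pattern.quote are standard java.util.regex calls.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped '.' is a metacharacter: it matches any single character.
        System.out.println(matchesAll("a.c", "abc"));    // true
        // "\\." in Java source is the regex \. and matches only a literal dot.
        System.out.println(matchesAll("a\\.c", "abc"));  // false
        System.out.println(matchesAll("a\\.c", "a.c"));  // true
        // Unescaped '+' is a quantifier, so this regex does not match the text "1+1=2".
        System.out.println(matchesAll("1+1=2", "1+1=2"));                // false
        // Pattern.quote() makes every character of its argument literal.
        System.out.println(matchesAll(Pattern.quote("1+1=2"), "1+1=2")); // true
    }
}
```

Pattern.quote() is a convenient alternative when a whole string, such as user input, should be matched literally.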

Boundaries

A boundary is a position, not a character: the position between two characters, or at the beginning or end of the input. Matching a boundary consumes no characters. The two most commonly used boundaries are ^ (matches at beginning) and $ (matches at end).

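Code  Meaning
^     Beginning of a line. Unless MULTILINE mode is on, this is the beginning of the input.
$     End of a line. Unless MULTILINE mode is on, this is the end of the input.
\A    Beginning of the input.
\z    End of the input.
\Z    End of the input, ignoring the final line terminator, if any.
\G    End of the previous match (to indicate where a new match should start).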
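
The difference between these boundaries shows up when the input contains line terminators. This is a small sketch under assumed names (BoundaryDemo and count are hypothetical); it counts how often a pattern is found, with and without Pattern.MULTILINE.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: counts how many times the regex is found in the input.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\n";
        // Default mode: ^ matches only at the very start of the input.
        System.out.println(count("^\\w+", 0, text));                 // 1
        // MULTILINE mode: ^ also matches just after each line terminator.
        System.out.println(count("^\\w+", Pattern.MULTILINE, text)); // 2
        // \z demands the true end of input, so the trailing newline blocks it.
        System.out.println(count("two\\z", 0, text));                // 0
        // \Z ignores one final line terminator.
        System.out.println(count("two\\Z", 0, text));                // 1
    }
}
```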

Character classes

A character class defines a set of characters. It matches exactly one character unless it is followed by a quantifier specifying how many.
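
As a rough illustration of the "exactly one character" rule, the sketch below tries a few hand-written classes. The names CharClassDemo and matchesAll are hypothetical; the classes themselves ([aeiou], a negated class, and a range) are ordinary regex syntax.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // A class matches one character from its set.
        System.out.println(matchesAll("[aeiou]", "e"));    // true
        System.out.println(matchesAll("[aeiou]", "ee"));   // false: two characters
        // A quantifier after the class lets it repeat.
        System.out.println(matchesAll("[aeiou]+", "ee"));  // true
        // ^ immediately after [ negates the class.
        System.out.println(matchesAll("[^aeiou]", "x"));   // true
        // Ranges can be combined inside one class.
        System.out.println(matchesAll("[a-f0-9]", "c"));   // true
        // Inside a class, '.' is a literal dot, not "any character".
        System.out.println(matchesAll("[.]", "x"));        // false
    }
}
```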

Predefined character classes

Notice the uppercase class is the negation of the lowercase class.

Code   Matches
.      Any character (a line terminator only in DOTALL mode).
\d     A digit. Same as [0-9]
\D     A non-digit. Same as [^0-9] or [^\d]
\s     A whitespace character. Same as [ \t\n\x0B\f\r]
\S     A non-whitespace character. Same as [^\s]
\w     A "word" character. Same as [a-zA-Z0-9_]. It includes the underscore, which not all regex libraries do. By default it does NOT include non-ASCII Unicode characters (see \p{L} below).
\p{L}  Unicode letters.
\W     A non-word character. Same as [^\w]
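
The ASCII-only behaviour of \w is worth checking directly. The following sketch assumes default flags (no UNICODE_CHARACTER_CLASS); PredefinedClassDemo and matchesAll are hypothetical names.

```java
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesAll("\\d+", "2024"));         // true
        System.out.println(matchesAll("\\D+", "2024"));         // false
        System.out.println(matchesAll("\\s", "\t"));            // true
        // The underscore counts as a word character.
        System.out.println(matchesAll("\\w+", "snake_case1"));  // true
        // With default flags \w is ASCII-only, so the accented letter is rejected.
        System.out.println(matchesAll("\\w+", "caf\u00e9"));    // false
        // \p{L} accepts letters from any script.
        System.out.println(matchesAll("\\p{L}+", "caf\u00e9")); // true
    }
}
```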

Quantifiers

An element, X, which may be a literal character, a character class, or a group, may be followed by a quantifier, which indicates how often it should be matched.

Quantifiers are classified as greedy or lazy. Greedy quantifiers try to match as much as possible, and reduce the amount they match only if forced to by later failures. Lazy quantifiers match as little as possible, and only expand if required by a later failure. Unlike some regex libraries, Java also supports possessive quantifiers (written by appending +, as in X*+), which are not only greedy, but won't give back anything they've matched. They can provide a speed advantage in some circumstances.

Code     Meaning
X?       X must match zero or one time. Greedy.
X*       X must match zero or more times. Greedy.
X+       X must match one or more times. Greedy.
X{n}     X must match exactly n times.
X{n,}    X must match at least n times. Greedy.
X{n,m}   X must match at least n times, but no more than m times. Greedy.
X??      X must match zero or one time. Lazy.
X*?      X must match zero or more times. Lazy.
X+?      X must match one or more times. Lazy.
X{n,}?   X must match at least n times. Lazy.
X{n,m}?  X must match at least n times, but no more than m times. Lazy.
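
The greedy, lazy, and possessive behaviours can be compared on one input. This is only a sketch; QuantifierDemo and firstMatch are hypothetical names, and the sample text is made up for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: the first substring the regex finds, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b> and <i>italic</i>";
        // Greedy: .* takes everything, then backs off to the last '>'.
        System.out.println(firstMatch("<.*>", html));   // <b>bold</b> and <i>italic</i>
        // Lazy: .*? grows only until the first '>' lets the match succeed.
        System.out.println(firstMatch("<.*?>", html));  // <b>
        // Possessive: .*+ takes everything and gives nothing back, so '>' never matches.
        System.out.println(firstMatch("<.*+>", html));  // null
        // Counted repetition, greedy and lazy.
        System.out.println(firstMatch("\\d{2,3}", "12345"));   // 123
        System.out.println(firstMatch("\\d{2,3}?", "12345"));  // 12
    }
}
```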

Grouping

Code  Meaning
(X)   This matches X as usual, and it also records the beginning and end of the substring that X matches. This forms a group that can be used in one of three ways:
  • Matcher methods can be called to get the number of groups, a particular group by number, or the beginning and end character index of any group.
  • Back references can be made inside a pattern to match previous groups that were matched. These references are of the form \n, where n is the number of a previous group.
  • The replacement string passed to Matcher methods such as appendReplacement() and replaceAll() may reference groups using $n.

Group 0 is the entire match. For other groups, the number of the group corresponds to the number of the left parenthesis in the regex when counting from the left, starting at one.

When a group is followed by a quantifier, the group records only the last repetition. Enclose the quantified group in another group if you want all the repetitions in one group.
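
The three uses of groups, and the last-repetition rule, can be sketched as follows. GroupDemo and its helpers (swap, hasDoubledWord, group1) are hypothetical names for illustration; the Matcher methods groupCount(), group(), start(), end(), and replaceAll() are standard.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Replacement use: $1 and $2 refer to the captured groups.
    static String swap(String input) {
        return Pattern.compile("(\\w+)=(\\w+)").matcher(input).replaceAll("$2=$1");
    }

    // Back reference use: \1 must repeat whatever group 1 matched.
    static boolean hasDoubledWord(String input) {
        return Pattern.compile("\\b(\\w+) \\1\\b").matcher(input).find();
    }

    // Returns group 1 of the first match, or null if there is no match.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Matcher method use: count, text, and position of groups.
        Matcher m = Pattern.compile("(\\d{4})-(\\d{2})").matcher("date: 2024-05");
        if (m.find()) {
            System.out.println(m.groupCount());               // 2
            System.out.println(m.group(0));                   // 2024-05
            System.out.println(m.group(1));                   // 2024
            System.out.println(m.start(2) + ".." + m.end(2)); // 11..13
        }
        System.out.println(swap("key=value"));                // value=key
        System.out.println(hasDoubledWord("this is is odd")); // true
        // A quantified group keeps only its last repetition.
        System.out.println(group1("(\\d)+", "123"));          // 3
        // An outer group around the quantifier captures all repetitions.
        System.out.println(group1("((\\d)+)", "123"));        // 123
    }
}
```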

Alternation

Code  Meaning
X|Y   Tries to match X. If that fails, it tries to match Y.
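
Because X is tried before Y, the order of alternatives can change the result. The sketch below shows this, and also that | applies to everything on either side of it unless a group limits its scope. AlternationDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    // Hypothetical helper: the first substring the regex finds, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The left alternative is tried first and wins if it succeeds.
        System.out.println(firstMatch("cat|category", "category")); // cat
        System.out.println(firstMatch("category|cat", "category")); // category
        // A group limits the alternation to "a" or "e".
        System.out.println(Pattern.matches("gr(a|e)y", "grey"));    // true
        // Without the group, the choice is between "gra" and "ey".
        System.out.println(Pattern.matches("gra|ey", "grey"));      // false
    }
}
```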