CIS 260 Regular expressions

Objective

Use regular expressions in Java

Regular expressions overview

Regular expressions provide a way of searching for patterns in text. They are considered extremely useful in modern programming, but were missing from Java until version 1.4 added the java.util.regex package. Before that, a number of third-party packages were created to provide regular expression functionality. While you may still see some of those in use, this course covers the standard implementation. It tends to be faster than the rest, and once you understand how regular expressions work, you can transfer that knowledge quickly to alternative implementations.

References

Summary of pattern constructs

Construct	Matches
c	any regular character will match itself
\t	tab
\n	newline (line feed)
\r	carriage return
\a	alarm (bell)
\e	escape
\\	backslash
\	used to escape next character (\{ -> {)
[abc]	matches a, b, or c
[^abc]	any character except a, b, or c
[a-zA-Z]	any letter in upper or lower case
[a-d[q-t]]	same as [a-dq-t] (union)
[a-p&&[n-z]]	n, o, or p (intersection)
[a-z&&[^mn]]	a through z, except for m and n (subtraction)
.	any single character (except a line terminator, unless DOTALL mode is set)
\d	[0-9]
\D	[^0-9]
\s	any whitespace character
\S	any non-whitespace character
\w	any word character [a-zA-Z0-9_]
\W	any non-word character
^	beginning of input (or of each line in MULTILINE mode)
$	end of input (or of each line in MULTILINE mode)
\b	a word boundary
\B	a non-word boundary
\p{Lower}	any lowercase character
\p{Upper}	any uppercase character
\p{Alpha}	any alphabetic character
\p{Digit}	any digit
\p{Alnum}	any digit or alphabetic character
\p{Punct}	any punctuation character
\p{Blank}	a space or tab
\p{Cntrl}	any control character
\p{XDigit}	any hexadecimal digit
\p{Space}	any whitespace character
a|b	matches either a or b (alternation)
()	used to group patterns
\1	backreference to first matched group
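
A few of the constructs above can be tried out with a short program. This is only a sketch: the class name ConstructDemo and the helper fullMatch are made up for illustration, and the sample strings are arbitrary. Note that inside a Java string literal every regex backslash has to be written as two backslashes.

```java
import java.util.regex.Pattern;

class ConstructDemo {
    // Hypothetical helper: true when the WHOLE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \d\d : two digits
        System.out.println(fullMatch("\\d\\d", "42"));                   // true
        // intersection: only n, o, or p are in both ranges
        System.out.println(fullMatch("[a-p&&[n-z]]", "o"));              // true
        System.out.println(fullMatch("[a-p&&[n-z]]", "q"));              // false
        // subtraction: a through z, except m and n
        System.out.println(fullMatch("[a-z&&[^mn]]", "m"));              // false
        // backreference: \1 must repeat whatever group 1 matched
        System.out.println(fullMatch("(\\w)\\1", "oo"));                 // true
        System.out.println(fullMatch("(\\w)\\1", "ab"));                 // false
        // POSIX classes: one uppercase letter, then lowercase letters
        System.out.println(fullMatch("\\p{Upper}\\p{Lower}+", "Java"));  // true
    }
}
```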

Summary of repetition constructs

Construct	Effect on previous item
?	match 0 or 1 time, greedy
*	0 or more times, greedy
+	1 or more times, greedy
{n}	exactly n times, greedy
{n,}	at least n times, greedy
{n,m}	at least n but not more than m times, greedy
??	match 0 or 1 time, non-greedy
*?	0 or more times, non-greedy
+?	1 or more times, non-greedy
{n}?	exactly n times, non-greedy
{n,}?	at least n times, non-greedy
{n,m}?	at least n but not more than m times, non-greedy
?+	match 0 or 1 time, super-greedy
*+	0 or more times, super-greedy
++	1 or more times, super-greedy
{n}+	exactly n times, super-greedy
{n,}+	at least n times, super-greedy
{n,m}+	at least n but not more than m times, super-greedy

You probably won't have to worry anytime soon about whether a quantifier is greedy, super-greedy (possessive), or reluctant, but here is what those terms mean:

greedy: Greedy is the default. A greedy quantifier first matches as much as it can. If the rest of the pattern then fails to match, it backs off, giving up one character at a time until an overall match can be made or until all attempts at matching fail.
reluctant: A reluctant quantifier first matches as little as it can. If the rest of the pattern then fails to match, it takes one additional character at a time until an overall match can be made or until all attempts at matching fail.
possessive (super-greedy): A possessive quantifier matches as much as it can and then refuses to back off. If the rest of the pattern fails to match, the whole attempt fails.
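
The difference between the three kinds of quantifier is easiest to see by running the same search with each one. The sketch below is illustrative only: the class name QuantifierDemo, the helper findAll, and the sample input are assumptions, not part of the course demo programs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: collect every substring that find() locates.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";

        // Greedy: .* grabs everything, then backs off just far enough for "foo".
        System.out.println(findAll(".*foo", input));   // [xfooxxxxxxfoo]

        // Reluctant: .*? grabs as little as possible, so there are two matches.
        System.out.println(findAll(".*?foo", input));  // [xfoo, xxxxxxfoo]

        // Possessive: .*+ grabs everything and never backs off, so "foo" cannot match.
        System.out.println(findAll(".*+foo", input));  // []
    }
}
```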

Using regular expressions

You can use regular expressions in a few different ways:

formal use of Pattern and Matcher classes:
    // needs java.util.regex.Pattern and java.util.regex.Matcher
    // look for hat or heat anywhere in the line
    Pattern p = Pattern.compile(".*he?at.*");
    // now give it a String to search
    Matcher m = p.matcher("Have you seen my hat anywhere?");
    // check for a match (should be true in this case)
    boolean b = m.matches();
use Pattern's convenience method "matches":
    boolean b = Pattern.matches(".*he?at.*", "Have you seen my hat anywhere?");
use String's convenience method "matches":
    String s = "Have you seen my hat anywhere?";
    boolean b = s.matches(".*he?at.*");

The Matcher class has three methods used to search for patterns:

matches: attempts to match the entire input sequence against the pattern
lookingAt: attempts to match the input sequence, starting at the beginning, against the pattern
find: scans the input sequence looking for the next subsequence that matches the pattern
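
The three search methods give different answers for the same pattern and input. The following is a minimal sketch, assuming a made-up class MatcherMethodsDemo with hypothetical helper methods and an arbitrary sample sentence. Because find continues from where the previous match ended, each helper creates a fresh Matcher rather than reusing one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherMethodsDemo {
    // Matches "hat" or "heat" (no leading or trailing .* this time).
    static final Pattern HAT = Pattern.compile("he?at");

    // matches(): the entire input must fit the pattern.
    static boolean wholeMatch(String input) {
        return HAT.matcher(input).matches();
    }

    // lookingAt(): the pattern must match at the start, but may stop early.
    static boolean prefixMatch(String input) {
        return HAT.matcher(input).lookingAt();
    }

    // find(): repeatedly locate the next matching subsequence.
    static List<String> findAll(String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = HAT.matcher(input);
        while (m.find()) {
            hits.add(m.group() + "@" + m.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "heat and a hat";
        System.out.println(wholeMatch(input));   // false: the input is more than just "heat"
        System.out.println(prefixMatch(input));  // true: the input begins with "heat"
        System.out.println(findAll(input));      // [heat@0, hat@11]
    }
}
```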

Demonstration programs

RegEx1.java: basic regex demo
RegEx2.java: interactive regex demo
2cities.txt: text file for RegEx2 program
Tokenize.java: demonstrates an easy way to tokenize a string
FindLinks.java: finds links in web pages online
Tokenize.java: demonstrates StringTokenizer (no regex)

String methods

Several String methods use regular expressions. They are:

boolean matches(String regex)
String replaceAll(String regex, String replacement)
String replaceFirst(String regex, String replacement)
String[] split(String regex)
String[] split(String regex, int limit)
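
The String methods listed above can be sketched as follows. The class name StringRegexDemo, the helpers splitList and swapNames, and all sample strings are assumptions made for illustration. In a replacement string, $1 and $2 refer to the text matched by the first and second groups.

```java
import java.util.Arrays;

class StringRegexDemo {
    // Hypothetical helper: split a comma-separated list, ignoring spaces around the commas.
    static String[] splitList(String s) {
        return s.split("\\s*,\\s*");
    }

    // Hypothetical helper: turn "First Last" into "Last, First" using group references.
    static String swapNames(String s) {
        return s.replaceAll("(\\w+) (\\w+)", "$2, $1");
    }

    public static void main(String[] args) {
        String list = "one, two,three ,  four";

        System.out.println(Arrays.toString(splitList(list)));             // [one, two, three, four]
        // A limit of 2 stops splitting after the first separator.
        System.out.println(Arrays.toString(list.split("\\s*,\\s*", 2)));  // [one, two,three ,  four]

        System.out.println("555-1234".replaceAll("\\d", "#"));            // ###-####
        System.out.println("555-1234".replaceFirst("\\d+", "XXX"));       // XXX-1234
        System.out.println(swapNames("John Smith"));                      // Smith, John
        System.out.println("heat".matches("he?at"));                      // true
    }
}
```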