13 Features of Regular Expressions

by mike on November 12, 2010

The following is a summary of some of the major features of regular expressions.

1. Supported by Linux/UNIX Utilities
Many commands support at least the basic aspects of regular expressions.  As a way to get more out of your tools, regular expressions have important implications and are worth your time to learn.  Three major utilities demonstrate the importance of regular expressions.
a. grep – a line-matching tool that is based on regular expressions
b. awk – a field-parsing tool based on matching text with regular expressions
c. sed – a stream editor that uses regular expressions to modify text streams

Experience with these three tools or utilities is critical in the building of shell scripts.
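
As a minimal sketch of the three tools each applying a regular expression, assume a few made-up log lines held in a shell variable. The sample text and the variable names are illustrative only and do not come from a real system.

```shell
# Hypothetical sample lines; not taken from a real log.
sample='Aug 30 postfix/pickup: uid=0 from=<root>
Aug 30 sshd: Failed password for root
Aug 31 postfix/cleanup: message accepted'

# grep: print the lines that match the expression.
grep_out=$(printf '%s\n' "$sample" | grep 'postfix')

# awk: print the second field of the lines that match the expression.
awk_out=$(printf '%s\n' "$sample" | awk '/sshd/ { print $2 }')

# sed: rewrite the matched text as the stream passes through.
sed_out=$(printf '%s\n' "$sample" | sed -n 's/^Aug 31/Sep 01/p')

printf '%s\n' "$grep_out" "$awk_out" "$sed_out"
```

Here grep selects the two postfix lines, awk prints "30", and sed prints the last line with its date rewritten.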

2.  Reduces Evaluation Times
Because a single regular expression can describe several conditions at once, it can save time and resources on a server.  Instead of making many passes over the same text to achieve one goal at a time, you can often achieve several goals in one pass.
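
A small sketch of the idea, assuming three made-up lines of text: one grep with alternation counts the same lines that would otherwise take two separate passes. The variable names are hypothetical.

```shell
# Hypothetical input text.
text='error: disk full
info: backup done
warning: disk almost full'

# One pass: a single expression covers both conditions.
one_pass=$(printf '%s\n' "$text" | grep -cE '^(error|warning):')

# Two passes over the same text, added together for comparison.
errors=$(printf '%s\n' "$text" | grep -c '^error:')
warnings=$(printf '%s\n' "$text" | grep -c '^warning:')
two_pass=$((errors + warnings))

echo "$one_pass $two_pass"   # prints "2 2"
```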

3. Provides Concise Search Pattern Descriptions
Regular expressions allow you to create specific and concise descriptions of what you are searching for.  The shell's filename patterns (globs) offer the same kind of conciseness, although they are not regular expressions and follow slightly different rules.  For example, if you wanted to view the last 10 lines of only the archived maillogs, maillog.1 through maillog.4, and not the current maillog, you could use a pattern like this:

tail ma*.[1-4]

This pattern uses “ma” as a text string to separate the maillogs from the logs messages and mcelog.  The “*” keeps the description concise by standing in for the letters “illog” (in a glob “*” matches any run of characters, whereas in a regular expression it repeats the preceding item), and the “[1-4]” selects the four archived logs while excluding the current maillog.

Your output is then divided into 4 separate sections, one for each archived maillog, for example:
==> maillog.2 <==
Aug 30 07:01:57 mail postfix/pickup[8515]: 7C6062EF0B: uid=0 from=<root>
Aug 30 07:01:57 mail postfix/cleanup[15100]: 7C6062EF0B: message-id=<20100830130157.7C6062EF0B@powpost.example.com>
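
The pattern in the tail command is expanded by the shell as a filename pattern (a glob) rather than by a regular-expression engine. As a rough sketch of how the two notations relate, assume a made-up list of file names: the glob ma*.[1-4] selects the same names as the regular expression ^ma.*\.[1-4]$.

```shell
# Hypothetical file names as they might appear in /var/log.
names='maillog
maillog.1
maillog.4
maillog.5
messages.1
mcelog'

# Regular expression: "ma", any characters, a literal dot, one of 1-4.
regex_hits=$(printf '%s\n' "$names" | grep -E '^ma.*\.[1-4]$')

# The same selection with the shell's own pattern matching.
glob_hits=$(for n in $names; do
  case $n in
    (ma*.[1-4]) printf '%s\n' "$n" ;;
  esac
done)

printf '%s\n' "$regex_hits"   # maillog.1 and maillog.4
```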

4.  Uses Metacharacters as Well as Regular Characters
a. Metacharacters include:
* – matches the preceding item 0 or more times
? – matches the preceding item 0 or 1 times
+ – matches the preceding item 1 or more times
[] – defines a character class that matches one character, [Bb]
() – groups text and, together with “|”, provides alternation, (mail|db)
^ – anchors text at start of line
$ – anchors text at end of line
| – separates alternative words
. – matches any single character
\ – escapes the following character
{} – quantifier giving a minimum and maximum count, {1,6}
b. Regular Characters include: A-Z, a-z, 0-9 and “_”.
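
A short sketch of three of these metacharacters at work with grep -E, using a made-up word list (the words and variable names are illustrative assumptions):

```shell
# Hypothetical word list.
words='color
colour
colr
a.b
axb'

# "?" makes the preceding "u" optional.
opt=$(printf '%s\n' "$words" | grep -E '^colou?r$')   # color, colour

# "." matches any single character.
dot=$(printf '%s\n' "$words" | grep -E '^a.b$')       # a.b, axb

# "\" removes the special meaning of the dot.
lit=$(printf '%s\n' "$words" | grep -E '^a\.b$')      # a.b

printf '%s\n' "$opt" "$dot" "$lit"
```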

5. Priority for First Match
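The first match of a regular expression takes priority, or in other words “wins”: the search works from left to right, and the match that starts earliest in the text is the one reported.  Unless you use an option that asks for every match, such as the “g” flag in sed, the search stops after that first match.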
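
A minimal sketch with sed, assuming a made-up line of text: without a flag only the first match is replaced, while the "g" flag replaces every match.

```shell
# Hypothetical input line.
line='cat bat rat'

# Only the first (leftmost) match is replaced.
first=$(printf '%s\n' "$line" | sed 's/[a-z]at/X/')    # X bat rat

# With the g flag every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/[a-z]at/X/g')     # X X X

printf '%s\n' "$first" "$all"
```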

6. Metacharacters Anchor Locations
An important use of metacharacters is to anchor the search to the start of a line or to the end of a line.
a. ^ – anchors text to start of line, ^Virtual
b. $ – anchors text to end of line, Virtual$
c. ^$ – indicates blank line
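
The three anchors above can be sketched with grep -c, which counts matching lines. The sample text is made up and includes one blank line.

```shell
# Hypothetical text; the third line is blank.
text='Virtual host
my Virtual

ends with Virtual'

at_start=$(printf '%s\n' "$text" | grep -c '^Virtual')   # 1
at_end=$(printf '%s\n' "$text" | grep -c 'Virtual$')     # 2
blank=$(printf '%s\n' "$text" | grep -c '^$')            # 1

echo "$at_start $at_end $blank"   # prints "1 2 1"
```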

7.  Defines Character Classes and Sets using Ranges
Regular expressions have an important function of providing a range of options for defining a specific character.  Shell filename patterns use the same bracket syntax, so the examples here select files by name.  For example, if you were using grep to review /var/log/messages and you had set up your system to keep 6 weeks of logs for messages, you would construct a search like this:

grep 'some_text' me*[1-6]

You know you have a series of logs: messages and messages.1 through messages.6.  In the command above, [1-6] covers messages.1 through messages.6 because a character class always stands for exactly one character.  Ranges of letters work the same way.  In the next example the range “r-t” selects all logs whose names start with “r”, “s” or “t”, and the range “2-4” selects archives 2 through 4.
tail [r-t]*.[2-4]

==> rpmpkgs.2 <==
==> rpmpkgs.3 <==
==> rpmpkgs.4 <==
==> secure.2 <==
==> secure.3 <==
==> secure.4 <==
==> spooler.2 <==
Each of these ranges only matches one character.
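
The same ranges work inside a regular expression. As a sketch, assuming a made-up list of log names, grep -E can make the same selection that the shell made in the tail command:

```shell
# Hypothetical log file names.
logs='rpmpkgs.2
secure.3
spooler.5
messages.2
yum.log'

# One character from r-t, anything, a literal dot, one character from 2-4.
picked=$(printf '%s\n' "$logs" | grep -E '^[r-t].*\.[2-4]$')

printf '%s\n' "$picked"   # rpmpkgs.2 and secure.3
```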

8. Ability to Reference Pre-Defined Character Sets
There are pre-defined character sets, written as backslash escapes, that provide a “short hand” method of referring to special characters.  Which of them are recognized depends on the tool.

a. \a – alert, usually a bell sound
b. \b – backspace, or word boundary in some tools
c. \e – escape character
d. \f – form feed
e. \n – newline
f. \r – carriage return
g. \t – horizontal tab
h. \v – vertical tab
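
A small sketch of one of these escapes, assuming awk as the tool (awk recognizes \t inside its regular expressions; other tools may not). The input is made up: one line with a tab and one with a space.

```shell
# Count the lines that contain a horizontal tab.
tabbed=$(printf 'name\tvalue\nname value\n' | awk '/\t/ { n++ } END { print n+0 }')

echo "$tabbed"   # prints "1"
```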

9.  Negates Character Classes
Regular expressions can be used to negate specific character classes.  A “^” placed as the first character inside a character class negates that class.  In this example all of the logs whose names do NOT start with a character in the range “r-t” are matched.  So the “^” inside the character class is not used as an anchor but as a negation.

tail [^r-t]*.[2-4]
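
A sketch of the same negation inside a regular expression, using a made-up list of log names with grep -E:

```shell
# Hypothetical log file names.
logs='rpmpkgs.2
secure.2
messages.2
cron.2'

# First character is anything except r, s or t.
others=$(printf '%s\n' "$logs" | grep -E '^[^r-t].*\.[2-4]$')

printf '%s\n' "$others"   # messages.2 and cron.2
```

Note that the first “^” is an anchor and the second, inside the brackets, is the negation.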

10. Supports Alternation of Regular Expressions or Words
Regular expressions provide the ability to search for alternative character strings or words, grouped inside “( )”.  Here is an example of searching for two strings, “fail” OR “Device”.

cat messages | grep -E '(fail|Device)'

The strings are separated by the “|” symbol.  You have the option to use more than two strings separated by the pipe.
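
A minimal sketch with made-up log lines, showing two and then three alternatives:

```shell
# Hypothetical log lines.
lines='login fail for root
Device eth0 up
all good
failover to Device sdb'

two=$(printf '%s\n' "$lines" | grep -cE '(fail|Device)')          # 3
three=$(printf '%s\n' "$lines" | grep -cE '(fail|Device|good)')   # 4

echo "$two $three"   # prints "3 4"
```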

11.  Supports Alternation of Characters
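There is a difference between the alternation of words and the alternation of characters.  Words are text strings that are matched as a whole inside “( )”, whereas “[ ]” matches any single character from the list.  Notice that the example below uses “[ ]” instead of “( )”, so it matches any line containing one of the characters “f”, “a”, “|”, “D”, “e” or “v”; inside “[ ]” the pipe is an ordinary character.

cat messages | grep -E '[fa|Dev]'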
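
To see the difference, here is a sketch that runs both forms against the same made-up lines:

```shell
# Hypothetical input lines.
lines='fail
Device
a
|
xyz'

# Grouped alternation: the line must contain "fa" or "Dev".
as_words=$(printf '%s\n' "$lines" | grep -cE '(fa|Dev)')   # 2

# Character class: the line must contain any one of f, a, |, D, e, v.
as_chars=$(printf '%s\n' "$lines" | grep -cE '[fa|Dev]')   # 4

echo "$as_words $as_chars"   # prints "2 4"
```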

12.  Supports Repetition with Quantifiers
Another feature of regular expressions is that repetition is supported and controlled by quantifiers.  Here are several examples.
a. * – match the preceding item 0 or more times, does not fail if there are no matches
b. + – match the preceding item 1 or more times, fails if there is not at least one match
c. ? – match the preceding item 0 or 1 times
d. {n} – match the preceding item exactly n times

This example of grep is looking for “20” followed by two digits, each of which must be in the range “1-9”.
ps aux | grep -E '20[1-9][1-9]'
avahi     2014  0.0  0.3  23276  1280 ?        Ss   Sep01   0:00 avahi-daemon: running [mail.local]
avahi     2015  0.0  0.0  23148   340 ?        Ss   Sep01   0:00 avahi-daemon: chroot helper
root      2043  0.0  0.1  18416   524 ?        S    Sep01   0:00 /usr/sbin/smartd -q never

You can perform the same match with a quantifier, “{2}”, that matches the “[1-9]” twice.
ps aux | grep -E '20[1-9]{2}'
avahi     2014  0.0  0.3  23276  1280 ?        Ss   Sep01   0:00 avahi-daemon: running [mail.local]
avahi     2015  0.0  0.0  23148   340 ?        Ss   Sep01   0:00 avahi-daemon: chroot helper
root      2043  0.0  0.1  18416   524 ?        S    Sep01   0:00 /usr/sbin/smartd -q never

You can also provide a range.  In this example the quantifier allows from one to three digits in the range “1-9” after the “2”.

ps aux | grep -E '2[1-9]{1,3}'
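
Because ps output differs on every system, here is a sketch of the same quantifiers against fixed, made-up numbers. Anchors are added so that each whole line must match.

```shell
# Hypothetical numbers standing in for process IDs.
pids='2014
2005
2043
201
20999'

# Written out twice, and with the {2} quantifier: same result.
explicit=$(printf '%s\n' "$pids" | grep -E '^20[1-9][1-9]$')
counted=$(printf '%s\n' "$pids" | grep -E '^20[1-9]{2}$')

# A {1,3} range: "2" followed by one to three digits from 1-9.
ranged=$(printf '2\n21\n2111\n21111\n' | grep -cE '^2[1-9]{1,3}$')   # 2

printf '%s\n' "$counted"   # 2014 and 2043
```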

13. POSIX Regular Expressions are Greedy
When you discuss alternation, for example (vir|virtual|virtualize), the question is which alternative is matched.  A greedy match is a match that attempts to match the most text, whereas a lazy match attempts to match the least.  POSIX uses greedy matches that work from left to right.  The POSIX standard requires that when several alternatives could match at the same starting point, the one that matches the most text is chosen; this is called the “longest of the leftmost”.

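This sounds great until you consider the possible performance cost.  If you are forcing the testing of each option available in the list before a match is considered then you are getting a better match at the expense of time and resources.  How much this costs depends on the engine: implementations based on finite automata can find the longest match in a single pass over the text, while backtracking implementations may have to try each alternative.  Either way, it is important to pay close attention to your regular expression requirements.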
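
A sketch of the “longest of the leftmost” rule, assuming a grep that supports the -o option (a common extension in GNU and BSD grep, not part of POSIX grep) and a sed that accepts -E. The input word is made up.

```shell
# Hypothetical input word.
word='virtualize'

# -o prints only the text that the expression matched.
longest=$(printf '%s\n' "$word" | grep -oE 'vir|virtual|virtualize')

# sed puts brackets around whatever the expression matched.
marked=$(printf '%s\n' "$word" | sed -E 's/vir|virtual|virtualize/[&]/')

printf '%s\n' "$longest" "$marked"
```

With a POSIX-conforming engine both commands should report the whole word “virtualize” even though “vir” is listed first, because all three alternatives start at the same position and the longest one is chosen.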

{ 4 comments }

cloudlurker November 13, 2010 at 4:26 am

nice overall topic on regular expressions.

anon November 13, 2010 at 5:42 am

Your example in #3 is not actually a regular expression, but is instead simply an example of shell globbing syntax.

Tommy November 22, 2010 at 1:45 pm

“If you are forcing the testing of each option available in the list before a match is considered then you are getting a better match at the expense of time and resources.”

This is misleading, because regular expression engines doesn’t “test each option”. They don’t go back and forth and test different options. Simple regular expressions, like those you have here, describe *recursively enumerable languages*. All these regular expressions are very likely to scan each character just once. It is very efficient.

Of course, there are regexps that don’t describe a recursively enumerable language. Those are slower to match.

Tommy November 22, 2010 at 1:56 pm

And by recursively enumerable language, I really meant regular language. Sorry for the confusion.
