String Matches in Regular Expressions

by mike on September 10, 2011

Parentheses
Parentheses let you use regular expressions to match whole strings, or actual words.  You can write a search that looks for “virtual” or “mail” with (virtual|mail).  Note that the alternatives are separated by a pipe.  These are string searches, so you are looking for “virtual”, not “v” or “i” or “r”, etc.

(virtual|mail)    – string search for words “virtual” or “mail”

[virtual]        – character search for any of “v” or “i” or “r” etc.

The point is to understand and recognize the power of, and the difference between, “(strings)” and “[characters]”.
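
To see the difference in practice, here is a minimal sketch.  The word list is made up for illustration and does not come from /etc/sysconfig.

```shell
# Hypothetical sample words, one per line.
words=$(printf '%s\n' virtual mail vat rum sky)

# Group with alternation: a line must contain "virtual" or "mail".
group_matches=$(printf '%s\n' "$words" | grep -E '(virtual|mail)')

# Bracket expression: a line must contain any one of v, i, r, t, u, a, l.
class_matches=$(printf '%s\n' "$words" | grep -E '[virtual]')

printf 'group matches:\n%s\n' "$group_matches"
printf 'bracket matches:\n%s\n' "$class_matches"
```

The group should select only “virtual” and “mail”, while the bracket expression also picks up “vat” and “rum”, because each contains at least one of the listed letters.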

Parentheses provide two valuable options.  First, as mentioned above, they group characters into a string for matching purposes.  Such a group is different from a bracketed character class: think of the group as a text string or word.

Parentheses also let you use alternation.  Alternation uses a pipe to connect two strings and searches for either one.  This example searches for the two strings “sys” and “net”.  Note that “sys” is placed first in the pattern but the results are alphabetical, because grep prints matching lines in the order ls supplies them.

ls /etc/sysconfig | grep -E "(sys|net)"
netconsole
network
networking
network-scripts
syslog
sysstat
sysstat.ioconf
system-config-securitylevel
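
The ordering point can be checked with a small sketch.  The names below are hypothetical and deliberately not in alphabetical order, so you can see that grep keeps the order of its input regardless of the order of the alternatives.

```shell
# Hypothetical names, deliberately unsorted.
names=$(printf '%s\n' syslog crond network)

# "net" comes first in the pattern, but "syslog" comes first in the input.
as_given=$(printf '%s\n' "$names" | grep -E '(net|sys)')

# Sorting the input first is what produces alphabetical output.
after_sort=$(printf '%s\n' "$names" | sort | grep -E '(sys|net)')

printf 'input order:\n%s\n' "$as_given"
printf 'sorted first:\n%s\n' "$after_sort"
```

The first result should list “syslog” before “network”; the second should list them alphabetically, which is what ls does for you in the example above.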

The pipe symbol can be used more than once, in fact as many times as you want.

ls /etc/sysconfig | grep -E "(sys|net|grub)"
grub
netconsole
network
networking
network-scripts
syslog
sysstat
sysstat.ioconf
system-config-securitylevel

If you apply positional anchors, you can restrict the output to lines that start with one of those strings.

ls /etc/sysconfig | grep -E "^(sys|net|grub)"
grub
netconsole
network
networking
network-scripts
syslog
sysstat
sysstat.ioconf
system-config-securitylevel

Or, you can focus the output on those strings that make up the whole line.  By requiring both a start of the line “^” and an end of the line “$”, you limit the output.

ls /etc/sysconfig | grep -E "^(sys|net|grub)$"
grub

The parentheses in this case allow the block of characters to be used as a token.
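
A short sketch shows why the grouping matters once anchors are involved.  The names are hypothetical.  Without parentheses, the “^” applies only to the first alternative and the “$” only to the last one.

```shell
# Hypothetical names.
names=$(printf '%s\n' grub grub.conf netconsole subnet syslog)

# Grouped: the whole line must be exactly "sys", "net" or "grub".
grouped=$(printf '%s\n' "$names" | grep -E '^(sys|net|grub)$')

# Ungrouped: starts with "sys", OR contains "net", OR ends with "grub".
ungrouped=$(printf '%s\n' "$names" | grep -E '^sys|net|grub$')

printf 'grouped:\n%s\n' "$grouped"
printf 'ungrouped:\n%s\n' "$ungrouped"
```

The grouped pattern should return only “grub”, while the ungrouped one also returns “netconsole”, “subnet” and “syslog”.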

The other powerful feature of parentheses is that they allow you to nest your expressions.  These are known as sub-expressions.  Whereas alternation is achieved by using:

(virtual|mail)

Nesting is provided with this structure.

(virt(ual)?)

This structure matches “virt”, which is required, and it also matches “virtual”, with the “ual” being optional.  The “?” makes the group directly before it optional.

Words like “optional” and “required” are easy to misread, so here is exactly what each quantifier means, as several readers have commented.
Clarification of quantifiers:
*  – match 0 or more times
+  – match 1 or more times
?  – match 0 or 1 times
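
The three quantifiers can be compared side by side with a small sketch.  The sample strings are made up, and the -x option is used so that the pattern has to match the whole line, which makes the counts visible.

```shell
# Hypothetical samples with zero, one and two trailing "6" characters.
samples=$(printf '%s\n' ip ip6 ip66)

# "*": zero or more "6".
star=$(printf '%s\n' "$samples" | grep -Ex 'ip6*')

# "+": one or more "6".
plus=$(printf '%s\n' "$samples" | grep -Ex 'ip6+')

# "?": zero or one "6".
question=$(printf '%s\n' "$samples" | grep -Ex 'ip6?')

printf 'star:\n%s\nplus:\n%s\nquestion:\n%s\n' "$star" "$plus" "$question"
```

With these samples “*” should accept all three lines, “+” should reject plain “ip”, and “?” should reject “ip66”.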

In this example grep is used to search for lines that contain “ip”; lines that contain “ip6” are matched as well.  If no file names contained “ip6” the search would still succeed and return an exit status of “0”, because the “6” is optional based on the “?”.  Note that “network-scripts” is listed because “scripts” contains “ip”.
ls /etc/sysconfig | grep -E "(ip(6)?)"
ip6tables
ip6tables-config
iptables
iptables-config
network-scripts
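
The exit status can be checked directly.  This is a minimal sketch with a hypothetical listing that never contains “ip6”; the -q option only suppresses the output so that the status is all that is left.

```shell
# Hypothetical listing that contains "ip" but never "ip6".
listing=$(printf '%s\n' iptables iptables-config)

# "6" is optional, so "ip" alone is enough: grep exits with 0.
optional_status=0
printf '%s\n' "$listing" | grep -Eq '(ip(6)?)' || optional_status=$?

# At least one "6" is needed, so nothing matches: grep exits with 1.
required_status=0
printf '%s\n' "$listing" | grep -Eq '(ip(6)+)' || required_status=$?

printf 'optional: %s  required: %s\n' "$optional_status" "$required_status"
```

An exit status of 0 means at least one line matched, and 1 means no line matched.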

The output changes if you require “6” by adding the “+”, which needs at least one match.  Because “+” means one or more, this would also match a name such as “ip66tables”.

ls /etc/sysconfig | grep -E "(ip(6)+)"
ip6tables
ip6tables-config

Of course, for selecting lines this could also be written this way, since any line that contains “ip66” also contains “ip6”.

ls /etc/sysconfig | grep -E "(ip6)"
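
The two patterns pick the same lines but do not match the same text.  The sketch below uses a made-up name, “ip66tables”, to show this.  It relies on the -o option, which prints only the matched text; -o is a common extension in GNU and BSD grep rather than part of POSIX, so treat its availability as an assumption.

```shell
# Hypothetical names; "ip66tables" is invented to show repetition.
names=$(printf '%s\n' iptables ip6tables ip66tables)

# Both patterns select the same lines.
plus_lines=$(printf '%s\n' "$names" | grep -E '(ip(6)+)')
plain_lines=$(printf '%s\n' "$names" | grep -E '(ip6)')

# With -o the matched text differs: "+" consumes every consecutive "6".
plus_text=$(printf '%s\n' "$names" | grep -Eo '(ip(6)+)')
plain_text=$(printf '%s\n' "$names" | grep -Eo '(ip6)')

printf 'plus matched:\n%s\nplain matched:\n%s\n' "$plus_text" "$plain_text"
```

The difference only matters when you care about the matched text itself, for example with -o or in a substitution.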

Here is another illustration of the same idea.  Nested parentheses give you either optional strings or required strings, depending on whether you use “?” or “+”.

ls /etc/sysconfig | grep -E "(ke(r)?)"
kernel
keyboard

ls /etc/sysconfig | grep -E "(ke(r)+)"
kernel

Here is a more complex example with multiple options.  Because both “p” and “6” are optional, any name that contains an “i” matches.

ls /etc/sysconfig | grep -E "(i(p)?(6)?)"
auditd
authconfig
firstboot
hidd
i18n
init
ip6tables
ip6tables-config
iptables
iptables-config
irda
irqbalance
mkinitrd
networking

—cut—
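
When every group after the first character is optional, the pattern selects the same lines as that single character would.  A minimal sketch with hypothetical names:

```shell
# Hypothetical names; two contain an "i" and two do not.
names=$(printf '%s\n' auditd crond init kernel)

# Both "p" and "6" are optional, so only the "i" has to be present.
with_groups=$(printf '%s\n' "$names" | grep -E '(i(p)?(6)?)')

# A plain search for "i" selects the same lines.
plain_i=$(printf '%s\n' "$names" | grep 'i')

printf 'with groups:\n%s\nplain i:\n%s\n' "$with_groups" "$plain_i"
```

Both searches should return “auditd” and “init”, which is why the listing above is so long.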

You can see that if you allow “p” to be optional but require a “6”, you only get two results.

ls /etc/sysconfig | grep -E "(i(p)?(6)+)"
ip6tables
ip6tables-config

If you require “ip” and leave “6” optional, you get more results.  However, you probably did not want “network-scripts” in the output.

ls /etc/sysconfig | grep -E "(i(p)+(6)?)"
ip6tables
ip6tables-config
iptables
iptables-config
network-scripts

By requiring that the text string be at the start of the line, you can eliminate some results.

ls /etc/sysconfig | grep -E "^(i(p)+(6)?)"
ip6tables
ip6tables-config
iptables
iptables-config

{ 6 comments }

Jamie September 10, 2011 at 5:16 pm

Just a quick note (as I’ve just been linked to this post), the “+” quantifier doesn’t exactly mean “require”, it means “one or more”. Thus, while it matches ip6tables, it would also match ip66tables, ip666tables and so on and so forth.

mike September 11, 2011 at 12:47 pm

I have added some clarification, seen in red above, to help people understand…thanks for the comment.

Constantin September 11, 2011 at 11:00 am

Very nice explanations.
Thank you.

ronald September 11, 2011 at 11:09 am

You have misunderstood the meaning of ‘?’ and ‘+’, which are “0 or 1″ and “1 or more” respectively.

mike September 11, 2011 at 12:49 pm

Note the updated text in red, my choice of words like “optional” and “required” were maybe not the best choices.

Carl Lowenstein September 12, 2011 at 3:31 am

Note that the symbol ‘|’ is not always “pipe”.
Used in regular expressions, it is “OR”.
Used in shell command lines, it is “pipe”.
