Regular Expressions

Regular Expression Types

Two types of regular expressions exist: basic regular expressions (BRE) and extended regular expressions (ERE).

Most UNIX utilities use basic regular expressions, but a few, such as awk and egrep, use extended ones.

Regular expressions are implemented in many ways. Extended expressions consist of the same metacharacters as basic expressions, with a few additions.

BRE are used by default by commands such as grep, sed and ed.

ERE are used by commands such as awk and egrep (grep -E).

Other utilities use their own regex type:

Utility   Regex Type
EMACS     EMACS Regular Expressions
PERL      PERL Regular Expressions

Special Characters

Some characters have a special use in regex: [ ] \ ^ $ . | ? * + ( ) { } -

Most of those special characters have to be escaped with a backslash symbol in order to be treated as literal characters when used in ERE or BRE.

Parentheses "(" ")" and curly braces "{" "}" are exceptions to the rule: their behaviour depends on the type of regular expression they are used in, BRE or ERE. Refer to the BRE and ERE Features section.
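
The escaping rule can be sketched with grep. This is a minimal illustration, assuming a POSIX-style grep is installed; the sample strings are made up for the example.

```shell
# Sample input: one line with a literal dot, one without.
input='a.c
abc'

# Unescaped "." matches any character, so both lines match.
unescaped=$(printf '%s\n' "$input" | grep -c 'a.c')

# Escaped "\." matches only a literal dot, so only the first line matches.
escaped=$(printf '%s\n' "$input" | grep -c 'a\.c')

echo "unescaped: $unescaped, escaped: $escaped"
```

The patterns are written in single quotes so that the shell passes the backslash to grep unchanged.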

Anchor Characters and alternation

The character ^ is the starting anchor, and the character $ is the end anchor. The ^ is only an anchor if it is the first character in a regular expression. The $ is only an anchor if it is the last character. If they are not used at the proper end of the pattern, they no longer act as anchors.

The sequences \< and \> are similar to the ^ and $ anchors in that they do not occupy the position of a character. They anchor the expression between them so that it matches only at a word boundary.

The "\<" and "\>" anchors were introduced in the vi editor; other programs did not have them at the time. The "{min,max}" modifier is also newer, and earlier utilities did not support it. This made regular expressions difficult for novice users, because each utility seemed to follow a different convention.

Character                Meaning
^ (as first character)   start of line
$ (as last character)    end of line
\<                       start of word
\>                       end of word
|                        the choice (or set union) operator: matches either the expression before or the expression after the operator
.                        any character

Pattern         Matches
^A              "A" at the beginning of a line
A$              "A" at the end of a line
A^              "A^" anywhere on a line
$A              "$A" anywhere on a line
^^              "^" at the beginning of a line
$$              "$" at the end of a line
\<[tT]he\>      the or The
abc|def         "abc" or "def"

^ and $ lose their special meaning if they are not placed at the beginning and the end of the regular expression, respectively. ^ is also used inside character classes to express negative conditions (see Character Classes).
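
The anchors and the alternation operator can be tried out with grep. The following is a sketch under a few assumptions: the input lines are invented, \< and \> are treated as an extension (GNU grep supports them, other greps may not), and alternation is shown with grep -E because | is an ERE operator.

```shell
input='The cat
cat The
other
then'

# "^The" matches "The" only at the start of a line.
start=$(printf '%s\n' "$input" | grep '^The')

# "The$" matches "The" only at the end of a line.
end=$(printf '%s\n' "$input" | grep 'The$')

# \< and \> require word boundaries, so "other" and "then" do not count.
word=$(printf '%s\n' "$input" | grep -c '\<[tT]he\>')

# Alternation: lines containing "cat" or "other".
alt=$(printf '%s\n' "$input" | grep -E -c 'cat|other')

printf '%s\n' "$start" "$end" "$word" "$alt"
```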

Character Classes

You can specify a character class by enclosing a list of characters in [ ]. The class will match any one character from the list.

If the first character after [ is ^ the class matches any character not in the list (negative conditions).

Pattern          Matches
hell[aei]r       'hellar', 'heller' and 'hellir'
hell[^aeiou]r    'hellbr', 'hellcr' but not 'hellar' etc.
^[0123456789]$   any line that contains exactly one digit
^[0-9]$          any line that contains exactly one digit
[A-Za-z0-9_]     a single character that is a letter, digit, or underscore
[0-9-z]          a digit, a hyphen, or "z" in many engines; POSIX leaves a range that starts at the end of another range undefined, and some tools reject it
[0-9]\]          any digit followed by "]"

The hyphen character between two characters specifies a range.
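
A short sketch of listed, negated and range classes, assuming a POSIX-style grep; the word list is invented for the example.

```shell
input='hellar
heller
hellbr
hellor'

# [aei] accepts exactly one of the listed characters: 'hellar' and 'heller'.
listed=$(printf '%s\n' "$input" | grep -c 'hell[aei]r')

# [^aeiou] accepts any character that is NOT a listed vowel: 'hellbr' only.
negated=$(printf '%s\n' "$input" | grep -c 'hell[^aeiou]r')

# ^[0-9]$ accepts lines made of exactly one digit: only "7" below.
digit=$(printf '%s\n' 7 42 x | grep -c '^[0-9]$')

echo "$listed $negated $digit"
```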

Pre-defined classes

Class        Matching
[:alnum:]    all letters and all digits
[:alpha:]    all letters
[:blank:]    all horizontal whitespace
[:cntrl:]    all control characters
[:digit:]    all digits
[:graph:]    all printable characters, not including space
[:lower:]    all lower case letters
[:print:]    all printable characters, including space
[:punct:]    all punctuation characters
[:space:]    all horizontal or vertical whitespace
[:upper:]    all upper case letters
[:xdigit:]   all hexadecimal digits
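
In POSIX tools the pre-defined classes are written inside a bracket expression, which gives the doubled brackets seen below. This is a minimal sketch assuming a POSIX-style grep; the input lines are invented.

```shell
input='abc
123
a1 b2'

# Lines made only of digits.
digits_only=$(printf '%s\n' "$input" | grep '^[[:digit:]]*$')

# Lines containing at least one whitespace character.
has_space=$(printf '%s\n' "$input" | grep -c '[[:space:]]')

# Lines made only of letters and digits ("a1 b2" fails because of the space).
alnum_only=$(printf '%s\n' "$input" | grep -c '^[[:alnum:]]*$')

echo "$digits_only $has_space $alnum_only"
```

A class can be combined with other list members in the same brackets, for example [[:alpha:]_] for a letter or an underscore.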

The quantifiers

Quantifier   Matches the preceding "block"
+            one or more times
*            zero or more times
?            zero or one time
{n}          n times
{n,m}        between n and m times
{n,}         n or more times

Parentheses ( ) create a group, which a quantifier then treats as a single block.
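
The quantifiers can be checked with grep -E, which uses the ERE syntax shown in the table. This is a sketch, not a reference: full_match is a hypothetical helper defined here, and -x makes grep compare the pattern against the whole line.

```shell
# Hypothetical helper: prints "yes" if the whole string $2 matches ERE $1.
full_match() {
    if printf '%s\n' "$2" | grep -E -q -x -e "$1"; then
        echo yes
    else
        echo no
    fi
}

r1=$(full_match '1+1' '111')          # one or more "1", then a final "1"
r2=$(full_match '1+1' '1')            # too short: at least two "1"s are needed
r3=$(full_match '1\+1' '1+1')         # escaped "+" is a literal plus sign
r4=$(full_match 'ba?' 'b')            # the "a" is optional
r5=$(full_match '(abc){2}' 'abcabc')  # the group is repeated as one block
r6=$(full_match '[hc]?at' 'at')       # the class is optional

echo "$r1 $r2 $r3 $r4 $r5 $r6"
```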

Pattern     Matches
[0-9]*      zero or more digits
^ *         zero or more spaces at the beginning of the line
1+1         11 or 111 or 1111 etc.
1\+1        1+1 (+ is no longer interpreted as a special character because it is escaped)
ba?         "b" or "ba"
(abc){2}    abcabc
[hc]+at     "hat", "cat", "hhat", "chat", "hcat", "ccchat" etc.
[hc]?at     "hat", "cat" and "at"

More Examples

Pattern              Matches
foob.*r              'foobar', 'foobalkjdflkj9r' and 'foobr'
foob.+r              'foobar', 'foobalkjdflkj9r' but not 'foobr'
foob.?r              'foobar', 'foobbr' and 'foobr' but not 'foobalkj9r'
fooba{2}r            the string 'foobaar'
fooba{2,}r           strings like 'foobaar', 'foobaaar', 'foobaaaar' etc.
fooba{2,3}r          strings like 'foobaar', or 'foobaaar' but not 'foobaaaar'
([cC]at)|([dD]og)    "cat", "Cat", "dog" and "Dog"
a\.(\(|\))           "a.)" or "a.("
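
Several of these patterns can be compared side by side by counting how many words of a small list each one accepts. The sketch assumes grep -E is available; the word list and the count_matches helper are hypothetical and exist only for this example.

```shell
words='foobar
foobr
foobaar
foobaaar
foobaaaar'

# Hypothetical helper: number of words that match ERE $1 as a whole line.
count_matches() {
    printf '%s\n' "$words" | grep -E -c -x -e "$1"
}

star=$(count_matches 'foob.*r')        # every word in the list
plus=$(count_matches 'foob.+r')        # every word except 'foobr'
opt=$(count_matches 'foob.?r')         # 'foobar' and 'foobr'
exact=$(count_matches 'fooba{2}r')     # 'foobaar'
atleast=$(count_matches 'fooba{2,}r')  # 'foobaar', 'foobaaar', 'foobaaaar'
range=$(count_matches 'fooba{2,3}r')   # 'foobaar' and 'foobaaar'

echo "$star $plus $opt $exact $atleast $range"
```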

## BRE and ERE Features ##

BRE

"{" and "}", "(" and ")" have a literal meaning. To be interpreted as special, the "\" character has to be added before parentheses and braces.

For example, inside a BRE: \{N\} has a special meaning (as a quantifier) while {N} has a literal meaning

ERE

In ERE, it is exactly the opposite. Braces and parentheses lose their special meaning if they are escaped with a backslash.

For example, inside an ERE: \{N\} has a literal meaning while {N} has a special meaning (as a quantifier)
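
The difference can be observed by running the same two patterns through grep (BRE) and grep -E (ERE). This is a sketch assuming a POSIX-style grep; the two input lines are chosen so that one matches the quantifier reading and the other the literal reading.

```shell
# Two candidate lines: two "a" characters, and the literal text "a{2}".
input='aa
a{2}'

# BRE: escaped braces form a quantifier, plain braces are literal.
bre_escaped=$(printf '%s\n' "$input" | grep -x 'a\{2\}')
bre_plain=$(printf '%s\n' "$input" | grep -x 'a{2}')

# ERE: plain braces form a quantifier, escaped braces are literal.
ere_plain=$(printf '%s\n' "$input" | grep -E -x 'a{2}')
ere_escaped=$(printf '%s\n' "$input" | grep -E -x 'a\{2\}')

printf '%s\n' "$bre_escaped" "$bre_plain" "$ere_plain" "$ere_escaped"
```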

Useful Regular Expressions

TO BE DEVELOPED ...

http://regexr.com/

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. Regular expressions are used when you want to search for specific lines of text containing a particular pattern.

// Matches dates written as dd/mm/yyyy (digits only, no range checking).
// No "g" flag: with "g", test() keeps state in lastIndex between calls.
var reg = /^[0-9]{2}\/[0-9]{2}\/[0-9]{4}$/;
var chaine1 = "15/12/2003";
var chaine2 = "1a/bb/2003";
document.write(chaine1 + " ");
if (reg.test(chaine1)) { document.write("is in date format<BR>"); }
else { document.write("is not in date format<BR>"); }
document.write(chaine2 + " ");
if (reg.test(chaine2)) { document.write("is in date format"); }
else { document.write("is not in date format"); }

If a character to be matched is special, such as ! or ., escape it with a backslash \ so that it is not interpreted as special: \. and \!.

The ! character invokes bash's history substitution. When followed by a string, it tries to expand to the last history event that began with that string: !echo would expand to the last echo command in your history.

Within double quotes, all characters preserve their literal values except $, `, \ and !.

This implies that single quotes placed inside double quotes, as in '! p' and '!p', lose their special meaning (they cannot escape !), while ! retains its own, so history expansion can still be performed.

Note: the following excerpts are from man bash.

QUOTING


Enclosing characters in single quotes preserves the literal value of each character within the quotes. 

Enclosing characters in double quotes preserves the literal value of all characters within the quotes, with the exception of $, `, \, and, when history expansion is enabled, !. 

If enabled, history expansion will be performed unless an ! appearing in double quotes is escaped using a backslash. The backslash preceding the ! is not removed.
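
The quoting rules for $ and \ can be sketched in a script. History expansion itself is normally active only in interactive shells, so ! is not exercised here; the variable names are hypothetical.

```shell
name=world

single='Hello $name'     # single quotes: every character is literal
double="Hello $name"     # double quotes: $name is expanded
escaped="Hello \$name"   # a backslash inside double quotes keeps $ literal

printf '%s\n' "$single" "$double" "$escaped"
```

The same reasoning explains why regex patterns are usually passed in single quotes: the shell then hands every character to the utility untouched.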

HISTORY EXPANSION

History expansions are introduced by the appearance of the history expansion character, which is ! by default. Only backslash (\) and single quotes can quote the history expansion character.

Several characters inhibit history expansion if found immediately following the history expansion character, even if it is unquoted: space, tab, newline, carriage return, and =. If the extglob shell option is enabled, ( will also inhibit expansion.

However, when ! is followed by a space character, history expansion is not performed.