Friday, February 11, 2011

Basic Regular Expression Syntax

In computing, a regular expression, also referred to as regex or regexp, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

The following examples illustrate a few specifications that could be expressed in a regular expression:

  • The sequence of characters "car" appearing consecutively in any context, such as in "car", "cartoon", or "bicarbonate"
  • The sequence of characters "car" occurring in that order with other characters between them, such as in "Icelander" or "chandler"
  • The word "car" when it appears as an isolated word
  • The word "car" when preceded by the word "blue" or "red"
  • The word "car" when not preceded by the word "motor"
  • A dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits (for example, "$100" or "$245.99").

Regular expressions can be much more complex than these examples.

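The specifications listed above can be written as actual patterns. The following is only a sketch using Python's re module; Python, the constant names, and the helper found() are illustrative choices and do not come from the original text. The "not preceded by" case relies on a negative lookbehind, a construct beyond the basic syntax covered in this post.

```python
import re

# Each pattern is one possible reading of the corresponding list item.
CAR_ANYWHERE = re.compile(r"car")                  # "car" consecutively, in any context
CAR_SCATTERED = re.compile(r"c.*a.*r")             # c, a, r in order, other characters allowed between
CAR_WORD = re.compile(r"\bcar\b")                  # "car" as an isolated word
COLORED_CAR = re.compile(r"\b(blue|red) car\b")    # "car" preceded by "blue" or "red"
NOT_MOTOR_CAR = re.compile(r"(?<!motor )\bcar\b")  # "car" not preceded by "motor "
DOLLARS = re.compile(r"\$[0-9]+(\.[0-9]{2})?")     # dollar sign, digits, optional ".dd"


def found(pattern, text):
    """Return True if the compiled pattern matches anywhere in the text."""
    return pattern.search(text) is not None
```

For instance, found(CAR_WORD, "cartoon") is False while found(CAR_ANYWHERE, "cartoon") is True, and DOLLARS.fullmatch("$245.9") returns None because the optional part demands exactly two digits after the period.
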
Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. Some of these languages, including Perl, Ruby, Awk, and Tcl, have fully integrated regular expressions into the syntax of the core language itself. Others like C, C++, .NET, Java, and Python instead provide access to regular expressions only through libraries. Utilities provided by Unix distributions—including the editor ed and the filter grep—were the first to popularize the concept of regular expressions.

As an example of the syntax, the regular expression \bex can be used to search for all instances of the string "ex" that occur after "word boundaries" (signified by the \b). Thus, in the string "Texts for experts", \bex matches the "ex" in "experts" but not in "Texts" (because the "ex" occurs inside a word and not immediately after a word boundary).
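
The \bex example can be checked with a few lines of Python (an illustrative choice of language; the variable names are made up for this sketch). Positions are zero-based character offsets into the string.

```python
import re

text = "Texts for experts"

# With the word boundary, only the "ex" that starts "experts" is found.
bounded = [(m.start(), m.group()) for m in re.finditer(r"\bex", text)]

# Without \b, the "ex" inside "Texts" is found as well.
unbounded = [m.start() for m in re.finditer(r"ex", text)]
```

Here bounded holds a single match at offset 10, while unbounded holds the offsets 1 and 10.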

Many modern computing systems provide wildcard characters in matching filenames from a file system. This is a core capability of many command-line shells and is also known as globbing. Wildcards differ from regular expressions in generally expressing only limited forms of patterns.
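
To make the difference concrete, here is a small sketch comparing a wildcard with a regular expression. It assumes Python's fnmatch module as a stand-in for shell globbing, and the file names are invented for the example.

```python
import fnmatch
import re

names = ["notes.txt", "notes1.txt", "notes12.txt", "notes.text"]

# Glob: "?" stands for exactly one character, whatever it is.
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "notes?.txt")]

# Regex: "one or more digits" is easy to state, while basic wildcard
# syntax has no direct way to say it.
regex_hits = [n for n in names if re.fullmatch(r"notes[0-9]+\.txt", n)]
```

The glob selects only "notes1.txt"; the regular expression selects "notes1.txt" and "notes12.txt".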

The Perl You Need to Know: Basic Regular Expression Syntax

.
any single character
The dot (.) can be used as a placeholder for any single character (by default, any character except a newline).
Examples:

"do."
would match "dog", "dot", "doe", etc.

"d..r"
would match "door" and "deer".
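
A quick way to try the dot examples, sketched in Python rather than Perl (an assumption made for illustration; the helper name whole_word is hypothetical). re.fullmatch is used so that the pattern has to account for the entire word, which is the sense of "would match" used in this post.

```python
import re


def whole_word(pattern, word):
    """True when the pattern matches the entire word, not just a part of it."""
    return re.fullmatch(pattern, word) is not None


three_letters = {w: whole_word(r"do.", w) for w in ["dog", "dot", "doe", "do", "dogs"]}
four_letters = {w: whole_word(r"d..r", w) for w in ["door", "deer", "dr", "diver"]}
```

"do" fails because the dot insists on one character being present, and "dogs" fails because nothing in the pattern accounts for the final "s".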


*
zero or more of the previous character
The asterisk (*) specifies that zero or more instances of the previous character should occur in sequence.
Examples:

"do.*"
would match "dog", "done", "doppelganger", etc.
(why? "d-o- followed by zero or more of any character")

"to*"
would match "to" and "too"
(why? "t- followed by zero or more o's", so a lone "t" matches as well)

"fre*.."
would match "frat", "free", "from"
(why? "f-r- followed by zero or more e's followed by any two characters")
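
The asterisk examples can be verified with a short Python sketch (Python and the helper name whole_word are assumptions for illustration; the patterns themselves are the ones from the examples).

```python
import re


def whole_word(pattern, word):
    """True when the pattern matches the entire word."""
    return re.fullmatch(pattern, word) is not None


# The * applies only to the single character (or dot) right before it.
any_tail = [whole_word(r"do.*", w) for w in ["do", "dog", "done", "doppelganger"]]
repeated_o = [whole_word(r"to*", w) for w in ["t", "to", "too", "top"]]
optional_e = [whole_word(r"fre*..", w) for w in ["frat", "free", "from", "fry"]]
```

"top" is rejected because "p" is not an "o", and "fry" is rejected because only one character is left for the two dots.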

+
one or more of the previous character
The plus sign (+) demands at least one instance of the previous character in sequence; it is similar to (*) but slightly stricter.
Examples:

"fre+.."
would match "freak", "freeze", "fresh"
(why? "f-r- followed by one or more e's followed by any two characters")
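
To see how (+) differs from (*), the sketch below runs both forms of the pattern over the same words. It is written in Python as an illustrative assumption, and the helper name whole_word is hypothetical.

```python
import re


def whole_word(pattern, word):
    """True when the pattern matches the entire word."""
    return re.fullmatch(pattern, word) is not None


words = ["freak", "freeze", "fresh", "frat", "from"]

# + needs at least one "e" after "fr"; * is satisfied with none.
with_plus = [w for w in words if whole_word(r"fre+..", w)]
with_star = [w for w in words if whole_word(r"fre*..", w)]
```

with_plus keeps only "freak", "freeze", and "fresh"; with_star keeps all five words.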

?
zero or one of the previous character
The question mark (?) says that there should be zero or one of the previous character, but not more than one. Unlike (*) and (+), it never allows repetition.
Examples:

"ton?e"
would match "toe" and "tone"
(why? "t-o- followed by zero or one n followed by e")
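
A minimal check of the question mark, assuming Python's re module for illustration (the helper name whole_word is made up for this sketch):

```python
import re


def whole_word(pattern, word):
    """True when the pattern matches the entire word."""
    return re.fullmatch(pattern, word) is not None


# The "n" may be absent or appear once, but never twice.
optional_n = {w: whole_word(r"ton?e", w) for w in ["toe", "tone", "tonne"]}
```

"tonne" is rejected because it contains two n's.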

( )
grouping
The parentheses ( ) are used to group patterns together, for instance to logically combine two or more patterns. Inside a group, the vertical bar (|) separates alternatives.
Example:

"(dog|cat)"
would match "dog" and "cat"
(why? "dog or cat")
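
Grouping matters most when something follows the alternatives. The sketch below, written in Python as an illustrative assumption with hypothetical names, compares a grouped and an ungrouped version of a plural pattern.

```python
import re


def whole_word(pattern, word):
    """True when the pattern matches the entire word."""
    return re.fullmatch(pattern, word) is not None


# Grouped: the "s" applies to both alternatives.
grouped = [whole_word(r"(dog|cat)s", w) for w in ["dogs", "cats", "dog"]]

# Ungrouped: the alternatives are "dog" and "cats", so "dogs" is rejected.
ungrouped = [whole_word(r"dog|cats", w) for w in ["dogs", "cats", "dog"]]

# Parentheses also remember which alternative matched.
which_animal = re.fullmatch(r"(dog|cat)s", "cats").group(1)
```

which_animal ends up holding "cat", the text matched by the parenthesized part.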

[]
any character from the set
The square brackets ([]) can be used as a placeholder for a single character that matches any one of a set of characters. A hyphen between two characters denotes a range, so [a-z] means any lowercase letter. This is confusing at first, but some examples should clarify:

"ta[pb]"
would match "tap" and "tab"
(why? "t-a- followed by one character from the set p, b")

"r[aeiou]t"
would match "rat", "ret", "rot", "rut"
(why? "r- followed by one character from the set of vowels followed by t")

"r[aeiou]+t"
would match all of the above, plus "riot", "root", etc.
(why? "r- followed by one or more vowels followed by t")
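
The bracket examples, sketched in Python for illustration (the language and the helper name whole_word are assumptions, not part of the original post):

```python
import re


def whole_word(pattern, word):
    """True when the pattern matches the entire word."""
    return re.fullmatch(pattern, word) is not None


p_or_b = [w for w in ["tap", "tab", "tan"] if whole_word(r"ta[pb]", w)]

candidates = ["rat", "ret", "rot", "rut", "riot", "root", "rt"]
# A bare set stands for exactly one character ...
one_vowel = [w for w in candidates if whole_word(r"r[aeiou]t", w)]
# ... while a set followed by + stands for one or more characters from the set.
some_vowels = [w for w in candidates if whole_word(r"r[aeiou]+t", w)]
```

"riot" and "root" appear only in some_vowels, since each has two vowels between "r" and "t"; "rt" appears in neither list.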

[^]
any character not from the set
Placing a caret (^) immediately after the opening square bracket negates the set, meaning the character must be one that is not in the set. This is a useful way of specifying a large set of characters; for instance, [^aeiou] covers every consonant, along with digits, punctuation, and any other character that is not a lowercase vowel. Example:

"t[^aeiou]+.*s"
would match "thanks", "this", "trappings", etc.
(why? "t- followed by one or more of any character which is not a vowel followed by zero or more of any character followed by an s")
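
A short sketch of the negated set, assuming Python for illustration (the helper name whole_word and the test strings are invented for the example). It also shows that "not a vowel" is broader than "consonant".

```python
import re


def whole_word(pattern, word):
    """True when the pattern matches the entire word."""
    return re.fullmatch(pattern, word) is not None


pattern = r"t[^aeiou]+.*s"
results = {w: whole_word(pattern, w) for w in ["thanks", "this", "trappings", "tops", "t-s", "t9s"]}
```

"tops" is rejected because the character right after "t" is a vowel, while "t-s" and "t9s" are accepted because a hyphen and a digit are also "not vowels".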

{min,max}
range of occurrences
The curly braces ({}) are used to require that the preceding character or set of characters occur a certain number of times.
Examples:

"[a-z]{3}"
would require exactly 3 consecutive lowercase letters (not necessarily the same letter).

"[0-9]{3,}"
would require 3 or more consecutive digits.

"[A-Z]{2,5}"
would require between 2 and 5 consecutive uppercase letters.
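
The three forms of the braces can be tried out as follows. This is a sketch in Python, chosen for illustration; the helper name whole_word is hypothetical.

```python
import re


def whole_word(pattern, word):
    """True when the pattern matches the entire word."""
    return re.fullmatch(pattern, word) is not None


exactly_three = [whole_word(r"[a-z]{3}", w) for w in ["ab", "abc", "abcd"]]
three_or_more = [whole_word(r"[0-9]{3,}", w) for w in ["12", "123", "12345"]]
two_to_five = [whole_word(r"[A-Z]{2,5}", w) for w in ["A", "AB", "ABCDE", "ABCDEF"]]

# When searching inside longer text, {3} simply takes the first three letters it finds.
first_three = re.search(r"[a-z]{3}", "abcd").group()
```

first_three holds "abc": the count limits how much one match consumes, not how long the surrounding text may be.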

Character Classes
\d    Any digit                                        [0-9]
\D    Any non-digit                                    [^0-9]
\w    Any word character (alphanumeric or underscore)  [a-zA-Z0-9_]
\W    Any non-word character                           [^a-zA-Z0-9_]
\s    Any whitespace character                         [ \t\n\r\f]
\S    Any non-whitespace character                     [^ \t\n\r\f]

Anchor Sequences
^     Beginning of data string
$     End of data string
\b    A word boundary
\B    Any place except a word boundary
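
The classes and anchors can be exercised with a short sketch. Python is assumed here for illustration, and the sample strings are invented. Note that in Python 3 the classes \d, \w, and \s also accept non-ASCII characters unless the re.ASCII flag is given; with plain ASCII input, as below, they behave like the bracket sets in the table.

```python
import re

sample = "Order 66 shipped, ETA 3 days"

digits = re.findall(r"\d+", sample)      # runs of digits
words = re.findall(r"\w+", sample)       # runs of word characters
pieces = re.split(r"\s+", "a  b\tc\nd")  # split on any run of whitespace

starts_with_order = re.search(r"^Order", sample) is not None
ends_with_days = re.search(r"days$", sample) is not None

# \B: an "a" that is NOT at a word boundary, so the standalone "a" is skipped.
inner_a = [m.start() for m in re.finditer(r"\Ba", "a banana")]
```

digits comes out as ['66', '3'], and inner_a lists the offsets 3, 5, and 7, the three a's inside "banana".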

Escape Sequences
\n Newline character, aka linefeed. This is the typical end-of-line character.
\r Carriage return character.
\t Tab character.
\e Escape character.
\xFF The character with the hexadecimal code FF (any two hexadecimal digits may be used in place of "FF").
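
A brief sketch of the escape sequences in use, assuming Python for illustration (the sample strings are invented). One difference from Perl is worth flagging: Python's re module does not recognize \e, so the hexadecimal form \x1b is used for the escape character here.

```python
import re

line = "name\tvalue\r\n"

# \t splits the tab-separated fields once the line ending is removed.
fields = re.split(r"\t", line.rstrip("\r\n"))

# \r\n detects a carriage return followed by a newline.
has_crlf = re.search(r"\r\n", line) is not None

# \x41 is the character with hexadecimal code 41, i.e. "A".
hex_is_a = re.fullmatch(r"\x41", "A") is not None

# \x1b is the escape character, as found in terminal color codes.
has_escape = re.search(r"\x1b", "\x1b[0m") is not None
```

fields ends up as ['name', 'value'], and the three boolean checks are all True for these inputs.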
