The following examples illustrate a few specifications that could be expressed in a regular expression:
- The sequence of characters "car" appearing consecutively in any context, such as in "car", "cartoon", or "bicarbonate"
- The sequence of characters "car" occurring in that order with other characters between them, such as in "Icelander" or "chandler"
- The word "car" when it appears as an isolated word
- The word "car" when preceded by the word "blue" or "red"
- The word "car" when not preceded by the word "motor"
- A dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits (for example, "$100" or "$245.99").
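As a rough sketch, the specifications above could be written with Python's `re` module as follows. The sample sentences and variable names are hypothetical, chosen only to exercise each pattern, and other regex flavors may spell the lookbehind differently.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "A red car passed a motor car, a cartoon van, and a blue car."

# "car" appearing consecutively in any context (also hits "cartoon").
anywhere = re.findall(r"car", text)

# "c", "a", "r" in that order with other characters allowed in between.
spread_out = bool(re.search(r"c.*a.*r", "chandler"))

# "car" as an isolated word: \b marks a word boundary on each side.
isolated = re.findall(r"\bcar\b", text)

# "car" preceded by "blue" or "red"; findall returns the captured colour.
after_colour = re.findall(r"\b(blue|red) car\b", text)

# "car" not preceded by "motor " (a negative lookbehind).
not_after_motor = re.findall(r"(?<!motor )\bcar\b", text)

# A dollar sign, one or more digits, then optionally a period and two digits.
prices = re.findall(r"\$\d+(?:\.\d{2})?", "Paid $100 and then $245.99")

print(len(anywhere), len(isolated), after_colour, len(not_after_motor), prices)
```

The dollar sign is escaped as `\$` because an unescaped `$` is an anchor, not a literal character.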
Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. Some of these languages, including Perl, Ruby, Awk, and Tcl, have fully integrated regular expressions into the syntax of the core language itself. Others like C, C++, .NET, Java, and Python instead provide access to regular expressions only through libraries. Utilities provided by Unix distributions—including the editor ed and the filter grep—were the first to popularize the concept of regular expressions.
As an example of the syntax, the regular expression \bex can be used to search for all instances of the string "ex" that occur after "word boundaries" (signified by the \b). Thus, in the string "Texts for experts", \bex matches the "ex" in "experts" but not in "Texts" (because the "ex" occurs inside a word and not immediately after a word boundary).
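The `\bex` example can be checked directly. This is a minimal sketch using Python's `re` module, whose `\b` behaves like the word boundary described above; the variable names are only illustrative.

```python
import re

text = "Texts for experts"

# With \b, only the "ex" that starts a word is found.
bounded = [(m.start(), m.group()) for m in re.finditer(r"\bex", text)]

# Without \b, the "ex" inside "Texts" is found as well.
unbounded = [m.start() for m in re.finditer(r"ex", text)]

print(bounded)    # [(10, 'ex')]
print(unbounded)  # [1, 10]
```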
Many modern computing systems provide wildcard characters for matching filenames in a file system. This is a core capability of many command-line shells and is also known as globbing. Wildcard patterns are generally less expressive than regular expressions: they can describe only limited forms of patterns.
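To make the difference concrete, here is a small sketch that compares a shell-style wildcard with a regular expression, using Python's standard `fnmatch` and `re` modules. The filenames are hypothetical.

```python
import fnmatch
import re

# Hypothetical filenames, used only for illustration.
names = ["notes.txt", "notes.txt.bak", "car.txt", "cart.txt"]

# Wildcard: "*" stands for any run of characters.
glob_hits = fnmatch.filter(names, "*.txt")

# The same test as a regular expression: ".*" plays the role of "*",
# and the literal dot has to be escaped.
regex_hits = [n for n in names if re.fullmatch(r".*\.txt", n)]

# A regex can also say "an optional t", which the basic wildcards
# (*, ? and [...]) cannot: in a wildcard, "?" means exactly one character.
optional_t = [n for n in names if re.fullmatch(r"cart?\.txt", n)]

print(glob_hits)
print(optional_t)
```

Note that `?` means different things in the two notations: one arbitrary character in a wildcard, but "zero or one of the previous item" in a regular expression.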
| The Perl You Need to Know: Basic Regular Expression Syntax | | |
| . | any single character | The dot (.) can be used as a placeholder for any character (by default, any character except a newline). Examples: "do." would match "dog", "dot", "doe", etc. "d..r" would match "door" and "deer". |
| * | zero or more of the previous character | The asterisk (*) specifies that zero or more instances of the previous character should exist in sequence. Examples: "do.*" would match "dog", "done", "doppelganger", etc. (why? "d-o- followed by zero or more of any character") "to*" would match "to" and "too" (why? "t-o- followed by zero or more o's") "fre*.." would match "frat", "free", "from" (why? "f-r- followed by zero or more e's followed by any two characters") |
| + | one or more of the previous character | The plus sign (+) demands that there be at least one of the previous character in sequence; similar to (*) but slightly more strict. Examples: "fre+.." would match "freak", "freeze", "fresh" (why? "f-r- followed by one or more e's followed by any two characters") |
| ? | zero or one of the previous character | The question mark (?) says that there should be zero or one of the previous character but not more than one. Unlike (*) and (+), it never allows the character to repeat. Examples: "ton?e" would match "toe" and "tone" (why? "t-o- followed by zero or one n followed by e") |
| ( ) | grouping | The parentheses ( ) are used to group together patterns, for instance, to logically combine two or more patterns. Example: (dog|cat) would match "dog" and "cat" (why? "dog or cat") |
| [] | any character from the set | The square brackets ([]) can be used as a placeholder for a single character which matches any of a set of characters. This can be confusing at first, but some examples should clarify: "ta[pb]" would match "tap" and "tab" (why? "t-a- followed by one character from the set of pb") "r[aeiou]t" would match "rat", "ret", "rot", "rut" (why? "r- followed by one character from the set of vowels followed by t") "r[aeiou]+t" would match "rat" (plus all of the above), "riot", "root", etc. (why? "r- followed by one or more vowels followed by t") |
| [^] | any character not from the set | Placing a caret (^) as the first character inside the square brackets ([]) negates the set, meaning the character must match any character not within the set. This is a useful way of specifying a large set of characters; for instance, consonants are "not vowels" (although [^aeiou] also matches digits, spaces, and punctuation). Examples: "t[^aeiou]+.*s" matches "thanks", "this", "trappings", etc. (why? "t- followed by one or more of any character which is not a vowel followed by zero or more of any character followed by an s") |
| {min,max} | range of occurrences | The curly braces ({}) are used to require that the preceding character or set of characters occur a certain number of times. Examples: "[a-z]{3}" would require that a lowercase letter appear 3 consecutive times. "[0-9]{3,}" would require that a digit appear 3 or more consecutive times. "[A-Z]{2,5}" would require that an uppercase letter appear between 2 and 5 consecutive times. |
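The examples in the table can be verified mechanically. The sketch below uses Python's `re` module, which accepts all of the basic syntax shown here, and `re.fullmatch` so that each pattern must account for the whole word. The `examples` dictionary and the negative cases are illustrative choices, not part of the original table.

```python
import re

# Pattern -> words the table says it should match.
examples = {
    r"do.": ["dog", "dot", "doe"],
    r"d..r": ["door", "deer"],
    r"do.*": ["dog", "done"],
    r"to*": ["to", "too"],
    r"fre*..": ["frat", "free", "from"],
    r"fre+..": ["freak", "freeze", "fresh"],
    r"ton?e": ["toe", "tone"],
    r"(dog|cat)": ["dog", "cat"],
    r"ta[pb]": ["tap", "tab"],
    r"r[aeiou]+t": ["rat", "riot", "root"],
    r"t[^aeiou]+.*s": ["thanks", "this", "trappings"],
    r"[A-Z]{2,5}": ["AB", "ABCDE"],
}

# True only if every listed word is matched in full by its pattern.
all_match = all(
    re.fullmatch(pattern, word)
    for pattern, words in examples.items()
    for word in words
)

# A few words that should NOT match, to show where each rule bites.
rejected = [
    re.fullmatch(r"fre+..", "from"),        # + needs at least one "e"
    re.fullmatch(r"ton?e", "tonne"),        # ? allows at most one "n"
    re.fullmatch(r"[A-Z]{2,5}", "ABCDEF"),  # six letters exceed the maximum
    re.fullmatch(r"[0-9]{3,}", "12"),       # fewer than three digits
]

print(all_match, rejected)
```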
| Character Classes | | Anchor Sequences | |
| \d | Any digit [0-9] | ^ | Beginning of data string |
| \D | Any non-digit [^0-9] | $ | End of data string |
| \w | Any word character (alphanumeric or underscore) [a-zA-Z0-9_] | \b | A word boundary |
| \W | Any non-word character [^a-zA-Z0-9_] | \B | Any place except a word boundary |
| \s | Any whitespace character [ \t\n\r\f] | | |
| \S | Any non-whitespace character [^ \t\n\r\f] | | |
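A short sketch of the character classes and anchors in use, again with Python's `re` module and a made-up sample string. One caveat: in Python 3 these classes also match non-ASCII digits and letters unless the `re.ASCII` flag is passed, so the bracketed sets in the table describe the ASCII case.

```python
import re

# Hypothetical sample string.
text = "Room 42, floor 7"

digits = re.findall(r"\d+", text)       # runs of digits
words = re.findall(r"\w+", text)        # runs of word characters
separators = re.findall(r"\W+", text)   # runs of non-word characters
tokens = re.split(r"\s+", text)         # split on whitespace

starts_with_room = bool(re.search(r"^Room", text))  # ^ anchors at the start
ends_with_digit = bool(re.search(r"\d$", text))     # $ anchors at the end

# \B requires that we are NOT at a word boundary, so this finds "oo"
# only when it sits strictly inside a word.
inner_oo = re.findall(r"\Boo\B", text)

print(digits, words, separators, tokens, inner_oo)
```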
| Escape Sequences | |
| \n | Newline character, aka linefeed. This is the typical end-of-line character. |
| \r | Carriage return character. |
| \t | Tab character. |
| \e | Escape character. |
| \xFF | The character with the given hexadecimal code, where "FF" stands for any two hexadecimal digits. |
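The escape sequences can be tried out as below. This sketch uses Python rather than Perl, and the sample strings are hypothetical. One difference worth flagging: Python's `re` module does not accept `\e`, so the escape character is written by its hexadecimal code, `\x1b`, instead.

```python
import re

# Hypothetical tab-separated text with one Windows-style line ending.
text = "name\tvalue\r\nA\tB\n"

# \t in the pattern matches a tab character.
fields = re.split(r"\t", "name\tvalue")

# \r?\n matches a newline with an optional carriage return before it.
lines = re.split(r"\r?\n", text)

# \x41 is the character with hexadecimal code 41, an uppercase "A".
hex_hits = re.findall(r"\x41", text)

# The escape character (hex 1B), written as \x1b because Python's re
# module has no \e escape.
esc_hits = re.findall(r"\x1b", "\x1b[0mplain")

print(fields, lines, hex_hits, len(esc_hits))
```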