Regex Tutorial [Part – 2]

In the previous post, we discussed the brief history and applications of regex. It's time to get a basic understanding of regex. This post is about metacharacters.

Metacharacters:- Characters that have a special meaning for regex engines are known as metacharacters.

1. Starting and ending of string

Regex provides the ^ and $ metacharacters, which indicate the start of the string and the end of the string respectively. These metacharacters are anchors (in multi-line mode they match at the start and end of each line) and are zero-width, meaning that they match a position and do not consume any character(s).
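
As a small illustration of the anchors described above, here is a sketch in Python. The choice of Python's built-in re module and the sample string are assumptions made for this example; the post itself does not name an engine.

```python
import re

text = "cat and dog"  # sample string, chosen only for illustration

# ^cat matches only if the string starts with "cat"
starts_with_cat = re.search(r"^cat", text) is not None   # True

# dog$ matches only if the string ends with "dog"
ends_with_dog = re.search(r"dog$", text) is not None     # True

# "dog" is present, but not at the start, so ^dog fails
starts_with_dog = re.search(r"^dog", text) is not None   # False

# Anchors are zero-width: the matched text is just "cat",
# the ^ itself contributes no characters.
matched_text = re.search(r"^cat", text).group()          # "cat"
```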

2. Match any character

The . metacharacter matches any single character in the string except new line (\n). Some engines, such as JavaScript, also exclude carriage return (\r).
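
The behaviour of . can be checked with a short sketch. It assumes Python's re module in its default mode (no DOTALL flag), where only the new line character is excluded; other engines may differ on the carriage return.

```python
import re

def dot_matches(middle):
    # True if the pattern c.t matches "c" + middle + "t" as a whole
    return re.fullmatch(r"c.t", "c" + middle + "t") is not None

letter_ok = dot_matches("a")      # True: . matches a letter
symbol_ok = dot_matches("#")      # True: . matches a symbol
newline_ok = dot_matches("\n")    # False: . does not match a new line
# In Python the carriage return IS matched by . (engine-specific behaviour)
carriage_ok = dot_matches("\r")   # True
```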

3. Quantifiers

As the name suggests, quantifiers are related to counting: they specify how many times the preceding character or group may repeat. There are 4 types of quantifiers supported in regex.

NOTE:- The meaning of group will be explained in a later post.

Symbol       Meaning
?            Matches the previous character or group zero or one time, i.e. it makes it optional. e.g. the regex ba?c matches b, optionally followed by a, followed by c. So the regex can match the string bac as well as bc because a is optional.
*            Matches the previous character/group 0 or more times (as many as possible). e.g. the regex a* can match the strings a, aa, aaaaaaa etc. It can also match an empty string or a string with NO a, because it is happy to match zero characters.
+            Matches the previous character/group one or more times (as many as possible). e.g. the regex a+ can match the strings a, aa, aaaaaaa etc. but NOT an empty string, unlike a*.
{min,max}    Matches the previous character/group at least min times and at most max times, with no space after the comma. Depending upon the requirement, the interval can be open like {min,} meaning match at least min times. The form {,max} is not portable: some engines (Python, for example) accept it, while many others do not, so write {0,max} instead.
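
The four quantifiers in the table can be tried out with the following sketch. It assumes Python's re module; the helper name matches is hypothetical and exists only for this example.

```python
import re

def matches(pattern, text):
    # Hypothetical helper: True if the WHOLE text matches the pattern
    return re.fullmatch(pattern, text) is not None

# ? makes the preceding "a" optional
optional = [matches(r"ba?c", s) for s in ("bc", "bac", "baac")]
# -> [True, True, False]

# * allows zero or more, so even the empty string matches
star = [matches(r"a*", s) for s in ("", "a", "aaaa")]
# -> [True, True, True]

# + requires at least one occurrence
plus = [matches(r"a+", s) for s in ("", "a", "aaaa")]
# -> [False, True, True]

# {2,3} means at least 2 and at most 3 times
bounded = [matches(r"a{2,3}", s) for s in ("a", "aa", "aaa", "aaaa")]
# -> [False, True, True, False]

# {2,} means at least 2 times, with no upper limit
at_least = [matches(r"a{2,}", s) for s in ("a", "aa", "aaaaa")]
# -> [False, True, True]

# "maximum possible": a* takes all three a's, not just one
greedy = re.match(r"a*", "aaab").group()   # "aaa"
# With no a at the start, a* still succeeds with an empty match
empty = re.match(r"a*", "bbb").group()     # ""
```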

4. Character Class

4.1 Character Class

A character class is denoted by []. Each character inside a character class is treated as a separate alternative, and the class as a whole matches exactly one character. e.g. [12345] means match 1 or 2 or 3 or 4 or 5. In simple words, it can be understood as an OR condition for single characters.

POINTS TO REMEMBER
  • In a character class, there is no concept of matching a string. So the character class [cat] does not mean that it should match the word cat literally. It means that it should match either c or a or t.
  • Sometimes people use | (alternation) inside a character class thinking it will act as an OR condition, which is wrong. Using [a|b] actually means match a or | (literally) or b.
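
Both points can be verified with a short sketch, again assuming Python's re module; the sample strings are made up for illustration.

```python
import re

# [cat] matches ONE character that is c, a or t, wherever it occurs
single_chars = re.findall(r"[cat]", "tic tac")
# -> ['t', 'c', 't', 'a', 'c']

# The class never matches the three-letter word "cat" as a whole
whole_word = re.fullmatch(r"[cat]", "cat") is not None   # False

# Inside a class, | is just a literal pipe character
with_pipe = re.findall(r"[a|b]", "a|b|c")
# -> ['a', '|', 'b', '|']
```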

4.2 Range in character class

A range in a character class is denoted using the - sign. To match any character in the English alphabet from A to Z, the following can be used: [A-Z].

This can be done for any valid ASCII or Unicode range. The most commonly used ranges include [a-z] and [0-9]. Moreover, these ranges can be combined in a character class as [A-Za-z0-9].

It means match any character in the range A to Z or a to z or 0 to 9. The ordering of the ranges doesn't matter, so the above is equivalent to [a-zA-Z0-9], as long as each range is defined correctly.

POINTS TO REMEMBER
  • Sometimes when writing a range people mistakenly write it as [A-z]. This is incorrect, as it uses lower case z instead of capital Z. It denotes match any character from ASCII 65 (A) to ASCII 122 (z). This includes unintended characters between ASCII 90 (Z) and ASCII 97 (a), namely [ \ ] ^ _ and `.
  • The meaning of - inside a character class is special. It denotes a range, as explained above. What if we want to match the - character literally? We can't put it just anywhere, otherwise it will start denoting ranges. In this case we have to put - at the start of the character class like [-A-Z], or at the end of the character class like [A-Z-], or escape it if it is to be used in the middle like [A-Z\-a-z].
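
The [A-z] pitfall and the three ways of writing a literal - can be checked with the sketch below. It assumes Python's re module; the variable names are made up for this example.

```python
import re

# The six ASCII characters between Z (90) and a (97): [ \ ] ^ _ `
between = [chr(code) for code in range(91, 97)]

# [A-z] wrongly accepts every one of them
wrong_range_accepts = all(re.fullmatch(r"[A-z]", ch) for ch in between)   # True

# The intended class [A-Za-z] accepts none of them
right_range_accepts = any(re.fullmatch(r"[A-Za-z]", ch) for ch in between)  # False

# A literal - at the start, at the end, or escaped in the middle
hyphen_first = re.findall(r"[-A-Z]", "A-b")        # ['A', '-']
hyphen_last = re.findall(r"[A-Z-]", "A-b")         # ['A', '-']
hyphen_escaped = re.findall(r"[A-Z\-a-z]", "A-b")  # ['A', '-', 'b']
```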

4.3 Negated character class

A negated character class is denoted by [^]. The caret sign ^ means match any single character except the ones present in the character class. e.g. [^cat] means match any character except c or a or t.

POINTS TO REMEMBER

  • The caret sign (^) means negation only if it is at the start of the character class. If it is anywhere else in the character class, it is treated as a normal caret character without any special meaning.
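
A short sketch of negation and of the caret's position, assuming Python's re module; the sample strings are illustrative only.

```python
import re

# [^cat] matches any character that is NOT c, a or t
not_cat = re.findall(r"[^cat]", "catch")      # ['h']

# A common use: remove everything that is not a digit
digits_only = re.sub(r"[^0-9]", "", "a1b2c3")  # "123"

# A caret that is not first in the class is an ordinary character
literal_caret = re.findall(r"[c^t]", "c^t")   # ['c', '^', 't']
```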

5. The Great Escape

The metacharacters discussed here are ^, $, ., ?, *, +, {, }, [ and ]. Now comes the question: what if we want to match these characters literally? To match them literally, just escape them using \. So to match $ use \$. The backslash is itself a metacharacter, so to match it use \\. In simple words, to match any metacharacter literally, escape it.
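
The effect of escaping can be seen in the sketch below, which assumes Python's re module. The function re.escape is Python's own helper for escaping a whole string; other engines may offer a different facility or none at all.

```python
import re

price = "cost: $5"  # sample string, chosen only for illustration

# Unescaped, $ is the end-of-string anchor, so "5" can never follow it
unescaped = re.search(r"$5", price) is not None    # False

# Escaped, \$ matches a literal dollar sign
escaped = re.search(r"\$5", price).group()         # "$5"

# A literal that contains metacharacters (+ and ?)
literal = "1+1=2?"

# Used as-is, + and ? act as quantifiers and the match fails
as_pattern = re.fullmatch(literal, literal) is not None              # False

# re.escape adds the backslashes for us
as_escaped = re.fullmatch(re.escape(literal), literal) is not None   # True
```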

Summary

This post dealt with metacharacters that are commonly used in regex. The next part will introduce further details of regex.
