Regex Tutorial [Part – 4]

The previous post discussed groups and alternation in regex. This post explains common shorthand symbols and inline modifiers.

Shorthand Symbols

There are many shorthand symbols available in regex. Here is a list of a few of them:

Shorthand Symbol    Meaning
\t                  A tab
\n                  A new line
\r                  A carriage return
\s                  A whitespace character (vertical space, horizontal space, new line etc.)
\S                  A non-whitespace character (equivalent to [^\s])
\d                  A digit (equivalent to [0-9])
\D                  A non-digit (equivalent to [^0-9] or [^\d])
\w                  An alphanumeric character or _ (equivalent to [a-z0-9A-Z_])
\W                  Anything other than an alphanumeric character or _ (equivalent to [^a-z0-9A-Z_] or [^\w])
\b                  Word boundary
\B                  Non-word boundary
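
To make the table concrete, here is a minimal sketch in Python (an assumption, since the tutorial is not tied to one language) using the standard re module. The sample text is made up for illustration. Note that for Python text strings \d and \w also accept non-ASCII digits and letters unless the re.ASCII flag is used, so the bracket equivalents in the table hold exactly only in ASCII mode.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Order 66:\tship_to A-12\n"

digits = re.findall(r"\d", text)        # every single digit
words = re.findall(r"\w+", text)        # runs of letters, digits and _
whitespace = re.findall(r"\s", text)    # space, tab and new line characters
only_digits = re.sub(r"\D", "", text)   # delete everything that is not a digit

print(digits)       # ['6', '6', '1', '2']
print(words)        # ['Order', '66', 'ship_to', 'A', '12']
print(whitespace)   # [' ', '\t', ' ', '\n']
print(only_digits)  # 6612
```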

While most of the symbols are self-explanatory, \b needs an explanation. e.g. the regex cat matches in the string tom cat as well as in tomcatx. On the other hand, the regex \bcat\b matches tom cat but does not match tomcatx, tomcat or cattom. So \b acts as a word boundary and is zero-width, i.e. it does not consume any characters.

Positions of word boundary include:

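  1. Start of string, if the first character is a word character ( \w)
  2. End of string, if the last character is a word character ( \w)
  3. Between two characters where one is a word character ( \w) and the other is a non-word character ( \W).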

e.g. in the string tom cat there are four word boundaries, marked with | here: |tom| |cat|. The first qualifies by rule 1, the last by rule 2, and the two around the space by rule 3.
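
The behaviour of \b can be checked with a short sketch, assuming Python's re module. The four sample strings are the ones discussed above; the variable names are only illustrative.

```python
import re

word_cat = re.compile(r"\bcat\b")
samples = ["tom cat", "tomcatx", "tomcat", "cattom"]

# True only where cat stands as a whole word.
matches = {s: bool(word_cat.search(s)) for s in samples}
print(matches)
# {'tom cat': True, 'tomcatx': False, 'tomcat': False, 'cattom': False}

# \b is zero-width, so every match is an empty string at some position.
boundaries = [m.start() for m in re.finditer(r"\b", "tom cat")]
print(boundaries)  # [0, 3, 4, 7]
```

The positions 0 and 7 are the start and end of the string, while 3 and 4 sit on either side of the space.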

Inline Modifiers

Inline modifiers can be used inside a regex to change how the pattern is matched. The most commonly used ones are:

Inline Modifier    Meaning
(?i)               Ignore case. Matches the pattern without case sensitivity
(?s)               . also matches new line and carriage return
(?m)               ^ and $ act as start and end of each line

An inline modifier is written at the start of the regex, e.g. (?i)aBCd

  1. Multiple inline modifiers can be combined in a single group, e.g. (?smi).*.
  2. Unless specified otherwise, the pattern is matched against the entire string, new lines included, so ^ and $ match only at the start and end of the whole string. With the (?m) modifier, ^ and $ also match at the start and end of every line, where lines are what is obtained after splitting the string on \n.
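
The three modifiers can be tried out with a small sketch, assuming Python's re module (which expects inline modifiers at the very start of the pattern). The three-line sample text is made up for illustration.

```python
import re

# Hypothetical three-line sample text.
text = "first line\nSecond LINE\nthird line"

# (?i): case is ignored, so the lower-case pattern finds "Second LINE".
ignore_case = re.search(r"(?i)second line", text)

# Without (?s) the dot stops at a new line; with (?s) it crosses lines.
dot_default = re.search(r"first.*third", text)
dot_all = re.search(r"(?s)first.*third", text)

# Without (?m), ^ matches only at the start of the whole string.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"(?m)^\w+", text)

# Several modifiers combined in one group.
combined = re.findall(r"(?mi)^\w+ line$", text)

print(ignore_case.group())  # Second LINE
print(dot_default)          # None
print(dot_all is not None)  # True
print(starts_default)       # ['first']
print(starts_multiline)     # ['first', 'Second', 'third']
print(combined)             # ['first line', 'Second LINE', 'third line']
```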

Summary

This part explained two of the most useful concepts in regex. The next part will dive further into regex.

Regex Tutorial [Part – 3]

The previous post discussed metacharacters used in regex. This post explains groups and alternation.

6. Groups

6.1 Capturing groups

Capturing groups are represented using () in regex. Everything inside a group succeeds or fails as a single unit. Capturing groups are used:

  • to remember what is matched.
  • to backreference the matched data later in the pattern.

Whatever is matched by the pattern inside () is stored in a special variable: \1, \2 etc. These variables can be thought of as memory locations that hold the matched data. \1 stores the content of the first capturing group, \2 stores the content of the second capturing group and so on.

e.g. matching the regex pattern (abc(de)) against abcde stores abcde in the first capturing group, i.e. \1, and de in the second capturing group, i.e. \2. These capturing groups can be backreferenced when needed. Groups are numbered from left to right by the position of their opening parenthesis.
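
As a quick check of the numbering, here is a minimal sketch assuming Python's re module, where the content of a group is read with group(n). The surrounding xx and yy in the sample string are arbitrary filler.

```python
import re

# Nested groups are numbered by the position of their opening parenthesis.
m = re.search(r"(abc(de))", "xxabcdeyy")

print(m.group(1))  # abcde  (outer group, opens first)
print(m.group(2))  # de     (inner group, opens second)
print(m.groups())  # ('abcde', 'de')
```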

Backreference

Coming back to the regex shown in the first post for finding the first repeating character: (.)\1. For the string abcdddefff, the following steps take place:

  • Match a and store it in the first capturing group \1. Now the next character should be the same as the one stored in \1.
  • If the match fails, move the starting position forward by one character and repeat the above step until a match is found or the end of the string is reached. Here the attempts starting at a, b and c fail, and the attempt starting at the first d succeeds with the match dd.
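
The steps above can be observed with a short sketch, assuming Python's re module. search returns the first match only, while finditer (used here just for comparison) lists every non-overlapping match.

```python
import re

text = "abcdddefff"

# First place where a character is immediately followed by itself.
first = re.search(r"(.)\1", text)
print(first.group(0))  # dd  (the whole match)
print(first.group(1))  # d   (what \1 referred to)
print(first.start())   # 3   (index where the successful attempt began)

# All non-overlapping repeats, scanning left to right.
all_repeats = [m.group(0) for m in re.finditer(r"(.)\1", text)]
print(all_repeats)     # ['dd', 'ff']
```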


6.2 Non capturing groups

A non-capturing group is used to group part of a pattern without storing the matched data. It is denoted using (?:). Whatever it matches is not stored in any numbered group and cannot be backreferenced.
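
The difference shows up in what the engine stores. A minimal sketch, assuming Python's re module and a made-up input string:

```python
import re

# (ab) is stored as group 1 and (c) as group 2.
capturing = re.search(r"(ab)+(c)", "ababc")
print(capturing.groups())      # ('ab', 'c')

# (?:ab) still groups ab for the + quantifier but stores nothing,
# so (c) becomes group 1.
non_capturing = re.search(r"(?:ab)+(c)", "ababc")
print(non_capturing.groups())  # ('c',)
print(non_capturing.group(0))  # ababc
```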

Points To Remember
  1. There is also a 0th group, which holds the entire content matched by the pattern.
  2. Groups can be used in combination with metacharacters. e.g. each of the following patterns matches and stores the result in the first capturing group
    ([a-z]+):- Match one or more characters in the range a-z
    ([0-9])?:- Match one character in the range 0-9 (optional)
    ([A-Z]{1,10}):- Match at least 1 and at most 10 characters in the range A-Z
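
Both points can be seen together in a small sketch, assuming Python's re module. The three example patterns are joined into one pattern here purely for illustration, so they become groups 1, 2 and 3.

```python
import re

pattern = re.compile(r"([a-z]+)([0-9])?([A-Z]{1,10})")

with_digit = pattern.search("abc7XYZ")
without_digit = pattern.search("abcXYZ")

print(with_digit.group(0))     # abc7XYZ  (0th group: the entire match)
print(with_digit.groups())     # ('abc', '7', 'XYZ')

# The optional group did not take part in the match, so Python reports None.
print(without_digit.groups())  # ('abc', None, 'XYZ')
```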

7. Alternation

Alternation is an OR condition. It is denoted by |. e.g. to match cat or dog, the pattern is cat|dog.

The previous post discussed the character class, which is a simple form of alternation meant only for single characters. A character class cannot be used as an OR condition between words.
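
The contrast between alternation and a character class can be sketched as follows, assuming Python's re module and made-up input strings:

```python
import re

# Alternation works on whole words.
words_found = re.findall(r"cat|dog", "cat, dog, cog")
print(words_found)  # ['cat', 'dog']

# A character class only ever matches one character at a time,
# so [catdog] happily matches every letter of "cog".
chars_found = re.findall(r"[catdog]", "cog")
print(chars_found)  # ['c', 'o', 'g']
```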

Summary

This part focused on groups and alternation. The next part will deal with other aspects of regex.

Regex Tutorial [Part – 2]

The previous post gave a brief history of regex and its applications. It's time to get a basic understanding of regex. This post is about metacharacters.

Metacharacters:- Characters that have a special meaning for regex engines are known as metacharacters.

1. Starting and ending of string

regex provides the ^ and $ metacharacters, which indicate the start of the string and the end of the string respectively. These metacharacters are anchors and are zero-width, meaning that they do not consume any character(s).
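
A minimal sketch of the two anchors, assuming Python's re module and made-up input strings:

```python
import re

starts_with_cat = re.search(r"^cat", "cat food") is not None   # cat at the start
not_at_start = re.search(r"^cat", "tomcat") is not None        # cat is not at the start
ends_with_cat = re.search(r"cat$", "tomcat") is not None       # cat at the end
whole_string = re.search(r"^cat$", "cat") is not None          # nothing but cat

print(starts_with_cat, not_at_start, ends_with_cat, whole_string)
# True False True True
```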

2. Match any character

The . metacharacter matches any single character in the string except new line ( \n ) and, in some engines, carriage return ( \r ).
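
How the dot treats line endings depends on the engine. The sketch below assumes Python's re module, where by default the dot refuses only \n and does accept \r; other engines may behave differently.

```python
import re

# The dot stands for exactly one character, but not a new line.
three_chars = re.findall(r"c.t", "cat cut c\nt c-t")
print(three_chars)     # ['cat', 'cut', 'c-t']

# Python-specific behaviour: a carriage return is accepted by the dot.
dot_matches_cr = re.search(r"a.b", "a\rb") is not None
print(dot_matches_cr)  # True
```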

3. Quantifiers

As the name suggests, quantifiers are related to counting. There are 4 types of quantifiers supported in regex.

NOTE:- The meaning of group will be explained in a later post

Symbol       Meaning
?            Matches the previous character or group zero or one time, i.e. it is optional. e.g. the regex ba?c means match b, optionally followed by a, followed by c. So the regex matches the string bac as well as bc because a is optional
*            Matches the previous character/group 0 or more times (as many as possible). e.g. the regex a* can match the strings a, aa, aaaaaaa etc. and it can also match an empty string or a string with NO a, because it is happy to match zero characters
+            Matches the previous character/group one or more times (as many as possible). e.g. the regex a+ can match the strings a, aa, aaaaaaa etc. but NOT an empty string, unlike a*
{min,max}    Matches the previous character/group at least min times and at most max times. Depending upon the requirement the interval can be open, like {min,}, meaning match at least min times. The open form {,max} is not supported by every engine, so {0,max} is the safer way to write it
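
The four quantifiers can be exercised with a short sketch, assuming Python's re module. The helper name matches_fully is hypothetical; it simply reports whether the whole string fits the pattern.

```python
import re

def matches_fully(pattern, text):
    """Return True if the entire text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

# ? makes the previous character optional.
optional = [matches_fully(r"ba?c", s) for s in ["bac", "bc", "baac"]]
print(optional)  # [True, True, False]

# * accepts zero occurrences, + needs at least one.
star_empty = matches_fully(r"a*", "")
plus_empty = matches_fully(r"a+", "")
print(star_empty, plus_empty)  # True False

# {min,max} bounds the repetition count.
bounded = [matches_fully(r"a{2,3}", s) for s in ["a", "aa", "aaa", "aaaa"]]
print(bounded)  # [False, True, True, False]

# Quantifiers take as much as possible (maximum possible).
greedy = re.search(r"a{2,}", "caaaat").group()
print(greedy)  # aaaa
```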

4. Character Class

4.1 Character Class

Character class is denoted by []. Each character inside a character class is treated as a separate alternative, and the class as a whole matches exactly one character. e.g. [12345] means match 1 or 2 or 3 or 4 or 5. In simple words it can be understood as an OR condition for single characters.

POINTS TO REMEMBER
  • In a character class, there is no concept of matching a string. So the character class [cat] does not mean match the word cat literally. It means match either c or a or t.
  • Sometimes people use | (alternation) inside a character class thinking it will act as an OR condition, which is wrong. Using [a|b] actually means match a or | (literally) or b.
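
Both points can be verified with a small sketch, assuming Python's re module and made-up input strings:

```python
import re

# [12345] picks out single characters that are one of the listed digits.
listed_digits = re.findall(r"[12345]", "a1b9c5")
print(listed_digits)  # ['1', '5']

# [cat] matches c, a or t individually, not the word cat.
single_letters = re.findall(r"[cat]", "tack")
print(single_letters)  # ['t', 'a', 'c']

# Inside a class, | is just another literal character.
with_pipe = re.findall(r"[a|b]", "a|b|c")
print(with_pipe)  # ['a', '|', 'b', '|']
```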

4.2 Range in character class

A range in a character class is denoted using the - sign. To match any character of the English alphabet from A to Z, the following can be used: [A-Z].

This can be done for any valid ASCII or Unicode range. The most commonly used ranges include [a-z] and [0-9]. Moreover, these ranges can be combined in a character class as [A-Za-z0-9].

It means match any character in the range A to Z or a to z or 0 to 9. The ordering doesn’t matter. So the above is equivalent to [a-zA-Z0-9] as long as each range is defined correctly.

POINTS TO REMEMBER
  • Sometimes when writing a range people mistakenly write it as [A-z]. This is incorrect because it uses lower case z instead of capital Z. It denotes match any character from ASCII code 65 (A) to 122 (z), which includes the unintended characters [, \, ], ^, _ and ` that lie between ASCII code 90 (Z) and 97 (a).
  • The meaning of - inside a character class is special. It denotes a range as explained above. What if we want to match the - character literally? We can’t put it just anywhere, otherwise it will start denoting a range. In this case we have to put - at the start of the character class like [-A-Z], or at the end of the character class like [A-Z-], or escape it if it is to be used in the middle like [A-Z\-a-z].
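
The [A-z] pitfall and the three ways of writing a literal hyphen can be checked with a short sketch, assuming Python's re module and made-up input strings:

```python
import re

sample = "a_Z[9"

# [A-z] spans ASCII 65 to 122, so _ and [ slip through.
too_wide = re.findall(r"[A-z]", sample)
letters_only = re.findall(r"[A-Za-z]", sample)
print(too_wide)      # ['a', '_', 'Z', '[']
print(letters_only)  # ['a', 'Z']

# A literal hyphen: first, last, or escaped in the middle.
hyphen_first = re.findall(r"[-A-Z]", "A-b")
hyphen_last = re.findall(r"[A-Z-]", "A-b")
hyphen_escaped = re.findall(r"[A-Z\-a-z]", "A-b")
print(hyphen_first)    # ['A', '-']
print(hyphen_last)     # ['A', '-']
print(hyphen_escaped)  # ['A', '-', 'b']
```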

4.3 Negated character class

Negated character class is denoted by [^]. The caret sign ^ means match any character except the ones present in the character class. e.g. [^cat] means match any character except c or a or t.

POINTS TO REMEMBER

The caret sign (^) means negation only if it is at the start of the character class. If it is anywhere else in the character class, it is treated as a normal caret character without any special meaning.
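
A minimal sketch of both caret positions, assuming Python's re module and made-up input strings:

```python
import re

# Caret first: everything except c, a and t.
negated = re.findall(r"[^cat]", "catnip")
print(negated)        # ['n', 'i', 'p']

# Caret elsewhere: an ordinary character to be matched literally.
literal_caret = re.findall(r"[c^t]", "c^at")
print(literal_caret)  # ['c', '^', 't']
```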

5. The Great Escape

The metacharacters discussed here are ^, $, ., ?, *, +, {, }, [ and ]. Now comes the question: what if we want to match these characters literally? To match any of them literally, just escape it using \. So to match $ use \$. In simple words, to match any metacharacter literally, escape it.
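
Escaping can be sketched as follows, assuming Python's re module. The price string is made up, and re.escape is a Python convenience that adds the backslashes automatically; other languages have their own equivalents.

```python
import re

price = "cost: $5.00"

# Unescaped, $ is an end-of-string anchor, so nothing can follow it.
unescaped = re.search(r"$5.00", price)
print(unescaped)        # None

# Escaped, $ and . are matched literally.
escaped = re.search(r"\$5\.00", price)
print(escaped.group())  # $5.00

# re.escape builds a pattern that matches the given text literally.
literal = "1+1=2?"
round_trip = re.fullmatch(re.escape(literal), literal) is not None
print(round_trip)       # True
```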

Summary

This post dealt with metacharacters that are commonly used in regex. The next part will introduce further details of regex.

Regex Tutorial [Part – 1]

What is regex and how is it useful for anyone?

Regular Expression (also known as regex, which is what we will call it from now on) is one of the most powerful tools for manipulating text. Text manipulation that would otherwise require multiple lines of code can often be done in a single line using regex.

At first, regex seems cryptic as well as complicated, but it is not. Once someone understands its concepts, they will appreciate its beauty, simplicity and power. It can be used by anyone; one does not need to be an expert developer to understand it. It is widely used in:

  1. Data Cleaning in Machine Learning
  2. Manipulating text in Text Editors/Excel
  3. Writing code to perform various operations on text

and many more places.

A brief History

Perl was one of the first programming languages to include support for regex. Later on, developers of other programming languages realised the value of regular expressions and started building regex engines into those languages. The most popular engine is known as PCRE (Perl Compatible Regular Expressions). The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5. Languages with Perl-style regex support include PHP, Java, Python and many others.

An example where regex is handy

NOTE: This section can be safely skipped by beginners/people with no coding background.

Example: Find the first repeating character in string abcddef (d here)

Code in some programming language
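
As one possible illustration, here is a hand-written version in Python (an assumed choice of language). The function name first_repeating_char is hypothetical, and "repeating" is taken to mean a character immediately followed by itself.

```python
def first_repeating_char(text):
    """Return the first character that is immediately repeated, or None."""
    # Walk over each pair of neighbouring characters.
    for current, following in zip(text, text[1:]):
        if current == following:
            return current
    return None

print(first_repeating_char("abcddef"))  # d
print(first_repeating_char("abcdef"))   # None
```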

Regex Solution

(.)\1

Using the above pattern and the functions provided by the regex library of the programming language, we can find what is needed.
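
A minimal sketch of the regex route, assuming Python and its standard re module. The function name first_repeating_char_regex is hypothetical; the pattern (.)\1 captures one character and then requires the same character again.

```python
import re

def first_repeating_char_regex(text):
    """Return the first character that is immediately repeated, or None."""
    match = re.search(r"(.)\1", text)
    return match.group(1) if match else None

print(first_repeating_char_regex("abcddef"))  # d
print(first_repeating_char_regex("abcdef"))   # None
```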

Summary

regex is a tool that everyone should know of. This part was just an introduction to the history of regex and the places where it can be used. The next part will dive more into its functionality.