
Regular expressions for beginners

 


What are regular expressions (RegEx)?

If you have ever worked with the command line, you have probably used file-name masks. For example, to delete all the files in the current directory whose names start with “d”, you can run rm d*.

 

Regular expressions are similar to file-name masks, but they are a much more powerful tool for searching strings, checking whether they match a particular pattern, and similar tasks. The full name is Regular Expressions, or RegExp for short. Strictly speaking, a regexp is a special language for describing string patterns.

 

Implementations differ between programming languages, but not by much. In this article, we mostly follow the Perl Compatible Regular Expressions (PCRE) implementation.

 

Syntax basics

First, note that any string without special symbols is itself a regexp. The pattern Haha matches the string “Haha”, and only that string. Regexps are case sensitive, so the string “haha” (in lowercase) does not match this pattern.

 

However, you need to be careful here. Like any other language, regexp has special symbols that must be escaped. Here is the full list: . ^ $ * + ? { } [ ] \ | ( ). Escaping is done in the usual way: add \ before the special symbol.
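The idea of literal patterns and escaping can be tried out with a short sketch. It uses Python's built-in re module as an assumption (the article itself follows PCRE, but this behaviour is the same), and the variable names are purely illustrative.

```python
import re

# A plain string is a pattern that matches only itself, case-sensitively.
literal_hit = re.fullmatch(r"Haha", "Haha") is not None    # True
literal_miss = re.fullmatch(r"Haha", "haha") is not None   # False

# "+" is a special symbol, so it has to be escaped to be matched literally.
escaped_hit = re.search(r"1\+1", "1+1=2") is not None      # True
# Unescaped, "1+1" means "one or more 1s followed by a 1", which is not in the text.
unescaped_hit = re.search(r"1+1", "1+1=2") is not None     # False

# re.escape adds the backslashes automatically.
auto_escaped = re.escape("a.b")

print(literal_hit, literal_miss, escaped_hit, unescaped_hit, auto_escaped)
```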

 

The set of symbols

Suppose we want to find all the interjections that stand for laughter. Haha alone will not be enough, because “Hehe”, “Hoho”, and “Hihi” do not match it.

 

This is where sets come to our aid. Instead of a single specific symbol, we can write a whole list, and if any of the listed symbols appears at that place in the string, the string is considered a match. Sets are written in square brackets: the pattern [abcd] matches any one of the characters “a”, “b”, “c” or “d”.

 

Inside a set, most special symbols do not need escaping, although putting \ before them is not an error. You do need to escape the characters “\” and “^”, and preferably “]” (for example, [][] means either of the symbols “]” or “[”, while [[]x] is only the sequence “[x]”). The seemingly odd behaviour of “]” is in fact governed by definite rules, but it is much easier to simply escape this symbol than to remember them. The “-” symbol must also be escaped, because it is used to specify ranges (see below).

 

If you write ^ right after the [ symbol, the set takes on the opposite meaning: any symbol other than those listed is considered a match. So, the pattern [^xyz] matches any symbol except “x”, “y” or “z”.

 

Thus, if we write [Hh][aoie]h[aoie], then each of the strings “Haha”, “Hehe”, “Hihi” and even “Hoho” will match the pattern.
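A minimal sketch of sets in action, written with Python's re module (an assumption, chosen only for illustration). It checks the laugh pattern in its intended form [Hh][aoie]h[aoie] and a negated set; the sample word “Huhu” is added here just to show a non-match.

```python
import re

# One symbol from each set must appear at the corresponding position.
laugh = re.compile(r"[Hh][aoie]h[aoie]")

words = ["Haha", "Hehe", "Hihi", "Hoho", "Huhu"]
matched = [w for w in words if laugh.fullmatch(w)]

# A negated set matches any single symbol that is NOT listed.
not_xyz = re.findall(r"[^xyz]", "xaybz")

print(matched)   # "Huhu" is missing: "u" is not in the set
print(not_xyz)
```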

 

Predefined Symbol Classes

Some frequently used sets have special shorthand. To describe any whitespace symbol (space, tab, line break), use \s; for digits, use \d; for Latin letters, digits and “_”, use \w.

 

If you need to describe just any symbol, use “.” (by default it matches any symbol except a line break). If you write these classes with an uppercase letter (\S, \D, \W), they change their meaning to the opposite: any non-whitespace character, any character that is not a digit, and any character other than a Latin letter, digit or underscore, respectively.

 

Regular expressions can also check the position of a match relative to the rest of the text. The expression \b denotes a word boundary, \B a position that is not a word boundary, ^ the beginning of the text, and $ the end. So, in the string “Java and JavaScript”, the pattern \bJava\b finds the first 4 characters, and the pattern \bJava\B finds the characters from the 10th to the 13th (part of the word “JavaScript”).
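The predefined classes and position checks can be sketched as follows, again assuming Python's re module. Two caveats about this choice: Python reports positions as 0-based spans with an exclusive end, so the article's “characters 10 to 13” become (9, 13), and for ordinary Python strings \w also accepts non-Latin letters unless the re.ASCII flag is used. The sample sentences are made up for the example.

```python
import re

text = "Java and JavaScript"

# \b...\b: "Java" as a whole word.
whole_word = re.search(r"\bJava\b", text).span()    # (0, 4)

# \b...\B: "Java" at the start of a longer word.
word_prefix = re.search(r"\bJava\B", text).span()   # (9, 13)

# \d matches a single digit, \w+ a run of letters, digits and underscores.
digits = re.findall(r"\d", "Room 42, floor 3")
words = re.findall(r"\w+", "snake_case and 2 words")

# \S is the opposite of \s: any non-whitespace symbol.
non_blank = re.findall(r"\S+", "a  b\tc")

print(whole_word, word_prefix, digits, words, non_blank)
```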

Ranges

You may need to describe a set that includes letters, for example, from “b” to “o”. Instead of writing [bcdefghijklmno], you can use the range mechanism and write [b-o]. Thus, the pattern x[0-8A-F][0-8A-F] finds a match in the string “xA14” (the substring “xA1”), but not in “xb17”, because the lowercase “b” falls outside both ranges.
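Here is a small sketch of ranges, assuming Python's re module. It searches for the pattern anywhere in the string, which is why “xA14” counts as a hit even though it is four characters long. The test string “abcxyzo” is invented for the example.

```python
import re

hex_like = re.compile(r"x[0-8A-F][0-8A-F]")

hit = hex_like.search("xA14")    # matches the substring "xA1"
miss = hex_like.search("xb17")   # None: lowercase "b" is not in 0-8 or A-F

# [b-o] is shorthand for [bcdefghijklmno].
in_range = re.findall(r"[b-o]", "abcxyzo")

print(hit.group(0), miss, in_range)
```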

Quantifiers (the number of iterations)

Let’s return to our example. What if the “laugh” interjection has more than one vowel between the letters “h”, for example, “Haaaaaa”? Our previous regexp cannot handle this. This is where quantifiers come in.

 

Quantifier   Number of iterations     Example      Suitable strings
{n}          Exactly n times          Ha{3}ha      Haaaha
{m,n}        From m to n inclusive    Ha{2,4}ha    Haaha, Haaaha, Haaaaha
{m,}         At least m               Ha{2,}ha     Haaha, Haaaha, Haaaaha and so on
{,n}         Not more than n          Ha{,3}ha     Hha, Haha, Haaha, Haaaha

Not every engine accepts the {,n} form; {0,n} is the portable spelling.

 

Note that a quantifier applies only to the symbol immediately before it.
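The counted quantifiers can be checked with a short sketch, assuming Python's re module. The candidate strings are illustrative, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

candidates = ["Haha", "Haaha", "Haaaha", "Haaaaha", "Haaaaaha"]

# {n}: exactly n repetitions of the preceding symbol.
exactly_three = [s for s in candidates if re.fullmatch(r"Ha{3}ha", s)]

# {m,n}: from m to n repetitions, inclusive.
two_to_four = [s for s in candidates if re.fullmatch(r"Ha{2,4}ha", s)]

# {m,}: at least m repetitions.
at_least_two = [s for s in candidates if re.fullmatch(r"Ha{2,}ha", s)]

# The quantifier covers only the symbol right before it:
# "ha{2}" means "h" followed by "aa", not "ha" twice.
only_last_symbol = (
    re.fullmatch(r"ha{2}", "haa") is not None,
    re.fullmatch(r"ha{2}", "haha") is not None,
)

print(exactly_three, two_to_four, at_least_two, only_last_symbol)
```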

Some frequently used quantifiers have their own shorthand notation:

Quantifier   Analogue   Value
?            {0,1}      0 or 1 iteration
*            {0,}       0 or more
+            {1,}       1 or more

With the help of quantifiers, we can improve our interjection pattern to [Hh][aoei]+h[aoei]*, and it will then match such strings as “Haaha”, “heeeeeh”, and “Hihii”.
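A quick sketch of the shorthand quantifiers applied to the laugh pattern [Hh][aoei]+h[aoei]*, assuming Python's re module. The extra sample “Hh” and the colou?r pattern are illustrative additions, not part of the original example.

```python
import re

laugh = re.compile(r"[Hh][aoei]+h[aoei]*")

samples = ["Haaha", "heeeeeh", "Hihii", "Hh"]
matched = [s for s in samples if laugh.fullmatch(s)]

# "?" makes the preceding symbol optional.
spellings = [s for s in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", s)]

print(matched)    # "Hh" fails: "+" requires at least one vowel
print(spellings)
```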

 

Lazy quantification

Suppose our task is to find all the HTML tags in a piece of text:

 

<p><b>WriteAbout</b> is my <i>favourite</i> website on programming!</p>

 

The obvious solution <.*> will not work here: it finds the whole string, because the string starts and ends with a paragraph tag. In other words, the following will be treated as the tag content:

 

p><b>WriteAbout</b> is my <i>favourite</i> website on programming!</p

 

This happens because, by default, a quantifier is greedy: it tries to return the longest possible string that satisfies the condition. There are two ways to solve this. The first is to use the expression <[^>]*>, which forbids a right angle bracket inside the tag content. The second is to declare the quantifier lazy instead of greedy. You do this by adding the ? symbol to the right of the quantifier. So, to search for all tags, change the expression to <.*?>.
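The difference between greedy and lazy matching can be seen in a short sketch, assuming Python's re module and the sample HTML line from this section.

```python
import re

html = "<p><b>WriteAbout</b> is my <i>favourite</i> website on programming!</p>"

# Greedy: ".*" grabs as much as possible, so there is a single match
# running from the first "<" to the last ">".
greedy = re.findall(r"<.*>", html)

# Lazy: ".*?" stops at the first ">" it can.
lazy = re.findall(r"<.*?>", html)

# Alternative fix: forbid ">" inside the tag.
negated_set = re.findall(r"<[^>]*>", html)

print(greedy)
print(lazy)
print(negated_set)
```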

Possessive quantification

Quite often, to speed up the search (especially when the string does not match the regular expression), you can forbid the engine from going back to earlier steps to look for other ways of matching the rest of the regexp. This is called possessive quantification. A quantifier is made possessive by adding a + to its right. Another use of possessive quantification is eliminating unwanted matches. For example, a++a can never match “aaa”: a++ consumes every “a” and refuses to give one back for the final a, whereas the ordinary a+a matches the whole string. Note also that matches never overlap: the pattern ab*+a in the string “ababa” matches the first three characters but not the third to the fifth, since the “a” in the third position has already been used by the first match.

Bracket groups

Our “laugh” pattern also has to account for the letter “h” occurring more than once, for example, “Hahahahaaaaahoo”. The interjection may even end with the letter “h”. It seems we need to apply a quantifier to the group [aoie]h, but if we just write [aoie]h+, then the quantifier + will refer only to the character “h” and not to the whole expression. To solve this problem, the expression must be enclosed in parentheses: ([aoie]h)+.

 

Thus, our expression turns into [Hh]([aoie]h?)+: first comes an uppercase or lowercase “h”, followed by an arbitrary non-zero number of vowels, optionally interspersed with single lowercase “h” letters. However, this expression solves the problem only partially, because strings such as “hihaheh” also match it. Clearly, the set of all vowels may be used only once, and after that we need to somehow rely on the result of the first match. You may be wondering how to do this.
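The effect of parentheses on a quantifier can be sketched like this, assuming Python's re module. The patterns are written in their intended form ([aoie]h+ versus ([aoie]h)+), and the short test word “Hahah” is an illustrative addition.

```python
import re

# Without parentheses "+" repeats only the "h".
ungrouped = re.fullmatch(r"[Hh][aoie]h+", "Hahah") is not None    # False

# With parentheses "+" repeats the whole "vowel + h" pair.
grouped = re.fullmatch(r"[Hh]([aoie]h)+", "Hahah") is not None    # True

# The pattern from the text: vowels with optional "h" between them.
laugh = re.compile(r"[Hh]([aoie]h?)+")
long_laugh = laugh.fullmatch("Hahahahaaaaahoo") is not None       # True
mixed_vowels = laugh.fullmatch("hihaheh") is not None             # True, which is unwanted

print(ungrouped, grouped, long_laugh, mixed_vowels)
```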

 

Remembering the group search result (backreferences)

It turns out that the match for each bracket group is stored in a separate memory cell, which can be accessed in the subsequent parts of the regular expression. Returning to the task of finding HTML tags on a page, we may need not only to find the tags but also to find out their names. The regular expression <(.*?)> will help us with this.

 

<p><b>WriteAbout</b> is my <i>favourite</i> website on programming!</p>

 

Matches for the whole regular expression: “<p>”, “<b>”, “</b>”, “<i>”, “</i>”, “</p>”.

 

Matches for the first group: “p”, “b”, “/b”, “i”, “/i”, “/p”.

A group's match can be referenced with the expression \n, where n is a digit from 1 to 9. For example, the expression (\w)(\w)\1\2 matches the strings “aaaa” and “abab”, but does not match “aabb”.
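Both the whole-match and group results, as well as the \1 \2 references, can be reproduced with a small sketch. It assumes Python's re module, where group(0) is the whole match and group(1) is the first bracket group. The HTML line is the sample used earlier in the article.

```python
import re

html = "<p><b>WriteAbout</b> is my <i>favourite</i> website on programming!</p>"
tag = re.compile(r"<(.*?)>")

# group(0) is the whole match, group(1) is what the first bracket group captured.
whole_matches = [m.group(0) for m in tag.finditer(html)]
tag_names = [m.group(1) for m in tag.finditer(html)]

# \1 and \2 must repeat exactly what groups 1 and 2 matched.
repeated_pair = re.compile(r"(\w)(\w)\1\2")
results = {s: repeated_pair.fullmatch(s) is not None for s in ["aaaa", "abab", "aabb"]}

print(whole_matches)
print(tag_names)
print(results)
```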

 

If an expression is bracketed only so that a quantifier can be applied to it, add ?: right after the opening bracket; such a group is not remembered. For example: (?:[abcd]+\w).

 

Using this mechanism, we can rewrite our expression in the form [Hh]([aoie])h?(?:\1h?)*.
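A minimal sketch of the rewritten laugh pattern [Hh]([aoie])h?(?:\1h?)*, assuming Python's re module. The test words are illustrative. The point is that the first vowel is captured once and every later vowel must repeat it.

```python
import re

laugh = re.compile(r"[Hh]([aoie])h?(?:\1h?)*")

same_vowel = laugh.fullmatch("Hahahaaah") is not None   # True: only "a" is used
other_vowel = laugh.fullmatch("Hehehe") is not None     # True: only "e" is used
mixed_vowels = laugh.fullmatch("hihaheh") is not None   # False: "a" differs from the captured "i"

# (?:...) groups without capturing, so the pattern has a single numbered group.
captured = laugh.fullmatch("Hehehe").groups()

print(same_vowel, other_vowel, mixed_vowels, captured)
```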

 


 

Alternation

To check whether a string satisfies at least one of several patterns, you can use an analogue of the Boolean OR operator, written as “|”. Thus, both of the strings “Anna” and “Loneliness” match the pattern Anna|Loneliness. Alternation is especially convenient inside bracket groups. For example, (?:a|b|c|d) is fully equivalent to [abcd]. In this case, the second option is preferable for both performance and readability.

With the alternation operator, we can teach our interjection regexp to recognize laughter such as “Ahahaah”, the only type of laugh that starts with a vowel: [Hh]([aoie])h?(?:\1h?)*|[Aa]h?(?:ah?)+.
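The alternation operator can be tried with the following sketch, assuming Python's re module. It uses the final laugh pattern in its intended form, [Hh]([aoie])h?(?:\1h?)*|[Aa]h?(?:ah?)+, and the test strings are illustrative.

```python
import re

# "|" means "either the pattern on the left or the one on the right".
names = re.compile(r"Anna|Loneliness")
name_hits = [s for s in ["Anna", "Loneliness", "Bob"] if names.fullmatch(s)]

# A group of single-symbol alternatives behaves like a set.
as_group = re.findall(r"(?:a|b|c|d)", "abcdef")
as_set = re.findall(r"[abcd]", "abcdef")

laugh = re.compile(r"[Hh]([aoie])h?(?:\1h?)*|[Aa]h?(?:ah?)+")
laugh_hits = [s for s in ["Ahahaah", "Hohoho", "hihaheh"] if laugh.fullmatch(s)]

print(name_hits, as_group == as_set, laugh_hits)
```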

Useful services

You can practise or test your regexp against any text, without writing code, on platforms such as RegExr, Regexpal and Regex101. The last of these also gives a brief explanation of how a particular regexp works.

To understand how a particular regexp works, you can use the Regexper service, which builds easy-to-understand diagrams from regular expressions.

RegExp Builder is a visual JavaScript function builder for working with regular expressions.

Practice tasks

Find time

Time has the format hours:minutes. Both hours and minutes consist of two digits, for example: 09:00. Write a regular expression that finds the time in the string “Breakfast at 09:00”. Please note that “37:98” is not a valid time.

Java[^script]

Will the regexp Java[^script] find anything in the string “Java”? What about the string “JavaScript”?

Colour

Create a regexp that finds an HTML colour declared as #ABCDEF, that is, a # followed by 6 hexadecimal digits.

Parse an arithmetic expression

An arithmetic expression consists of two numbers and an operation between them. For example:

  • 1 + 2

  • 1.2 *3.4

  • -3/ -6

  • -2-2

 

The list of operations: “+”, “-”, “*” and “/”.

 

There can also be spaces around the operator and the numbers.

Write a regexp that finds the whole arithmetic expression and both of its operands.

 


 

 

We wish you good luck and happy coding!

 

