Pattern Matching with Regular Expressions in C++ LG #27

"Linux Gazette...making Linux just a little more fun!"

Pattern Matching with Regular Expressions in C++

By

Regular expressions are the most flexible way to search for text patterns. Since over twenty years they were used in several Unix tools and utilities such as grep, sed, or awk. This article guides you to implement this Unix base search technique in C++.

Everybody who has worked with a Unix system knows the useful regular expressions. You find them in grep, awk, or emacs for example and they are the most flexible way to specify search patterns. Everybody who has ever used a tool like grep wants never miss its flexibility--or is there anybody who wants?

The great usability of search tools such as grep is a result of regular expressions. Remove these pattern matching technique from grep, substitute it by another search algorithm, e.g., Boyer-Moore, and the resulting tool is a toy--a fast toy, but a toy! (Do you know a DOS tool called find which is a result like this...)

But joking apart. Pattern matching with regular expressions is the basis of many search algorithms in many tools under Unix and so under Linux, too. The power of this search technique is undisputed.

The target of this article shall be the implementation of regular expressions in a reusable C++ class. This article shall be your guide and introduction to the fascinating world of "pattern matching".

Principles

First of all a few principles about pattern matching with regular expressions.

To specify a pattern you have to use a computer processable notation. This notation, or language, is in our case the regular expression syntax.

The regular expression language consists of symbols and operators. The symbols are simply the characters of the pattern. To describe the relations between this symbols the following operators are used (listed in descending priority):

Closure: A string of equal symbols with variable length or an optional symbol. (This is the true heart of pattern matching.)
Concatenation: If there are two symbols in the pattern successive, the corresponding characters in the text will have to be successive, too.
Alternation: One of the alternative symbols must occur in the text in which the pattern is searched.

In addition to these left associative operators brackets can be used to modify the operation priorities.

The closure operators in most regular expression implementations are:

the asterisk (*) which means a repetition of zero or more occurrences of a symbol
the plus (+) which means a repetition of one or more occurrences of a symbol the question mark
(?) which means a optional occurrence of a symbol

Examples: A* matches empty string, "A", "AA", "AAA", etc. A+ matches "A", "AA", "AAA", etc. A? matches empty string or "A".

To specify a concatenation no special operator character must be used. A string consisting of each other following symbols are a concatenation. ABC matches "ABC" for example.

An alternation is described with a "|" between the alternative regular expressions. A|B matches either "A" or "B".

In extended regular expression implementations a few other operators used to describe complex patterns more efficient. But this article shall be only a little introduction into the syntactical possibilities and not a detailed reference.

The Automaton

To search for a pattern which is specified by a regular expression you cannot compare each character of the pattern and the text. Caused by the closure and the alternation there are so many possible ways in complex patterns that they cannot all proved by a "conventional" algorithm. A more efficient technique must be applied. The best way is to build and simulate an automaton for the pattern. To describe a search pattern specified by a regular expression you can use non-deterministic or deterministic finite state automata.

An automaton can assume several states. It can pass from one state into an other depending on a specific input event which is in our case the next input symbol respectively character. And here is the difference between deterministic and non-deterministic finite state automata. A deterministic automaton has only one next state for a specific input symbol. A non-deterministic automaton can have several next states for the same input symbol.

Both kinds of automata can be used for every imaginable regular expression. The two types of automata have there own advantages and disadvantages. For everybody who wants to know more about these automata types in context with regular expressions the book /1/ can be recommended. In our implementation we will use non-deterministic automata. It's the most used strategy to implement a pattern matching algorithm and it's a bit easier to construct a non-deterministic than a deterministic automaton basing on a regular expression.

Figure 1 shows a state transition diagram of a non-deterministic finite state automaton for the pattern a*(cb|c*)d. It contains all types of operations--an alternation, two closures and tree concatenated symbols. Note that the bracket which contains the alternative symbols is one symbol for the concatenation. The start state is the rectangle at the left side. The finite state is shown at the right side--rectangle with diagonal line.

Figure 1. Non-deterministic finite state automaton for pattern a(cd|c)d.

This little pattern and its automaton demonstrates the problems of pattern matching very well. At state No. 7 it is not sure which state will be the next for a input character "c". The states 4 and 9 are possible ways. The automaton has to find out--to guess--the right way.

If the text string "aaccd" shall be matched for example the automaton will start at state No. 0--the start state. The next state, No. 2, is a zero state. This means that there is no character which must match to enter this state.

The first input symbol is a "a". The automaton goes to state No. 1 which is the only way. After matching the "a" the next input character will be read and the next state is No. 2 again. For the next input character which is also a "a". the last two steps are repeated. After this the only possible way is to go to state No. 3 and 7.

Here we are in the state which may cause problems. The next input is a "c". Here we see the true power of the automaton. It can guess the right way which will be state No. 9 and not No. 4. This is the soul of a non-deterministic strategy: the possible solutions are found out. They are not described by an algorithm which works "step by step".

In the real world of programming we have to prove all possible ways, of course. But more about the practical side a bit later.

After the decision pro No. 9 the automaton goes over 9, 8 (1st c matches), 9, 8 (2nd c matches), 10 and 11 (d matches) to state No. 12. The end state was reached and the result is that the text "aaccd" matches to pattern "a*(cb|c*)d".

Design

A regular expression implementation can always be split into a compiler, which generates a automaton from the given pattern, and an interpreter or simulator, which simulates the automaton and searches for the pattern.

The heart of the compiler is the parser which bases on the following context free grammar:

list    ::=           element | element "|" list
element ::=     ( "(" list ")" | v ) ["*"] [element]

This EBNF diagram (=Extended Backus-Naur Form) describes the (reduced) regular expression grammar. It is not possible to explain context free grammars or the EBNF in this article. If you are not familiar with these topics I can recommend /1/ and /3/ for example.

In our sample implementation we will only implement the basic operators | and *. The other closure operators + and ? we will not implement. But with the help of Figure 2 it will no problem for you to implement it, too.

The complete regular expression functionality will be encapsulated in the object class RegExpr. It contains the compiler and the interpreter/simulator. The user is only confronted with the two constructors, one overloaded operator and four methods for compiling, searching, and error handling.

The pattern can be specified by calling the constructor RegExpr(const char *pattern), by using the assign operator = or the compile(const char *pattern) method. If re is an object of RegExpr the following lines will set the pattern "a*(cb|c*)d":

RegExpr re("a*(cb|c*)d");
or RegExpr re; re = "a*(cb|c*)d"; 
or RegExpr re; re.compile("a*(cb|c*)d");

To search in a text buffer or string you can use the methods search() and searchLen(). The difference between these methods is that searchLen() expects a reference to a unsigned integer variable as an additional parameter. In this variable the length of the matching substring is return. Note that the closures, but also the alternation, cause that the length of the found substring can vary, e.g., a* matches "", "a", "aa", etc.

In tools, such as grep, you won't need this additional information. Here you can use search instead of searchLen(). This method is a simple inline which calls searchLen() with a "dummy" variable.

Figure 2. These are the automata for the closure implementation.

The error handling is complete exception based. If the compiler indicates a syntax error in the currently processed regular expression it will throw an exception of type xsyntax. You can catch this exception in your program and call the method getErrorPos() which returns the character position at which the error occurred. This may look like this:

try {
        re.compile("a*(cb|c*)d");
} catch(xsyntax &X) {
        cout << "error near character position "
<<
        X.getErrorPos() << endl;
}

Another error which can occur is "out of memory". This error--caused by the new operator--isn't uniform processed by current C++ compilers. gcc for example handle such an error with a program termination. Some compiles throw exceptions. The rest does absolutely nothing and waits for other errors which will definitely occur. You solve this problem in every ANSI C++ compiler by using the function set_new_errorhandler() (declared in new.h) to set a routine to handle this error. In most cases I write a little routine to throw an exception which indicates this error type and set this routine as error handler for the new operator. This is by the way an easy solution to program a portable error handling which can be used by all ANSI C++ compilers and under every operating system.

A RegExpr object contains a method called clear_after_error() to clear itself when a error occurred respectively a exception was thrown. A call of this method is necessary because an error leaves the compiler or simulator in a indefinable state which can cause fatal errors at other method calls.

The Compiler

The grammar which was previously shown in an EBNF diagram is implemented in the methods list, element and v. list and element represent the productions of the EBNF. v is a method which implements the special symbol v. This symbols means in the grammar a character which is not a metacharacter (|, *, etc.). It can also be a backslash sequence like \c where c is any character. By using the backslash the special significance of a metacharacter can be removed.

This three methods operate on a array called automaton. The array consists of struct variables which contain information of the states of the automaton. Every state entry contains the indices of the next state(s) and the character which have to match. If the state is a zero state this information will be a zero byte ("\0").

Figure 3. The parse tree of "a*|b".

Our implementation is a top down parser. It uses directly the recursive strategy of the context free grammar--every production is coded as a function. The parser splits the who pattern respectively regular expression into lower parts until it reaches a terminate symbol. Figure 3 shows the parse tree for "a*|b". First list is entered which branches into non-terminate element, terminal "|" and non- terminate list. element detects v and "*" and goes down to "a". The other list part goes directly down to "b" by passing element and v. The parse tree of our sample regular expression can be seen in Figure 4.

Figure 4. The parse tree of "a(cb|c)d"

Every non-terminate symbol represents a function call in our parser. The top down strategy is the easiest way to implement a parser from a context free grammar respectively EBNF diagram. You see the most important thing here is an error free grammar specification!

Inside this methods the states of the automaton are generated. For every character of the regular expression a state will be created. The only exception is the operator | which will be compiled to two states.

The methods return to its caller always the entry point (index of state) to the generated part automaton. The end of the part automaton is always the last current state which index is stored in the attribute state of RegExpr. You see the several part automata in Figure 5.

Figure 5. A Several Part Automata

The red numbers indicate the new generated states for the operation. The succession of the numbers is defined by the parser which reads a string from left to right. The returned entry point or state is marked, too. You realize that it is very important to tell the calling function where the entry point is because it isn't always the one with the lowest index!

The states--and so the whole automaton--are generated in this way step by step by a top down parser. It isn't very helpful for you to write more about this automaton creation. It's better you type in the sources, compile it and watch the algorithm in action by using a debugger.

A little annotation to the automaton. It is implemented by the static array automaton in RegExpr. This is definitely a poor rudimentary implementation. In a practical and useful program you have to use a better strategy, e.g. an aggregate object class in RegExpr which works with a dynamic array.

Note that this implementation of the automaton can cause fatal errors if the pattern is to large! It has no checking function which breaks pattern compilation if there are no more states.

But it is not difficult to implement the automaton as class which administrates it in a dynamic array or a linked list.

The Automaton Simulation

After the compilation of the pattern we can execute the generated code respectively simulate the automaton. The complete intelligence of the search algorithm is implemented in the method simulate().

It was previously hinted that the automaton guesses the right answer but this a theoretical view. A computer simulation of a non-deterministic finite state automaton must prove every possible matching way through this automaton. Sedgewick (/3/) has implemented a interesting algorithm to do this. Our algorithm shall base on this technique.

Sedgewick's system has some disadvantages for practical application. The First disadvantage is that its grammar needs a character after a closure otherwise it can't find it. But this is a problem which can be solved by a patch very soon--and our implementation has already solved this. The second problem is a bit more complex. Sedgewick's implementation quits after the first match. This means that it doesn't find the longest matching string. For example: If you search for "a*a" in "xxaaaaaxx" it will find only "a" instead of "aaaaa". Our implementation will solve this problem.

The idea that a program can guess the right way might sound ridiculous. But the heart of such a software is to prove all possible way and accept the last matching as the right. Here is a parallel proceeding the decisive key.

Every branch of the automaton will be tested and if not fitting removed. It's a bit a "trial and error" method. Every possible way will be tested parallel with the others and removed when not matching the current processed character of the search through text.

The basic element of this algorithm is a deque. A deque is a double ended queue. It's a hybrid between stack and buffer. You can push and pop data (stack) but also put (buffer). In other words: You can add data to the head and to the tail of this data structure.

This behavior is important for our algorithm. The deque is split into two parts:

top part for the current processed character of the search through text
bottom part for the next character

The next state values of zero states are stored on the top part because they implement the structure which is necessary to detect a match of the current character. The next state values of non-zero states (the_char != '\0') are put to the bottom part because they point to the next character. Between these part is a special value stored--next_char (-1). It indicates that the next character of the text shall be processed.

The main loop in simulate gets a state from this deque and tests the conditions. If the character in the_char of this state matches the current processed character in the searched through text the indexes of the next states (next1 and next2) will be put at the end of the deque. If the read state is a zero state the next state values will be put on the start of the queue. If the state is next_char (-1) the next character in the searched through text will be processed. In this case next_char is put at the end of the deque.

The loop will be terminated if the end of the text is reached or the deque is empty. The last case arises when no more matching parts are found.

As so far it sound like the version of Sedgewick but the difference is that when state becomes zero the loop won't terminate. It is accepted as a matching part and this information is stored but the loop will go ahead! It will search for possible other matches.

After the termination of the loop I returns the last matching result or--if the pattern wasn't found--the start position of the search minus one.

A Little Sample Application

To download the listings and makefile, click here.

eg.cc is a little egrep implementation. It shall demonstrate the usage and the power of RegExpr. eg reads from standard input or from a optional specified file and print every line which contains the pattern:

Usage: eg pattern [file]

RegExpr is in this (minimal) implementation not perfect of course but it will be a good basis for experiments. A few things which can be changed or implemented are:

The closures ? and +
Implementation of automaton as object with dynamic array or linked list as state administration
Character classes [...]
Metacharacter . for any character
Other operators known from sed or ed {...}
Start and end line metacharacters--^ and $

Last but not least I hope that you will have a bit fun with this implementation. If you have some suggestions, questions, or ideas please let me know.

Resources

Aho, Alfred V. / Sethi, Ravi / Ullman, Jeffrey D.: COMPILERS Principles, Techniques and Tools. Reading (Mass.): Addison Wesley 1986
Müller, Oliver: Mustererkennung mit Delphi-Suchen und Ersetzen von regulaeren Ausdruecken. mc extra (DOS International) 7/96
Sedgewick, Robert: Algorithms. Reading (Mass.): Addison Wesley 2nd Ed. 1992