Saturday 10 August 2013

How to identify a string using regular expressions

A problem that crops up from time to time is to identify which of several possible strings matches a given target. For example, the list might be apple, banana, cherry, date, elderberry, fruit, grapefruit, ... Of course for a short list, just about anything will work fine, including just serially comparing with each of them. If the strings are an exact match, a tree or a hash table will work. It gets more complicated, though, if only a partial match is required, for example to identify which of the list of fruits is contained in the target. The general case is where each of the possible strings is an arbitrary regular expression, i.e. '.*apple.*', '.*banana.*' and so on.

There are well-known algorithms for constructing a suitable search tree, but they don't form part of any conveniently available library that I'm aware of. And they are extremely subtle and complex, especially if you want to accommodate the full power of regular expressions - which means that you'll probably get it wrong in some subtle and difficult to test way if you try to code it yourself.

On the other hand, every system has regular expressions - even C++ in the latest version. It seems fairly obvious to concatenate all the matching strings into one giant regex, separated by '|', and let the regex engine do the work. It surely implements one of the well-known algorithms and has been fully tested. What could be simpler?

Well, except that no regex system I'm aware of gives you an easy way to find out which substring you matched. It will tell you that one of the fruit names is somewhere inside the string, but not which one. It would be nice if there was some way to tag each of the alternatives and then retrieve the tag of the one that matched - but there isn't. You can retrieve the pattern that matched, or part of it, but that just gets you back to where you started.

This problem cropped up again recently for me, in this case matching URLs described by regular expressions. And this time, I thought about it long enough to find a solution, which is to say a way of getting the regex engine to tell me which substring it had matched.

Let's suppose, just to keep the examples simple, that there are fewer than 100 possible strings. As you will see, the technique can be extended to any possible number, but you do need to know an upper bound when you start.

First, append the string '#0123456789,01234567890' to the target string. '#' can be any character or longer string that you are sure will not appear in any of the possible matches.

Next, append to each of the match targets a string like '#.*?(2).*?,.*?(6).*'. The '2' and the '6' in the example should be replaced in each case by a number uniquely identifying the string in question. This is a mildly tricky regex (compared to some of them!) which functions as follows.

  • the leading '#' just ensures that this won't accidentally match anything it shouldn't.
  • .*? will match anything until it finds a '2'. The '?' says that this string should match as little as possible. Without it, the engine would try to match to the furthest '2'. If there are several digits, this would require it to back up several times during the matching process, and would be seriously time-consuming.
  • (2) will match only the digit '2', and, most importantly, will capture it as a substring which can subsequently be retrieved. We know the '2' will be present, because we put it there ourselves.
  • The following '.*?,' matches everything up to and including the following comma. 
  • The following unit repeats the process to capture the '6' from the second group of digits.
For practical use there would be more digits, each repeating the first unit from the example.

Once the string has matched, all regex systems provide a way to retrieve the captured digits, which can then be assembled to get the identifier of the matched substring. In C++, a 'smatch' structure holds the result including the matched substrings. The following code snippet shows how it is done in C++.

int identify(const string &target)
{
    static string digits("0123456789,");
    string t(target);
    for (int i=0; i<digit_count; ++i) {
        t += digits;
    }
    smatch m;                    // structure that holds parse result
    if (regex_match(t, m, master_regex)) {
        string id_str;
        for (int i=0; i<digit_count; ++i) {
            id_str += m[i+1];    // retrieve and append successive digits
        }
        return lexical_cast<int>(id_str);
    } else {
        return 0;
    }
}

Et voila!