This is a space that holds a collection of my personal work and ideas

Efficient Text Highlighter

Posted on 10/02/2021

When I built the search blog post feature of this site, there is a nice visual enhancement called text highlighter as part of the search component. I implemented the same feature with three different approaches, regex positive lookahead, brute force pattern matching, and finally a more efficient approach with Knuth-Morris-Pratt (KMP) text search algorithm. In this post, I will show all three approaches to implement this feature with some interactive visualization to hopefully help a few people see the joy of programming while applying some common algorithms to a practical use case.

Regex Positive Lookahead

When search for pattern matching, it is possible to have overlapping characters that could be a potential match. For example, in the

Pattern

Search text

A regex like this /onion/gi will only result in two matches:

We could tell there is an overlapping occurrence missing from these matches, to match such occurences, we need to ask regex to look ahead of each character before moving on to the next character, similarly to the brute force approach from the next section. This technique is called Positive Lookahead in regex, and the expression is (?=...). It asserts that the given subpattern can be matched here, without consuming characters. Now we update the regex to be /(?=onion)/, we will be getting three matches including the overlapping one.

Brute force pattern matching

The brute force approach is to examine whether the following sequence matches the pattern characters for each character in the search text.

Knuth-Morris-Pratt text search algorithm

The above brute force algorithm in the worst case scenario, will run a comparison as long as the pattern string for every character in the search text, which has a time complexity of O(N * M) where N is the length of the search text, and M is the length of the pattern. KMP search algorithm elegantly utilizes the previous matching result and never moves the search cursor back and also skipped all the previous matched character for the next comparison. The essense here is to take advantage of the character sequence that are the prefix of the pattern also being the suffix of the pattern. Because of this, we could skip the comparison of the substring of the pattern that are both prefix and suffix of it, this way, given M << N, for text matching where overlapping match may happen often, the times we need to reset the pattern search cursor is greatly reduced.

To have the information of the character sequence that are both prefix and suffix of the pattern, we need to preprocess the pattern string so that we know the longest prefix that is also the suffix for each character. For now, let's assume such sequence doesn't include the entire sequence itself, and we will come back to it in a moment with an example to explain why we shouldn't consider the entire sequence itself as the longest prefix that is also the suffix. Sometimes people call such sequence - LPS, which stands for the longest proper prefix that is also the suffix in a given sequence.

For example:

To preprocess the pattern onions,

  1. We first test for o, to see if it is both prefix and suffix, based on our assumption above, the entire sequence of o is not an LPS. so we mark it as 0;
  2. We test for on, to see what's the LPS in there. o is the prefix, and n is the suffix, they are not the same, so there isn't an LPS in them, so we mark it as 0;
  3. We test for oni, possible prefixes are o, on, and possible suffixes are i, ni, and none of them are the same, so there isn't an LPS in them, so we mark it as 0;
  4. We test for onio, possible prefixes are o, on, oni, and possible suffixes are o, io, nio, and o happens to be the only substring that is both a prefix and suffix, because its length is 1, so we mark it as 1;
  5. We test for onion, possible prefixes are o, on, oni, onio, and possible suffixes are n, on, ion, nion, and there are two substrings that are both prefix and suffix: o and on, we are looking for the longest substring, on wins, and its length is 2, so we marked it as 2;
  6. Finally, we test for onions, possible prefixes are o, on, oni, onio, onion and possible suffixes are s, ns, ons, ions, nions, and there isn't any substrings are the same. so we mark it as 0.

Now we have this preprocessed array that represent the longest prefix that is also the suffix for each sequence for this pattern:

Pattern

longest LPS length: 

LPS

Now, given a text onionions to search for the pattern, we got the matched

Steppers

We create two pointers, one for the pattern, and one for the search text:

Pattern

Search text

As we step over the search, when there is a match, both pointers move forward. When there is a mismatch, the search text pointer never resets, whereas the pattern pointer will look up the longest prefix that is also the suffix just before the mismatched character. Then, it will backtrack the pattern pointer to the LPS index of the previous matching substring sequence to skip all characters in such sequence that we know they have a match from the previous step. The intuition of how this logic works thanks to the fact that we know when the prefix of a sequence is exactly the same as its suffix, we could overlapping the matching substring of the sequence to skip the comparison.

With a little help from this stepper, we could aparently observe that if we considered the entire sequence of the pattern substring as the longest proper prefix that was also the suffix, the mismatched character would have been mistakenly counted as part of the matching character before we reset the pattern pointer back to its backtracked index (LPS index of the previous matching substring sequence).

Visualization

Once we have the array that indicates the starting index of each match, we could build the node sequence that represent the plain text and the highlighted text:

const nodes = [];
// i to track the start of the plain text
let i = 0;
// j to track the start of the highlight, in other words, the end of plain text
let j = 0;

while (i < children.length && j < matched.length) {
    const matchingStartingIndex = matched[j];
    const matchingEndingIndexExclusive = matched[j] + match.trim().length;
    // plain text range [i, j)
    const plainText = children.substring(i, matchingStartingIndex);
    // highlight text range [m[j], m[j] + len(pattern))
    const highlight = children.substring(matchingStartingIndex, matchingEndingIndexExclusive);
    nodes.push(plainText);
    nodes.push(<span class={style.textHighlight}>{highlight}</span>);
    // move i to the letter after j's matching pattern
    i = matchingEndingIndexExclusive;
    // moveup j to read the next matching starting index
    j++;
}

// tailing plain text
if(i < children.length) {
    nodes.push(children.substring(i));
}

Read on GitHub