Regex Positive Lookahead
When search for pattern matching, it is possible to have overlapping characters that could be a potential match. For example, in the
Pattern
Search text
A regex like this /onion/gi
will only result in two matches:
We could tell there is an overlapping occurrence missing from these matches, to match such occurences, we need to ask regex to look ahead of each character before moving on to the next character, similarly to the brute force approach from the next section. This technique is called Positive Lookahead in regex, and the expression is (?=...)
. It asserts that the given subpattern can be matched here, without consuming characters. Now we update the regex to be /(?=onion)/
, we will be getting three matches including the overlapping one.
Brute force pattern matching
The brute force approach is to examine whether the following sequence matches the pattern characters for each character in the search text.
Knuth-Morris-Pratt text search algorithm
The above brute force algorithm in the worst case scenario, will run a comparison as long as the pattern string for every character in the search text, which has a time complexity of O(N * M)
where N
is the length of the search text, and M
is the length of the pattern. KMP search algorithm elegantly utilizes the previous matching result and never moves the search cursor back and also skipped all the previous matched character for the next comparison. The essense here is to take advantage of the character sequence that are the prefix of the pattern also being the suffix of the pattern. Because of this, we could skip the comparison of the substring of the pattern that are both prefix and suffix of it, this way, given M
<< N
, for text matching where overlapping match may happen often, the times we need to reset the pattern search cursor is greatly reduced.
To have the information of the character sequence that are both prefix and suffix of the pattern, we need to preprocess the pattern string so that we know the longest prefix that is also the suffix for each character. For now, let's assume such sequence doesn't include the entire sequence itself, and we will come back to it in a moment with an example to explain why we shouldn't consider the entire sequence itself as the longest prefix that is also the suffix. Sometimes people call such sequence - LPS
, which stands for the longest proper prefix that is also the suffix in a given sequence.
For example:
To preprocess the pattern onions
,
- We first test for
o
, to see if it is both prefix and suffix, based on our assumption above, the entire sequence ofo
is not anLPS
. so we mark it as0
; - We test for
on
, to see what's the LPS in there.o
is the prefix, andn
is the suffix, they are not the same, so there isn't anLPS
in them, so we mark it as0
; - We test for
oni
, possible prefixes areo
,on
, and possible suffixes arei
,ni
, and none of them are the same, so there isn't anLPS
in them, so we mark it as0
; - We test for
onio
, possible prefixes areo
,on
,oni
, and possible suffixes areo
,io
,nio
, ando
happens to be the only substring that is both a prefix and suffix, because its length is 1, so we mark it as1
; - We test for
onion
, possible prefixes areo
,on
,oni
,onio
, and possible suffixes aren
,on
,ion
,nion
, and there are two substrings that are both prefix and suffix:o
andon
, we are looking for the longest substring,on
wins, and its length is 2, so we marked it as2
; - Finally, we test for
onions
, possible prefixes areo
,on
,oni
,onio
,onion
and possible suffixes ares
,ns
,ons
,ions
,nions
, and there isn't any substrings are the same. so we mark it as0
.
Now we have this preprocessed array that represent the longest prefix that is also the suffix for each sequence for this pattern:
Pattern
longest LPS length:
LPS
Now, given a text onionions
to search for the pattern, we got the matched
Steppers
We create two pointers, one for the pattern, and one for the search text:
Pattern
Search text
As we step over the search, when there is a match, both pointers move forward. When there is a mismatch, the search text pointer never resets, whereas the pattern pointer will look up the longest prefix that is also the suffix just before the mismatched character. Then, it will backtrack the pattern pointer to the LPS
index of the previous matching substring sequence to skip all characters in such sequence that we know they have a match from the previous step. The intuition of how this logic works thanks to the fact that we know when the prefix of a sequence is exactly the same as its suffix, we could overlapping the matching substring of the sequence to skip the comparison.
With a little help from this stepper, we could aparently observe that if we considered the entire sequence of the pattern substring as the longest proper prefix that was also the suffix, the mismatched character would have been mistakenly counted as part of the matching character before we reset the pattern pointer back to its backtracked index (LPS
index of the previous matching substring sequence).
Visualization
Once we have the array that indicates the starting index of each match, we could build the node sequence that represent the plain text and the highlighted text:
const nodes = []; // i to track the start of the plain text let i = 0; // j to track the start of the highlight, in other words, the end of plain text let j = 0; while (i < children.length && j < matched.length) { const matchingStartingIndex = matched[j]; const matchingEndingIndexExclusive = matched[j] + match.trim().length; // plain text range [i, j) const plainText = children.substring(i, matchingStartingIndex); // highlight text range [m[j], m[j] + len(pattern)) const highlight = children.substring(matchingStartingIndex, matchingEndingIndexExclusive); nodes.push(plainText); nodes.push(<span class={style.textHighlight}>{highlight}</span>); // move i to the letter after j's matching pattern i = matchingEndingIndexExclusive; // moveup j to read the next matching starting index j++; } // tailing plain text if(i < children.length) { nodes.push(children.substring(i)); }