There were a few people curious in the ST forums on how Sublime Text’s fuzzy matching works. Note that this is a re-post of my 2013 blog post which is no longer available.
In reality, I don’t know either. There were some interesting solutions given, but I haven’t researched their methods or workability.
I needed to implement this for a small project I’m working on. I went with a solution that worked decently (I think I saw this within a JS fuzzy searching library). I’m not claiming it to be the best or the fastest (far from it, actually).
However, just in case you need to implement something like this, this might come in handy.
The FuzzyMatcher Class
class FuzzyMatcher(): def __init__(self): self.pattern = '' def setPattern(self, pattern): self.pattern = '.*?'.join(map(re.escape, list(pattern))) def score(self, string): match = re.search(self.pattern, string) if match is None: return 0 else: return 100.0 / ((1 + match.start()) * (match.end() - match.start() + 1))
There are two methods, one sets the pattern to match against, and the other is used to calculate the score of the candidates.
You would use them as:
fuzzyMatcher = FuzzyMatcher() fuzzyMatcher.setPattern('aad') # Set a pattern to match aganinst score1 = fuzzyMatcher.score('This is string one') # Score for this string against the pattern score2 = fuzzyMatcher.score('This is string two') # And so on...
First, let’s understand the principles we’re working with:
- The pattern would be few characters
- A match would mean that the characters in the pattern appear in the same order as in the matched string
- A match found near the beginning of a string is scored more than a match found near the end
- A match is scored more if the characters in the patterns are closer to each other, while the score is lower if they are more spread out
- We are not assigning more weights to certain characters than the other
That’s the criterias we’re going with. As you can see, we can construct a regex to see if a pattern appears in a string.
If the pattern is
aad, our regex would look like
a.*?a.*?d. We’re looking for the shortest substring that has the characters
aad in that order, with arbitrary characters in between them.
We can easily figure out the position of the match, as well as the length of the match. If the length of the match is larger, it automatically implies that the characters are more spread out.
Explanation of the Python Program
Let’s look at the
setPattern function that creates the regex. We have a separate function to create the regex for clarity.
def setPattern(self, pattern): self.pattern = '.*?'.join(map(re.escape, list(pattern)))
We first convert the string into a character array using
list(pattern), and then map
re.escape to the characters. Some characters have special meaning in regex, but we want the user’s input to be treated as a literal.
Then, we join the escaped characters using
.*?. This would turn something like
Now, let’s look at the
def score(self, string): match = re.search(self.pattern, string) if match is None: return 0 else: return 100.0 / ((1 + match.start()) * (match.end() - match.start() + 1))
Here, we’re doing a regex search for the input string, against the pattern we previously set. If we have no match, return 0 as the score.
- The score is inversely proportional to the start of the match. We add a 1 since strings are zero-indexed, and we don’t want a division by 0.
- The score is inversely proportional to the length of the match.
- We scale the results by multiplying it by
100.0. This value is arbitrary. Since the string lengths I dealt with were small, this worked. You may have to adjust this.
That was a simple (possibly inefficient) solution that works well. Of course, it can be optimized by pre-compiling the regex and stuff like that. I left it out just for clarity.
And this is in no way restricted to Python. Any language with a decent regex implementation would work.
Also, the score calculation can be improved by assigning more weight to characters appearing closer together than the match being farther away in a string. I’m researching on that as we speak.