There were a few people curious in the ST forums on how Sublime Text’s fuzzy matching works. Note that this is a re-post of my 2013 blog post which is no longer available.

SublimeText's fuzzy search in action

In reality, I don’t know either. There were some interesting  solutions given, but I haven’t researched their methods or workability.

I needed to implement this for a small project I’m working on. I went  with a solution that worked decently (I think I saw this within a JS  fuzzy searching library). I’m not claiming it to be the best or the  fastest (far from it, actually).

However, just in case you need to implement something like this, this might come in handy.

The FuzzyMatcher Class

class FuzzyMatcher():

    def __init__(self):
        self.pattern = ''

    def setPattern(self, pattern):
        self.pattern = '.*?'.join(map(re.escape, list(pattern)))

    def score(self, string):
        match = re.search(self.pattern, string)
        if match is None:
            return 0
        else:
            return 100.0 / ((1 + match.start()) * (match.end() - match.start() + 1))

There are two methods, one sets the pattern to match against, and the other is used to calculate the score of the candidates.

You would use them as:

fuzzyMatcher = FuzzyMatcher()
fuzzyMatcher.setPattern('aad') # Set a pattern to match aganinst
score1 = fuzzyMatcher.score('This is string one') # Score for this string against the pattern
score2 = fuzzyMatcher.score('This is string two') # And so on...

The Idea

First, let’s understand the principles we’re working with:

  1. The pattern would be few characters
  2. A match would mean that the characters in the pattern appear in the same order as in the matched string
  3. A match found near the beginning of a string is scored more than a match found near the end
  4. A match is scored more if the characters in the patterns are closer  to each other, while the score is lower if they are more spread out
  5. We are not assigning more weights to certain characters than the other

That’s the criterias we’re going with. As you can see, we can construct a regex to see if a pattern appears in a string.

If the pattern is aad, our regex would look like a.*?a.*?d. We’re looking for the shortest substring that has the characters aad in that order, with arbitrary characters in between them.

We can easily figure out the position of the match, as well as the  length of the match. If the length of the match is larger, it  automatically implies that the characters are more spread out.

Explanation of the Python Program

Let’s look at the setPattern function that creates the regex. We have a separate function to create the regex for clarity.

def setPattern(self, pattern):
    self.pattern = '.*?'.join(map(re.escape, list(pattern)))

We first convert the string into a character array using list(pattern), and then map re.escape to the characters. Some characters have special meaning in regex, but we want the user’s input to be treated as a literal.

Then, we join the escaped characters using .*?. This would turn something like aad into a.*?a.*?d.

Now, let’s look at the score function.

def score(self, string):
    match = re.search(self.pattern, string)
    if match is None:
        return 0
    else:
        return 100.0 / ((1 + match.start()) * (match.end() - match.start() + 1))

Here, we’re doing a regex search for the input string, against the  pattern we previously set. If we have no match, return 0 as the score.

Else:

  1. The score is inversely proportional to the start of the match. We  add a 1 since strings are zero-indexed, and we don’t want a division by  0.
  2. The score is inversely proportional to the length of the match.
  3. We scale the results by multiplying it by 100.0. This value is arbitrary. Since the string lengths I dealt with were small, this worked. You may have to adjust this.

Conclusion

That was a simple (possibly inefficient) solution that works well. Of  course, it can be optimized by pre-compiling the regex and stuff like  that. I left it out just for clarity.

And this is in no way restricted to Python. Any language with a decent regex implementation would work.

Also, the score calculation can be improved by assigning more weight  to characters appearing closer together than the match being farther  away in a string. I’m researching on that as we speak.