Rank sql(ite) text search results

Fri, 05. Aug 2011

Categories: en development Tags: iOS iPhone like Search SQL SQLite

When searching for text snippets in sql databases you might want to rank the results according to „how good did it match“. And: the ranking shouldn’t make the query slower.

Let’s take a simple example using the LIKE operator. (I know, FTS does a better job, but let’s stick to like for now).

Assume the search expression ‘a bc de’ and a table ‘my_table’ with text columns ‘title’ and ‘description’.

We want to find all rows with ‘title’ matching all three blank-separated parts of the search term:

1SELECT rowid, title
2FROM my_table
3WHERE (title LIKE '%a%' AND title LIKE '%bc%' AND title LIKE '%de%')

To sort them, we apply a bonus for parts matching the column start:

1SELECT rowid, title,
2    -- column start bonus
3    LIKE('a%', title) +
4    LIKE('bc%', title) +
5    LIKE('de%', title) +
60 AS bonus
7FROM my_table
8WHERE ((title LIKE '%a%') AND (title LIKE '%bc%') AND (title LIKE '%de%'))
9ORDER BY bonus DESC, title ASC, rowid ASC

Next, we’d like to add a (somewhat smaller) bonus for word-starts:

 1SELECT rowid, title,
 2    -- column start bonus
 3    LIKE('a%', title) * 2 +
 4    LIKE('bc%', title) * 2 +
 5    LIKE('de%', title) * 2 +
 6    -- word start bonus
 7    LIKE('% a%', title) * 1 +
 8    LIKE('% bc%', title) * 1 +
 9    LIKE('% de%', title) * 1 +
100 AS bonus
11FROM my_table
12WHERE ((title LIKE '%a%') AND (title LIKE '%bc%') AND (title LIKE '%de%'))
13ORDER BY bonus DESC, title ASC, rowid ASC

Rows matching the three terms in order get an even bigger bonus:

 1SELECT rowid, title,
 2    -- correct order bonus
 3    LIKE('%a%bc%de%', title) * 5 * 3 +
 4    -- column start bonus
 5    LIKE('a%', title) * 2 +
 6    LIKE('bc%', title) * 2 +
 7    LIKE('de%', title) * 2 +
 8    -- word start bonus
 9    LIKE('% a%', title) * 1 +
10    LIKE('% bc%', title) * 1 +
11    LIKE('% de%', title) * 1 +
120 AS bonus
13FROM my_table
14WHERE ((title LIKE '%a%') AND (title LIKE '%bc%') AND (title LIKE '%de%'))
15ORDER BY bonus DESC, title ASC, rowid ASC

And finally adding the match on ‘description’ secondary:

 1SELECT rowid, title, description,
 2    -- title is primary match:
 3    -- correct order bonus
 4    LIKE('%a%bc%de%', title) * 50 * 3 +
 5    -- column start bonus
 6    LIKE('a%', title) * 20 +
 7    LIKE('bc%', title) * 20 +
 8    LIKE('de%', title) * 20 +
 9    -- word start bonus
10    LIKE('% a%', title) * 10 +
11    LIKE('% bc%', title) * 10 +
12    LIKE('% de%', title) * 10 +
13    -- description is secondary match:
14    -- correct order bonus
15    LIKE('%a%bc%de%', description) * 5 * 3 +
16    -- column start bonus
17    LIKE('a%', description) * 2 +
18    LIKE('bc%', description) * 2 +
19    LIKE('de%', description) * 2 +
20    -- word start bonus
21    LIKE('% a%', description) * 1 +
22    LIKE('% bc%', description) * 1 +
23    LIKE('% de%', description) * 1 +
240 AS bonus
25FROM my_table
26WHERE ((title       LIKE '%a%') AND (title       LIKE '%bc%') AND (title       LIKE '%de%'))
27OR    ((description LIKE '%a%') AND (description LIKE '%bc%') AND (description LIKE '%de%'))
28ORDER BY bonus DESC, title ASC, description ASC, rowid ASC

You get the idea.

Funny thing is โ€“ the whole ranking logic doesn’t hit performance (at least for small texts in the two columns)!

So, key is:

  1. scan the table only once to find match candidates using the LIKE operator,
  2. use the LIKE function(!) plus weighting-factors to compute a bonus for each hit,
  3. evtl. add secondary matching columns.

P.S.: This post was inspired by a chat with Deesa on the way home riding False Creek Ferry.