When searching for text snippets in sql databases you might want to rank the results according to “how good did it match”. And: the ranking shouldn’t make the query slower.
Let’s take a simple example using the LIKE operator. (I know, FTS does a better job, but let’s stick to like for now).
Assume the search expression ‘a bc de’ and a table ‘my_table’ with text columns ‘title’ and ‘description’.
We want to find all rows with ‘title’ matching all three blank-separated parts of the search term:
SELECT rowid, title FROM my_table WHERE (title LIKE '%a%' AND title LIKE '%bc%' AND title LIKE '%de%')
To sort them, we apply a bonus for parts matching the column start:
SELECT rowid, title, -- column start bonus LIKE('a%', title) + LIKE('bc%', title) + LIKE('de%', title) + 0 AS bonus FROM my_table WHERE ((title LIKE '%a%') AND (title LIKE '%bc%') AND (title LIKE '%de%')) ORDER BY bonus DESC, title ASC, rowid ASC
Next, we’d like to add a (somewhat smaller) bonus for word-starts:
SELECT rowid, title, -- column start bonus LIKE('a%', title) * 2 + LIKE('bc%', title) * 2 + LIKE('de%', title) * 2 + -- word start bonus LIKE('% a%', title) * 1 + LIKE('% bc%', title) * 1 + LIKE('% de%', title) * 1 + 0 AS bonus FROM my_table WHERE ((title LIKE '%a%') AND (title LIKE '%bc%') AND (title LIKE '%de%')) ORDER BY bonus DESC, title ASC, rowid ASC
Rows matching the three terms in order get an even bigger bonus:
SELECT rowid, title, -- correct order bonus LIKE('%a%bc%de%', title) * 5 * 3 + -- column start bonus LIKE('a%', title) * 2 + LIKE('bc%', title) * 2 + LIKE('de%', title) * 2 + -- word start bonus LIKE('% a%', title) * 1 + LIKE('% bc%', title) * 1 + LIKE('% de%', title) * 1 + 0 AS bonus FROM my_table WHERE ((title LIKE '%a%') AND (title LIKE '%bc%') AND (title LIKE '%de%')) ORDER BY bonus DESC, title ASC, rowid ASC
And finally adding the match on ‘description’ secondary:
SELECT rowid, title, description, -- title is primary match: -- correct order bonus LIKE('%a%bc%de%', title) * 50 * 3 + -- column start bonus LIKE('a%', title) * 20 + LIKE('bc%', title) * 20 + LIKE('de%', title) * 20 + -- word start bonus LIKE('% a%', title) * 10 + LIKE('% bc%', title) * 10 + LIKE('% de%', title) * 10 + -- description is secondary match: -- correct order bonus LIKE('%a%bc%de%', description) * 5 * 3 + -- column start bonus LIKE('a%', description) * 2 + LIKE('bc%', description) * 2 + LIKE('de%', description) * 2 + -- word start bonus LIKE('% a%', description) * 1 + LIKE('% bc%', description) * 1 + LIKE('% de%', description) * 1 + 0 AS bonus FROM my_table WHERE ((title LIKE '%a%') AND (title LIKE '%bc%') AND (title LIKE '%de%')) OR ((description LIKE '%a%') AND (description LIKE '%bc%') AND (description LIKE '%de%')) ORDER BY bonus DESC, title ASC, description ASC, rowid ASC
You get the idea.
Funny thing is – the whole ranking logic doesn’t hit performance (at least for small texts in the two columns)!
So, key is:
- scan the table only once to find match candidates using the LIKE operator,
- use the LIKE function(!) plus weighting-factors to compute a bonus for each hit,
- evtl. add secondary matching columns.
P.S.: This post was inspired by a chat with Deesa on the way home riding False Creek Ferry.
Post a Comment