Full-text searching with MySQL
MySQL’s full-text search functions provide a simple framework for an easily implemented, approximate site search. Many sites, written in an interpreted language and powered by MySQL, can use MySQL’s full-text search to avoid third party dependencies.
The basics
The basics of the MySQL full-text search functions are well-documented in the MySQL online documentation. For those lacking patience, here is a quick rundown.
Full-text searching is somewhat akin to a LIKE
condition, but is much faster, requiring a FULLTEXT
index to be created for the table columns targeted in the search. To search the title
and description
columns of a table, entries
, the following statement would create the proper index:
ALTER TABLE entries ADD FULLTEXT(title, description)
To search these columns for the text, “python threading,” the MATCH...AGAINST
functions are used:
SELECT id, MATCH(title, description) AGAINST ('python threading') AS score FROM entries ORDER BY score DESC
Notice that we keep the result of the match. The value returned is a float representing the relevance of the match. The higher the number, the more relevant the match.
There are several caveats to the full-text search. In particular, any words that are common between many entries are treated as noise and their relevance in any search is diminished. This means that were every article in entries
to be about threading in Python, searching for “python threading” may not return extremely relevant results. Refer to the MySQL docs for more information.
The hard part
If the content to be searched is not conveniently located in one table, things get more complex. In this case, a method must be devised to create an intermediary table to contain the search target.
This might be accomplished with a cron script that aggregates the information nightly or using stored procedures to keep the target table updated.
Refining results
A common case is to weight the search to favor more recent results. Assuming that each entry has a DATETIME
field named timestamp
, this is easily accomplished by using the entry’s age to modify the score.
For an even reduction to the score based on the article’s age, divide the score by the age, which is determined with DATEDIFF(NOW(), timestamp)
.
(MATCH(title, description) AGAINST ('python threading'))/GREATEST(1, DATEDIFF(NOW(), timestamp))
Since DATEDIFF
returns the difference in days, an entry written today could cause division by zero.
GREATEST
means that entries written today and yesterday have equal weight, but prevents results from omitting today’s articles.
A quick test of this will show that results become wildly incorrect after a few days as the text match score begins to diminish further with age. This effect can be reduced by taking the LOG
of the age, making the divisor increase less and less the greater the age.
LOG(GREATEST(1, DATEDIFF(NOW(), timestamp)))
The use of LOG
causes a steep drop initially, smothing over time. For a less dramatic effect, substituting the square root causes a similar drop in the weight of the entry’s age over time, but diminishing less starkly over time and without the initial steep drop.
SQRT(GREATEST(1, DATEDIFF(NOW(), timestamp)))
The complete SQL statement is now:
SELECT id, (MATCH(title, description) AGAINST ('python threading'))/SQRT(GREATEST(1, DATEDIFF(NOW(), timestamp))) AS score FROM entries ORDER BY score DESC
Great Article ! In just few lines you have explained exactly what I wanted. It helped me to kick start with FULL TEXT Search within just few minutes.
Thank You so much.
Is MySQL full-text search really scalable?