PostgreSQL: An ultimate strategy for full text search
4
Full text search is one of the most powerful features in PostgreSQL. In this blog entry, we'll start with a typical text search problem in its simplest form and see how to implement its query under PostgreSQL. After that, we'll evolve the problem bit by bit and see how we can modify our implementation accordingly until we come up with an ultimate strategy for the text search problem in its most generic form.
In its simplest form, our text search problem would be selecting all users whose name matches "Andy Williams":
SELECT * FROM users
WHERE to_tsvector(users.name) @@ to_tsquery('andy | williams')
That was pretty easy. Now, what if we're searching for a query in more than one column, from one or more tables, e.g. select users whose users.name, or profiles.full_name matches "Andy Williams". In such a case, we'll have to use the concatenation operator ('|') to concatenate all the columns we'll search in. Notice that we are using 'coalesce' to pick null values, because concatenating anything with null returns null:
SELECT * FROM users LEFT JOIN profiles
ON users.id = profiles.user_id
WHERE to_tsvector(coalesce(users.name, '') | coalesce(profiles.full_name, '')) @@ to_tsquery('andy | williams')
Both of the previous queries will return some records in no specific order, which is not the case in a typical text search problem. A common requirement would be ordering the results by relevance, i.e. how relevant is a record to the given search query. PostgreSQL offers a function, ts_rank_cd, which evaluates how relevant a vector is to a query.
SELECT users.*, profiles.*, ts_rank_cd(to_tsvector(coalesce(users.name, '') | coalesce(profiles.full_name, '')), to_tsquery('andy | williams')) as rank
FROM users LEFT JOIN profiles
ON users.id = profiles.user_id
WHERE to_tsvector(coalesce(users.name, '') | coalesce(profiles.full_name, '')) @@ to_tsquery('andy | williams')
ORDER BY rank DESC
Now, what if we are looking for "Andy Williams" in both users and their dependants (one to many relation), where joining will yield repeated records? In the simple case, where no relevance order is needed, we'll just eliminate repititions using DISTINCT. It's that simple because we don't care how many times a single record occures in the results, as we're not interested in its relevance:
SELECT DISTINCT * FROM users LEFT JOIN dependants
ON users.id = dependants.user_id
WHERE to_tsvector(coalesce(users.name, '') | coalesce(dependandts.name, '')) @@ to_tsquery('andy | williams')
Now let's look at the more complex, more realistic, most generic case. We need to return users records whose names or dependats' names match "Andy Williams", returning each users' result only once, with the most relevant record first. Simple distinction in this case is semantically wrong because a user with two "andy williams" dependants is more relevant than a user with only one.
An approach to solve this problem is to select a new aggregated column with the concatenation of dependants' names, including it in the search, and grouping results by user. The problem about this is that there is no pre-defined string-concatenation aggregate function. PostgreSQL offers a way to define custom aggregate functions so we can define our own concatenation function. However, we'll also face the group-by limitation of PostgreSQL - that is, all selected columns must appear in the group-by clause, which pretty much complicates the query.
This brings us to the ultimate strategy, which serves all the mentioned requirements and eliminates the mentioned problems. The solution suggests that all searchable columns are automatically aggregated in a new system-maintained column by a database trigger or an application level callback. Then our task would be as simple as searching that extra column for our query. In our example, an application callback could be used to watch over changes done to user (insert, update) and dependants (insert, update, delete). The callback's task is to re-calculate the 'textsearch' column that contains the concatenation of the user name and the names of all his dependants. Then the text-search query will be as simple as follows:
SELECT users.*, ts_rank_cd(to_tsvector(users.textsearch), to_tsquery('andy | williams')) as rank
FROM users
WHERE to_tsvector(users.textsearch) @@ to_tsquery('andy | williams')
ORDER BY rank DESC
Notice that an aggregation overhead is added to the update operations, while the search operations are now releaved from any joins. This is what makes this approach preferable, because select operations are much more frequent than update operations in typical applications. This results in an overall performance boost.
Written By:
Haitham Mohammad (e-haitham.blogspot.com)
Comments
eSpace podcast Prodcast
Archive
- September 2011
- April 2011
- March 2011
- December 2010
- November 2010
- September 2010
- August 2010
- July 2010
- June 2010
- April 2010
- March 2010
- November 2009
- October 2009
- September 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009
- January 2009
- November 2008
- October 2008
- September 2008
- July 2008
- June 2008
- May 2008
- April 2008
- March 2008
- January 2008
- April 2007
- March 2007
Latest Comments
- SpectraMind Commented on Egypt Wins UK's National Outsourcing Association Award
- Rofaida Awad Commented on Go Egypt Go!
- Different Mike Commented on Only idiots change their iPhone root password!
- Mike Commented on Only idiots change their iPhone root password!
- smile Commented on Only idiots change their iPhone root password!

