PostgreSQL: An ultimate strategy for full text search

4

Full text search is one of the most powerful features in PostgreSQL. In this blog entry, we'll start with a typical text search problem in its simplest form and see how to implement its query under PostgreSQL. After that, we'll evolve the problem bit by bit and see how we can modify our implementation accordingly until we come up with an ultimate strategy for the text search problem in its most generic form.

In its simplest form, our text search problem would be selecting all users whose name matches "Andy Williams":

SELECT * FROM users
WHERE to_tsvector(users.name) @@ to_tsquery('andy | williams')

That was pretty easy. Now, what if we're searching for a query in more than one column, from one or more tables, e.g. select users whose users.name, or profiles.full_name matches "Andy Williams". In such a case, we'll have to use the concatenation operator ('|') to concatenate all the columns we'll search in. Notice that we are using 'coalesce' to pick null values, because concatenating anything with null returns null:

SELECT * FROM users LEFT JOIN profiles
ON users.id = profiles.user_id
WHERE to_tsvector(coalesce(users.name, '') | coalesce(profiles.full_name, '')) @@ to_tsquery('andy | williams')

Both of the previous queries will return some records in no specific order, which is not the case in a typical text search problem. A common requirement would be ordering the results by relevance, i.e. how relevant is a record to the given search query. PostgreSQL offers a function, ts_rank_cd, which evaluates how relevant a vector is to a query.

SELECT users.*, profiles.*, ts_rank_cd(to_tsvector(coalesce(users.name, '') | coalesce(profiles.full_name, '')), to_tsquery('andy | williams')) as rank
FROM users LEFT JOIN profiles
ON users.id = profiles.user_id
WHERE to_tsvector(coalesce(users.name, '') | coalesce(profiles.full_name, '')) @@ to_tsquery('andy | williams')
ORDER BY rank DESC

Now, what if we are looking for "Andy Williams" in both users and their dependants (one to many relation), where joining will yield repeated records? In the simple case, where no relevance order is needed, we'll just eliminate repititions using DISTINCT. It's that simple because we don't care how many times a single record occures in the results, as we're not interested in its relevance:

SELECT DISTINCT * FROM users LEFT JOIN dependants
ON users.id = dependants.user_id
WHERE to_tsvector(coalesce(users.name, '') | coalesce(dependandts.name, '')) @@ to_tsquery('andy | williams')

Now let's look at the more complex, more realistic, most generic case. We need to return users records whose names or dependats' names match "Andy Williams", returning each users' result only once, with the most relevant record first. Simple distinction in this case is semantically wrong because a user with two "andy williams" dependants is more relevant than a user with only one.

An approach to solve this problem is to select a new aggregated column with the concatenation of dependants' names, including it in the search, and grouping results by user. The problem about this is that there is no pre-defined string-concatenation aggregate function. PostgreSQL offers a way to define custom aggregate functions so we can define our own concatenation function. However, we'll also face the group-by limitation of PostgreSQL - that is, all selected columns must appear in the group-by clause, which pretty much complicates the query.

This brings us to the ultimate strategy, which serves all the mentioned requirements and eliminates the mentioned problems. The solution suggests that all searchable columns are automatically aggregated in a new system-maintained column by a database trigger or an application level callback. Then our task would be as simple as searching that extra column for our query. In our example, an application callback could be used to watch over changes done to user (insert, update) and dependants (insert, update, delete). The callback's task is to re-calculate the 'textsearch' column that contains the concatenation of the user name and the names of all his dependants. Then the text-search query will be as simple as follows:

SELECT users.*, ts_rank_cd(to_tsvector(users.textsearch), to_tsquery('andy | williams')) as rank
FROM users
WHERE to_tsvector(users.textsearch) @@ to_tsquery('andy | williams')
ORDER BY rank DESC

Notice that an aggregation overhead is added to the update operations, while the search operations are now releaved from any joins. This is what makes this approach preferable, because select operations are much more frequent than update operations in typical applications. This results in an overall performance boost.

Written By:

Haitham Mohammad (e-haitham.blogspot.com)

Comments

1

What do you think of Lucene full text search engine compared to PostgreSQL's full text search.

2

@denis: Sure Lucene is an excelent solution for text search, but here is the trick. If you're gonna use Lucene, you have to be aware of taking the overhead of running a JVM instance for search server, and the overhead of blocking requests to this search server (like Solr for instance).

I know you can integrate Lucene directly in your application and avoid using a search server, but this applies only if your application is written in Java. All this overhead is not present in case of DB text search.

But to be honest, generally this overhead is worth taking only in case of searching another locale than english. PostgreSql TS is not that powerful when it comes to localization, on contrary to Lucene.

3

How to make the PostgreSQL's full text search handle spelling mistakes in the query string?

4

AFAIK, this can't be done in DB level. you can use some client-side spell checker.

eSpace podcast Prodcast

RSS iTunes