tsearch2 for Rails applications

Having finally managed to squirrel out some free time to work on Cookmarking. I really wanted to add a bit of functionality that we were lacking in the initial implementation phase, one of the things we reckoned was really important was the ability to search bookmarks.

The way you implement a search in a Rails application can vary, from really simple ilike / like implementations such as those given by Searchlogic to something a little more sophisticated such as Thinking Sphinx or Acts As Ferret.

There are other contenders to search which are largely ignored by the Rails community – of which I reckon one of the more stronger ones is tsearch2. The advantage that tsearch2 has over Searchlogic is that much like Thinking Sphinx and Acts As Ferret is that it is a full text search, the main advantage being performance. There are other points to consider such as the inclusion of stop words (e.g. to ignore the word “the” in searches) as well as being able to rank results and only having to maintain a single index (the amalgamation of all your fields for search).

So, why tsearch2 over Thinking Sphinx and Acts As Ferret? the main thing being that it is built into Postgres and does not require an external daemon or any other gems to make it work. It will work right out of the box with Postgres 8.3 and without having to install pl/pgsql in Postgres 9.0 and is blazingly fast when returning a small-medium resultset (150k records).

Initial preparation

Preparation is simple – if you are working on a new application, its as simple as having a column of type tsvector in the table for which you want to search as well as a gin index, however, if you are adding to an existing application – you will need to do the following:

Create tsvectors for existing records

The best way to create tsvectors for existing content is to write a function:

-- needed for existing and new content

alter table bookmarks add column tsv tsvector;
create index bookmarks_tsv_idx on bookmarks using gin(tsv);

-- needed only for existing content

create or replace function refresh_bookmarks_tsv() 
returns void as $$
begin
  update bookmarks
  set tsv = q.weighted_tsv
  from
    (select
     bookmarks.id,
     setweight(to_tsvector(coalesce(bookmarks.title, '')), 'A')
     || setweight(to_tsvector(coalesce(bookmarks.description, '')), 'C')
     || setweight(to_tsvector(coalesce(string_agg(tags.name, ' '), '')), 'B') 
          as weighted_tsv
     from bookmarks
     left join user_bookmarks on user_bookmarks.bookmark_id = bookmarks.id
     left join taggings on (taggings.taggable_type = 'UserBookmark' 
                        and taggings.taggable_id = user_bookmarks.bookmark_id)
     left join tags on taggings.tag_id = tags.id
     group by bookmarks.id, bookmarks.title, bookmarks.description) q
  where q.id = bookmarks.id;
end
$$ language plpgsql;

select refresh_bookmarks_tsv();

Note here, I have used acts_as_taggable_on for my tags. Lets have a rundown of what exactly this does

string_agg

string_agg is used to turn an aggregated set of results into a string, so for example if the tags for a bookmark were chicken, garlic, bbq – you would get a result from string_agg of “chicken garlic bbq”, which is perfect for the vector conversion

setweight

Setweight is used here for ranking later on, so that in this the title is the most important thing, followed by, the tags then followed by the description

to_tsvector

Very simple, turns a string into a tsvector

Write a function for new records to add a tsv

create or replace function bookmarks_update_tsv(integer) returns void as $$
begin
  update bookmarks
  set tsv = q.weighted_tsv
  from
    (select
     bookmarks.id,
     setweight(to_tsvector(coalesce(bookmarks.title, '')), 'A')
     || setweight(to_tsvector(coalesce(bookmarks.description, '')), 'C')
     || setweight(to_tsvector(coalesce(string_agg(tags.name, ' '), '')), 'B') 
        as weighted_tsv
     from bookmarks
     left join user_bookmarks on user_bookmarks.bookmark_id = bookmarks.id
     left join taggings on (taggings.taggable_type = 'UserBookmark' 
                        and taggings.taggable_id = user_bookmarks.bookmark_id)
     left join tags on taggings.tag_id = tags.id
     group by bookmarks.id, bookmarks.title, bookmarks.description) q
  where q.id = bookmarks.id and bookmarks.id = $1;
end
$$ language plpgsql;

This is very similar to the function above, with the difference that it takes a parameter – that being an id and then has a where clause of that id. This function will be useful for triggers

Triggers on tables where insertion / updates / deletion matters

Firstly, we need to look at the shape of the application function and decide when we want the function applied, so most likely it will need to run on insertion of a bookmark and also on the update of a bookmark

create or replace function bookmarks_tsv_trigger() returns trigger as $$
begin
  if (tg_op = 'INSERT') then
    perform bookmarks_update_tsv(new.id);
  elsif (tg_op = 'UPDATE') then
    if row(new) is distinct from row(old) then
      perform bookmarks_update_tsv(new.id);
    end if;
  end if;
  return null;
end;
$$ language plpgsql;

create trigger bookmarks_tsv_tr after insert or update on bookmarks
  for each row execute procedure bookmarks_tsv_trigger();

Finally, we want the function also to run on either the insertion / update or deletion of a tagging

create or replace function taggings_tsv_trigger() returns trigger as $$
declare
  rec record;
begin
  if (tg_op = 'INSERT') then
    if (new.taggable_type = 'UserBookmark') then
      select * into rec from user_bookmarks 
        where user_bookmarks.id = new.taggable_id;
      perform bookmarks_update_tsv(rec.bookmark_id);
    end if;
  elsif (tg_op = 'UPDATE') then
    if (new.taggable_type = 'UserBookmark') then
      select * into rec from user_bookmarks 
        where user_bookmarks.id = new.taggable_id;
      perform bookmarks_update_tsv(rec.bookmark_id);
    end if;
  elsif (tg_op = 'DELETE') then
    if (old.taggable_type = 'UserBookmark') then
      select * into rec from user_bookmarks 
        where user_bookmarks.id = old.taggable_id;
      perform bookmarks_update_tsv(rec.bookmark_id);
    end if;
  end if;
  return null;
end;
$$ language plpgsql;


create trigger taggings_tsv_tr after insert or update or delete on taggings
  for each row execute procedure taggings_tsv_trigger();

Migration

All of this needs to go inside a migration – for all migrations I use an execute – since I am more comfortable writing SQL rather than the ActiveRecord migration DSL. The up looks like:

  def self.up
    execute <<EOF

  SQL HERE

EOF
  end

I also tend not to write downs, since I normally use a production dump for development, besides that – migrations are run inside a transaction – so, if anything fails, it just needs to be re-run.

Usage

In order to use tsearch2, its as simple as any other query:

scope :search, lambda { |search|
  if search.blank?
    select("bookmarks.*,
            coalesce(avg(user_bookmarks.rating)::int, 0) as average_rating, 
            coalesce(count(user_bookmarks.review), 0) as number_of_reviews")
  else
    query = sanitize_sql_array ["plainto_tsquery(?)", search]
    select("bookmarks.*,
            coalesce(avg(user_bookmarks.rating)::int, 0) as average_rating, 
            coalesce(count(user_bookmarks.review), 0) as number_of_reviews,
            ts_rank_cd(tsv, #{query}, 32) as rank").where("tsv @@ #{query}")
  end.joins("left join user_bookmarks on user_bookmarks.bookmark_id = bookmarks.id").
      group(columns.collect {|s| "bookmarks.#{s.name}"}.join(", "))
}

And there you have it – simple, no dependancy search powered by SQL.

“The way you implement a search in a Rails application can vary, from really simple ilike / like implementations such as those given by Searchlogic to something a little more sophisticated such as Thinking Sphinx or Acts As Ferret”