Sunday, February 12, 2012

Building a time sensitive full-text search

I have spent a little time trying to build a time sensitive full-text search on a fairly large database (16M rows, with one of the full text columns an ntext column).

The goal for the search results is to have them time-sensitive - as a basic approach I am taking the relevance of the result, and dividing it by the number of days since the result was entered into the database. this works quite well, but I suspect that I can do something better.

The current approach returns the top 1000 most relevant results and then weights the relevance using the age of the result. It also limits the top 1000 to results that have occurred in the last X weeks.

I am only really interested in the top 100 or so of these weighted results.

As I increase the number of results to use as the starting point, the performance degrades quite quickly, and the results change - there are less relevant search results that actually come to the top of the weighted result.

So...the question is...is there a way to build the time-sensitivity into the relevance itself, or, any other suggestions as to how I could get good time sensitive results from full-text search without sacrificing performance.

thanks in advance

Nick, you've got quite a neat approach there. Unfortunately, the ranking algorithm in SQL Server is not customizable so it is not possible to integrate the value from another column into the ranking value. An alternative workaround would be to encode the date into a special string format which can later be used during full text search to limit the hits in a certain time range by using prefix search. This can reduce the performance degradation that you are experiencing.

Related to the performance, it isn’t clear from your message if you see the performance degrading when you increase the TOP n beyond 1000 or 100 rows?

Also, when you get the initial 1000 most relevant results, are the bottom rows in the initial ranking showing up in the top or bottom 10~20 in the final weighted 100 rows? If they’re showing up in the top 10~20 weighted rows then it could be that the time sensitivity is over weighted which may cause the less relevant results to bubble to the top after being weighted. Of course, you’d be the best judge of the relevance of the results.

I hope this helps.

Thanks

Sara

|||The degradation comes after about 1000 rows, although the FTS is set up with a good amount of RAM for caching, so it does depend a lot on whether the results are cached.

As to the weighting of the time sensitivity, we can get items from the bottom rows in the initial result set showing up at the top of the weighted result set. This is not an unwanted result, but the question of whether the weighting is "correct" is a valid one. I dont pretend to know the answer at present - it feels like a bit of a black art.|||Nick,
In addition to what Sara has asked for, could you post the full output of SELECT @.@.version as for other than SQL Server 2005, the OS platform is an important data point... Could you post your CONTAINSTABLE or FREETEXTABLE code? Do you see the performance degradation if you use the Top_N_Rank parameter?

Yes, this is somewhat of a black art Smile, but ther are other methods of doing this that do not involving joining other tables to get a "best bet" result.

Regards,
John
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/

No comments:

Post a Comment