Discussion, code samples and video demos of new technologies; including Web 2.0 startups, Google AppEngine, Ruby on Rails, PHP, Visual Studio Team System, Team Foundation Server and .NET.

Wednesday, March 5, 2008

Text search in TFS - something to Hadoop?

Hey guys,

I thought I would post about a project I'm starting as a way of learning more about Hadoop - the cluster computing framework that I posted about earlier.

I think of Hadoop as the triumph of brute force - that is, simple algorithms that don't scale on a single machine become practical again because Hadoop lets you handle scaling by adding more machines to the cluster.

What I had in mind is trying to implement free text search for work items in Team Foundation Server.

One of the problems of trying to implement text search in TFS (or anything for that matter) is making sure your solution can scale.

A very simple algorithm for enabling TFS search would look something like:

1. Dump the TFS SQL Server tables related to work items out to a text file
2. Index that text file with a search engine
3. Repeat on some schedule to keep the indexes current
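
To make the three steps concrete, here is a minimal single-machine sketch. Everything in it is an assumption for illustration: the WORK_ITEMS list is a hypothetical stand-in for rows read from the TFS tables, and a small in-memory inverted index stands in for a real search engine.

```python
import re
from collections import defaultdict

# Hypothetical work items standing in for rows read from the TFS tables.
WORK_ITEMS = [
    (101, "Build fails on check-in"),
    (102, "Search box ignores work item title"),
    (103, "Nightly build is slow"),
]

def dump(work_items):
    """Step 1: flatten each work item into one tab-separated line of text."""
    return "\n".join(f"{item_id}\t{title}" for item_id, title in work_items)

def build_index(dump_text):
    """Step 2: map each lowercase word to the set of work item ids containing it."""
    index = defaultdict(set)
    for line in dump_text.splitlines():
        item_id, text = line.split("\t", 1)
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(int(item_id))
    return index

def search(index, word):
    """Return the sorted ids of work items that contain the word."""
    return sorted(index.get(word.lower(), set()))

# Step 3 would simply rerun dump() and build_index() on a schedule.
index = build_index(dump(WORK_ITEMS))
print(search(index, "build"))
```

Both the dump and the index are rebuilt from scratch on every run here, which is exactly the part that gets slower as the data grows.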

The problem of course is that dumping the SQL tables will eventually eat up a lot of hard drive space and that indexing those files will take longer as you get more data.

I think this is an opportunity to apply Hadoop. Here is what I have in mind.

1. Dump the TFS SQL Server tables related to work items out to a text file

There isn't a way to avoid reading the tables and writing out files. Hopefully, reading the tables won't take too much time. To reduce IO bottlenecks as much as possible, I'll write the files out to HDFS (the Hadoop Distributed File System). This file system is an open source design modeled on the fabled Google File System (GFS). It uses commodity hardware to provide robust storage that grows as you add machines. I'm hoping writing files out to this system will be reasonably fast. I can write out a new file for each work item if I want to. The other solution here would be to use Amazon S3, but I don't think S3 storage can be indexed as easily. So it's the Hadoop file system for now.
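
As a rough sketch of the "one file per work item" idea, the snippet below writes each item to its own text file. It is only an illustration under assumptions: a local directory stands in for HDFS, and the ITEMS dictionary, the field names, and the workitem-<id>.txt naming scheme are all hypothetical.

```python
import tempfile
from pathlib import Path

def write_work_items(work_items, out_dir):
    """Write one text file per work item and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for item_id, fields in work_items.items():
        path = out_dir / f"workitem-{item_id}.txt"
        # One "Field: value" line per field, in a stable (sorted) order.
        lines = [f"{name}: {value}" for name, value in sorted(fields.items())]
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        paths.append(path)
    return paths

# Hypothetical work items; real ones would come from the TFS tables.
ITEMS = {
    101: {"Title": "Build fails on check-in", "State": "Active"},
    102: {"Title": "Search box ignores title", "State": "Closed"},
}

# A temporary local directory stands in for an HDFS path here.
with tempfile.TemporaryDirectory() as tmp:
    written = write_work_items(ITEMS, tmp)
    print([p.name for p in written])
```

HDFS tends to favor fewer, larger files, so batching several work items into each file may turn out to be the better layout.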

2. Index that text file with a search engine

We'll actually be indexing multiple text files here. I'm planning on using the Apache Lucene search engine to do this. Lucene and Hadoop are both Apache projects, and Hadoop grew out of work done for Nutch, a web search engine built on Lucene. I know that Rackspace.com is using Lucene and Hadoop together to index their log files and they have shared their experience. There should be some things I can leverage.

3. Repeat on some schedule to keep the indexes current

Since I'm using Hadoop to divide up my indexing, I should be able to repeat this as much as I want to.
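
To show what "dividing up the indexing" might look like, here is a map/reduce-style sketch in plain Python. It does not use Hadoop or Lucene; the function names and the DOCS sample are hypothetical, and the shuffle step only imitates the grouping that the framework would do between the map and reduce phases.

```python
import re
from collections import defaultdict
from itertools import chain

def map_doc(doc_id, text):
    """Map: emit one (word, doc_id) pair per distinct word in a document."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [(word, doc_id) for word in words]

def shuffle(pairs):
    """Group emitted pairs by word, imitating the step between map and reduce."""
    grouped = defaultdict(list)
    for word, doc_id in pairs:
        grouped[word].append(doc_id)
    return grouped

def reduce_word(word, doc_ids):
    """Reduce: produce the sorted posting list for one word."""
    return word, sorted(set(doc_ids))

# Hypothetical work item text keyed by work item id.
DOCS = {101: "Build fails on check-in", 103: "Nightly build is slow"}

# Each map_doc call is independent, so a cluster could run them in parallel.
mapped = chain.from_iterable(map_doc(i, t) for i, t in DOCS.items())
index = dict(reduce_word(w, ids) for w, ids in shuffle(mapped).items())
print(index["build"])
```

Because each document is mapped independently, adding machines should let more documents be processed at once, which is what would make frequent re-indexing affordable.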

Where do I get a cluster?

I don't have a cluster handy, so I'll rent some time on Amazon Elastic Compute Cloud (EC2). Looking at their prices, I think I can do the development for less than $20 or so.

Anyways, just wanted to share what I had in mind. As I make my way through this project, I'll keep you posted.

Thanks!

Eric.

3 Comments:

Blogger alessiot said...

Hi Eric,

I'm trying to do something similar. Since you want to combine Lucene and Hadoop, why don't you have a look at Nutch?

-Alessio

April 5, 2008 1:33 PM

 
Blogger ericlee said...

Cool, I'll check it out thanks!

April 9, 2008 12:11 PM

 
Blogger RU said...

Hi Eric,

I'm also working on using Hadoop to distribute my indexing using Lucene. Rackspace does have an implementation of it but I wasn't able to find anything other than numbers and statistics. Would be great if you could share your experience if you have come up with some kind of implementation.

-Ritesh

July 16, 2008 3:46 PM

 
