TFS and Hadoop Searching - Breaking Ground
Hey guys,
OK, who knows how this will turn out, but let's break some ground and see if we can build a credible search solution with Hadoop and TFS.
Let's make an assumption here about our solution. In order to use a free text search engine like Lucene, we need to have text files that contain textual representations of our work items. Once we have these files, we can index them and presumably search.
Of course we don't want to keep dumping our database out as text. We can probably do a few things to optimize this. We can:
- dump periodically
- read the work item changed events to know when work items have changed
- poll our database tables for changes and dump if necessary

For now, I'm just going to worry about the WorkItemsAre and WorkItemsWere tables. I know the WorkItemsLongTexts has the descriptions, which is ripe for indexing, but I'll come back to that table later on.
To dump these tables out as text, I'm just going to use osql to do it. A command like the following will dump out these tables as files:
osql -S tfsrtm08 -d TfsWorkItemTracking -Q "select * from WorkItemsAre" -o workitems.csv -w 1024 -s ,
This gives me a comma separated file where each work item is a row. This might be a pattern that we use for mapping and reducing later on. Then again, it might not be, because we know each row is a different work item. Anyways, we'll figure that out later.
We can dump the WorkItemsWere table in the same way. That output is more applicable to mapping and reducing. Each row has column 'ID' that indicates the work item ID where a field has changed.
Off the top of my head, what I think we'll end up doing is dumping out the WorkItemsAre table and the WorkItemsWere table into text. Then running a map/reduce algorithm on both.
In the WorkItemsAre text file, our map function will just divide up the input file into rows. Each row becomes a job. The reduce function for this is pretty simple - we will just create a directory for the work item and create an initial Lucene index that indexes these text fields.
I think we may just map/reduce the WorkItemsAre table the very first time.
In the WorkItemsWere table, our map function will key off of the ID column in that table. In the reduce function, we will index the text for each work item and store that index in its appropriate folder.
After the very first time, I think we will just map/reduce the WorkItemsWere table.
I'm not going to worry about the WorkItemsLongText table right now, but I think we would map/reduce it like we do the WorkItemsAre table.
Come to think of it, I might not do the indexing in the reduce functions as described. I might just write out the appropriate files. The actual indexing could be another map/reduce set of jobs.
So the very first easy part of figuring out how to dump SQL to text files with the least amount of code possible is done. We'll just run osql.exe.
Next, I think I'll see how I can use Hadoop to divide up each of those files.
Stay tuned...


0 Comments:
Post a Comment
Links to this post:
Create a Link
<< Home