Discussion, code samples and video demos of new technologies; including Web 2.0 startups, Google AppEngine, Ruby on Rails, PHP, Visual Studio Team System, Team Foundation Server and .NET.

Monday, March 10, 2008

Quick Thing with TFS and Excel Pivot Tables

Hey guys,

Excel pivot tables are a great way to quickly pull data from a TFS data warehouse. I came across one for looking at code churn that I thought I would share. It's pretty simple, but I thought the information you get back could be really useful.

Here is a link to the video: http://www.screencast.com/t/5F8s9NvnelK

Thanks!

Eric.

Friday, March 7, 2008

So many SDKs, so little time

It just occurred to me last night that we are the middle of a rising swell of APIs and SDKs. For example, in just the last year, we have seen:
  • FaceBook open up with an API, thus enabling developers to tap into an audience that is 30 million strong
  • MySpace, LinkedIn, et al announcing support for Google's OpenSocial API
  • Apple's iPhone announcing their SDK just yesterday
  • SkyHook releasing their super-cool IP-based location services
  • Amazon's Web Services have been around for a few years now, but it seems like they are really starting to get some mainstream adoption - just look at how many links to downloads have s3 in them now.
And these are just the non-Microsoft APIs. Based on all the announcements at MIX, I'm sure we'll see a flurry of cloud-based APIs from Microsoft shortly.

Amazing, times for sure.

Eric.

Wednesday, March 5, 2008

TFS and Hadoop Searching - Breaking Ground

Hey guys,

OK, who knows how this will turn out, but let's break some ground and see if we can build a credible search solution with Hadoop and TFS.


Let's make an assumption here about our solution. In order to use a free text search engine like Lucene, we need to have text files that contain textual representations of our work items. Once we have these files, we can index them and presumably search.

Of course we don't want to keep dumping our database out as text. We can probably do a few things to optimize this. We can:
  • dump periodically
  • read the work item changed events to know when work items have changed
  • poll our database tables for changes and dump if necessary
Before we make that decision, we should look at the tables that represent our work items. In a TFS database, they are in the TfsWorkItemTracking database. The tables we want to work with are shown below.








For now, I'm just going to worry about the WorkItemsAre and WorkItemsWere tables. I know the WorkItemsLongTexts has the descriptions, which is ripe for indexing, but I'll come back to that table later on.

To dump these tables out as text, I'm just going to use osql to do it. A command like the following will dump out these tables as files:

osql -S tfsrtm08 -d TfsWorkItemTracking -Q "select * from WorkItemsAre" -o workitems.csv -w 1024 -s ,

This gives me a comma separated file where each work item is a row. This might be a pattern that we use for mapping and reducing later on. Then again, it might not be, because we know each row is a different work item. Anyways, we'll figure that out later.

We can dump the WorkItemsWere table in the same way. That output is more applicable to mapping and reducing. Each row has column 'ID' that indicates the work item ID where a field has changed.

Off the top of my head, what I think we'll end up doing is dumping out the WorkItemsAre table and the WorkItemsWere table into text. Then running a map/reduce algorithm on both.

In the WorkItemsAre text file, our map function will just divide up the input file into rows. Each row becomes a job. The reduce function for this is pretty simple - we will just create a directory for the work item and create an initial Lucene index that indexes these text fields.

I think we may just map/reduce the WorkItemsAre table the very first time.

In the WorkItemsWere table, our map function will key off of the ID column in that table. In the reduce function, we will index the text for each work item and store that index in its appropriate folder.

After the very first time, I think we will just map/reduce the WorkItemsWere table.

I'm not going to worry about the WorkItemsLongText table right now, but I think we would map/reduce it like we do the WorkItemsAre table.

Come to think of it, I might not do the indexing in the reduce functions as described. I might just write out the appropriate files. The actual indexing could be another map/reduce set of jobs.

So the very first easy part of figuring out how to dump SQL to text files with the least amount of code possible is done. We'll just run osql.exe.

Next, I think I'll see how I can use Hadoop to divide up each of those files.

Stay tuned...

Text search in TFS - something to Hadoop?

Hey guys,

I thought I would post about a project I'm starting as a way of learning more about Hadoop - the cluster computing framework that I posted about earlier.

I think of Hadoop as the triumph of brute force - that is, simple algorithms that don't scale are now possible again because Hadoop helps you handle scaling by enabling you to add more clusters.

What I had in mind is trying to implement free text search for work items in Team Foundation Server.

One of the problems of trying to implement text search in TFS (or anything for that matter) is making sure your solution can scale.

A very simple algorithm for enabling TFS search would look something like:

1. Dump the TFS SQL Server tables related to work items out to to a text file
2. Index that text file with a search engine
3. Repeat on some schedule to keep the indexes current

The problem of course is that dumping the SQL tables will eventually eat up a lot of hard drive space and that indexing those files will take longer as you get more data.

I think this is an opportunity to apply Hadoop. Here is what I have in mind.

1. Dump the TFS SQL Server tables related to work items out to to a text file

There isn't a way to avoid reading the tables and writing out files. Hopefully, reading the tables won't take too much time. To reduce IO bottlenecks as much as possible, I'll write out the files to HDFS (Hadoop Distributed File System). This file system is an open source implementation of the fabled Google distributed file system. It uses commodity hardware to provide a robust, endless amount of hard drive space. I'm hoping writing files out to this system will be reasonably fast. I can write out a new file for each work item if I want to. The other solution here would be to use Amazon S3, but I don't think S3 storage can be indexed as easily. So it's the Hadoop file system for now.

2. Index that text file with a search engine

We'll actually be indexing multiple text files here. I'm planning on using the Apache Lucene search engine to do this. Lucene and Hadoop belong to the same project. I think Hadoop actually came out of some of the work that was being done for Lucene. I know that Rackspace.com is using Lucene and Hadoop together to index their log files and they have shared their experience. There should be some things I can leverage.

3. Repeat on some schedule to keep the indexes current

Since I'm using Hadoop to divide up my indexing, I should be able to repeat this as much as I want to.

Where do I get a cluster?

I don't have a cluster handy, so I'll rent some time on Amazon Elastic Computing. Looking at their prices, I think I can do the development for less than $20 or so.

Anyways, just wanted to share what I had in mind. As I make my way through this project, I'll keep you posted.

Thanks!

Eric.