Hadoop Professionals

A Community for Hadoop Users

Hello all,

I hope all is well in the community. I am asking how to apply Hadoop to retrieve information from various blogs, news feeds, etc. in a particular fashion.

I have identified three groups of word pairs that are valuable to me. I would like to explore how these word pairs cluster across particular URLs in their respective blog spaces, news feeds, etc.

So, given that I have an expected output structure, i.e. three groups of words that I believe have distinct attributes, I am aware that I am trying to develop a supervised learning method.

The question is: is there a simple way to develop this procedure with Hadoop? I would like to search the web spaces for, let's say, one week and record the following information: the common occurrence of a particular group of words, the host name, and any other interesting meta tag information that I find relevant. I would then analyze the data with a supervised clustering technique that is supported by dendrogram and hierarchical cluster graphical output.

I would greatly appreciate any information.

Best,
Mark


Comment by Mark Cejas on February 15, 2010 at 7:56am
Hi Jason,

Thanks for the valuable information.

Mark

Comment by Jason Venner on February 15, 2010 at 6:54am
For web crawling to collect your data, consider Nutch or Heritrix.

The Mahout project provides a rich set of tools for clustering.

Carrot2 provides some decent visualization tools for small data sets.
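
As a rough illustration of the counting step described in the question (how often each word-pair group occurs per host), here is a minimal map/reduce-style sketch in plain Python. It does not use Hadoop, Nutch, or Mahout themselves; the word pairs, the `WORD_GROUPS` table, and the function names are all hypothetical placeholders, since the actual pairs are not given in the post. On a real cluster, the same map and reduce logic could probably be adapted to a Hadoop job, and the per-host counts could then serve as input features for a clustering tool.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical word-pair groups: the original post does not list the real pairs.
WORD_GROUPS = {
    "group_a": [("solar", "energy")],
    "group_b": [("stock", "market")],
    "group_c": [("climate", "policy")],
}


def map_page(url, text):
    """Map step: emit ((host, group), 1) for every word pair
    whose two words both occur somewhere in the page text."""
    host = urlparse(url).netloc
    words = set(text.lower().split())
    for group, pairs in WORD_GROUPS.items():
        for first, second in pairs:
            if first in words and second in words:
                yield (host, group), 1


def reduce_counts(mapped):
    """Reduce step: sum the emitted values for each (host, group) key."""
    counts = Counter()
    for key, value in mapped:
        counts[key] += value
    return counts


def count_cooccurrences(pages):
    """Run the map step over all (url, text) pages, then reduce."""
    mapped = (kv for url, text in pages for kv in map_page(url, text))
    return reduce_counts(mapped)


if __name__ == "__main__":
    # Made-up sample pages standing in for crawled content.
    sample_pages = [
        ("http://blog.example.com/a", "Solar energy is growing"),
        ("http://news.example.org/x", "The stock market fell"),
    ]
    for (host, group), n in sorted(count_cooccurrences(sample_pages).items()):
        print(host, group, n)
```

The tokenization here is a simple whitespace split, so punctuation attached to a word would prevent a match; a real crawl would need proper text extraction and tokenization first.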

© 2012   Created by Jason Venner.