Hadoop Professionals

A Community for Hadoop Users

Hello all,
I am new to Hadoop and am reading the books suggested in the forum. I want some serious help with Hadoop. I am planning to do my master's project on Hadoop, but I have not been able to settle on a project objective. Can someone help me by suggesting some objectives that I could consider for my master's project?

Thanks in advance

Prajyot

Tags: project, topic


Replies to This Discussion

There are a number of areas where Hadoop could use some help.

The number one tool would be a setup verification tool that actually launches tasks on all of the cluster machines and verifies that all of the required communication paths are open, and that each machine can communicate correctly with every other machine in the cluster.
Providing informative feedback on the errors found would be magic.
The common issues are:
1) keyless ssh
2) java home and java tools
3) incorrect hostname to ip address mapping for cluster machines, on some nodes
4) firewalls preventing connections or data transfers between the machines on the various configured ports
5) checking for rate limiting on the network connections
6) checking for permissions and free space on the various file system locations being configured for write access
7) warning if any of the configured file system locations are in /tmp or on a tmpfs file system (and will be deleted unexpectedly).
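To make a couple of these checks concrete, here is a minimal Python sketch of items 3, 6 and 7 for a single node. The function names, the expected-IP table and the free-space threshold are all made up for illustration; a real tool would read the hosts and directories from the cluster configuration and run the checks on every machine. It does not detect tmpfs mounts, only paths under /tmp.

```python
import os
import shutil
import socket


def check_hostname_mapping(expected):
    """expected maps hostname -> the IP the cluster config assumes (hypothetical input).

    Returns a list of human-readable problems found on this node.
    """
    problems = []
    for host, want_ip in expected.items():
        try:
            got_ip = socket.gethostbyname(host)
        except OSError as exc:  # resolution failures are OSError subclasses
            problems.append(f"{host}: cannot resolve ({exc})")
            continue
        if got_ip != want_ip:
            problems.append(f"{host}: resolves to {got_ip}, expected {want_ip}")
    return problems


def check_storage_dir(path, min_free_bytes):
    """Check one configured storage directory for the usual mistakes."""
    problems = []
    # Look at both the literal path and the symlink-resolved one.
    for candidate in {os.path.abspath(path), os.path.realpath(path)}:
        if candidate == "/tmp" or candidate.startswith("/tmp/"):
            problems.append(f"{path}: is under /tmp and may be deleted unexpectedly")
            break
    if not os.path.isdir(path):
        problems.append(f"{path}: not a directory")
        return problems
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path}: not writable by the current user")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"{path}: only {free} bytes free, want at least {min_free_bytes}")
    return problems
```

Each function returns a list of messages rather than raising, so the tool can collect every problem on a node and report them together.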

Next would be an ongoing tool that monitors the cluster and informs you, in a clear and simple way, when a service (Datanode/Tasktracker) fails, or when a machine drops out or becomes unresponsive.
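The simplest form of such a check is a TCP probe of each daemon's port. The sketch below assumes you keep a table of service name to (host, port); the names and ports are placeholders, so use whatever your configuration sets. A successful connect only shows that something is listening, not that the service is healthy.

```python
import socket


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, unresolvable
        return False


def failed_services(services, timeout=3.0):
    """services maps a label to (host, port); returns the labels that did not answer."""
    return [name for name, (host, port) in services.items()
            if not probe(host, port, timeout)]


# Example table (hostnames and ports are hypothetical):
# services = {"datanode@node1": ("node1", 50010), "tasktracker@node1": ("node1", 50060)}
# print(failed_services(services))
```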

It is very easy for the secondary namenode to exit, or to be unable to update the fsimage, with no user-visible indication that this has happened.
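One cheap way to catch this is to watch the age of the fsimage file. A rough sketch, assuming you know where the image lives on disk and how old is too old; both the path and the threshold are yours to choose, for example a few multiples of your configured checkpoint interval:

```python
import os
import time


def checkpoint_age_seconds(fsimage_path, now=None):
    """Seconds since the fsimage file was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(fsimage_path)


def checkpoint_is_stale(fsimage_path, max_age_seconds, now=None):
    """True if the image is older than max_age_seconds or cannot be read at all."""
    try:
        return checkpoint_age_seconds(fsimage_path, now) > max_age_seconds
    except OSError:
        # A missing or unreadable image is at least as bad as an old one.
        return True
```

Run from cron with an alert on a True result, this turns a silent failure into a visible one.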

The monitor should also tell you when a machine experiences a memory shortfall, intense I/O pressure, or CPU or network saturation.
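On Linux, the memory and CPU parts of this can be approximated from /proc/meminfo and the load average. The sketch below is only that: the thresholds are arbitrary, the helper names are invented, and I/O and network saturation are not covered.

```python
import os


def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_short(meminfo_text, min_free_fraction=0.1):
    """True if free + buffers + page cache falls below the given fraction of RAM."""
    info = parse_meminfo(meminfo_text)
    usable = info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)
    return usable / info["MemTotal"] < min_free_fraction


def cpu_saturated(max_load_per_core=1.5):
    """True if the 1-minute load average per core exceeds the threshold (Unix only)."""
    load1, _, _ = os.getloadavg()
    return load1 / (os.cpu_count() or 1) > max_load_per_core


# Usage on a Linux node:
# with open("/proc/meminfo") as f:
#     print(memory_short(f.read()), cpu_saturated())
```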

Another very useful thing would be something simple that informs you when the core Hadoop services, or a task being run by the tasktracker, are garbage collecting excessively.

I have had some interesting HDFS problems when the datanodes started going into GC pauses, and without monitoring it would have been very difficult to work out what the problem was.
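As an illustration of the GC check, here is a small sketch that adds up pause times from a JVM GC log. It assumes the daemon was started with plain -verbose:gc and writes lines of the form "[GC 325407K->83000K(776768K), 0.2300771 secs]" or "[Full GC ...]"; more detailed logging options produce nested brackets that this regular expression does not handle. The alert threshold is up to you.

```python
import re

# Matches the simple -verbose:gc line format (an assumption about the JVM flags in use).
GC_LINE = re.compile(r"\[(Full GC|GC) [^\]]*?, (\d+\.\d+) secs\]")


def gc_pause_seconds(lines):
    """Total seconds spent in GC pauses across the given log lines."""
    total = 0.0
    for line in lines:
        for _kind, secs in GC_LINE.findall(line):
            total += float(secs)
    return total


def gc_overhead(lines, window_seconds):
    """Fraction of a wall-clock window spent paused in GC."""
    return gc_pause_seconds(lines) / window_seconds
```

Feeding it the lines logged during the last few minutes and alerting when the fraction goes above some small value gives a crude "garbage collecting excessively" signal.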


A wonderful administration console that presented all of this information, plus summaries of the exception rates from the logs, and that allowed the rapid and reliable deployment and undeployment of Datanode/Tasktracker nodes to a cluster, would be fantastic.
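The exception summary part could start as something as small as the sketch below, which counts fully qualified exception class names in log lines. It assumes the class name appears in the line (as it does in a typical Java stack trace header) and identifies classes purely by an "Exception" or "Error" suffix, so treat the counts as approximate.

```python
import re
from collections import Counter

# A dotted name whose last component ends in "Exception" or "Error".
EXCEPTION_NAME = re.compile(r"\b((?:\w+\.)+\w*(?:Exception|Error))\b")


def exception_counts(lines):
    """Count occurrences of each exception class name in the given log lines."""
    counts = Counter()
    for line in lines:
        counts.update(EXCEPTION_NAME.findall(line))
    return counts


def top_exceptions(lines, n=5):
    """The n most frequent exception classes as (name, count) pairs."""
    return exception_counts(lines).most_common(n)
```

Dividing the counts by the time span of the log window gives a rate that a console could plot per node.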


© 2010   Created by Jason Venner.
