Hadoop Professionals

A Community for Hadoop Users

This network is a place to discuss and learn Hadoop, Solr, Katta and MapReduce, and to share Hadoop resources such as Hadoop books.

Members

  • Andy Nahapetian
  • Jason Venner
  • yigang chen
  • Kim
  • Chad
  • sanjo cole
  • Rodrigo Carvalho Rezende
  • Shevek
  • Anand
  • dachuan huang

Latest Activity

Andy Nahapetian is now a member of Hadoop Professionals
on Friday
Jason Venner added an event
Bay Area Hadoop User Group (HUG) March Meetup at Yahoo Campus Building C, Second Floor, Classroom 5
March 11, 2010 from 6pm to 7pm
Building C, Second Floor, Classroom 5. It's on the same campus; just cross the street and walk past Building D to Building C. 6:00 - 6:20 - Socializing and Beers 6:20 - 6:50 - Preview to the Hadoop Security Release Owen O'Malley, Yahoo! 6:50 - 7:2…
on Friday
You can build your own symbolic link by running a command from java, you just need to verify where the data is unpacked, and then build a link to it. A quick search turned up the following page for sample java code for you: http://www.giannistsakir
on Wednesday
If your data is highly relational, your users will have a simpler time accessing it if it is stored in a more traditional data warehouse. The sizes you are talking about are very small; I have some of the higher end solid state devices for storage,…
on Wednesday
Rodrigo Carvalho Rezende and sanjo cole joined Hadoop Professionals
on Tuesday
So that means I'd need to modify the legacy code, i.e., change the hard coded: "a/relative/path/to/my/file.xml" to: "./mymeta.zip/a/relative/path/to/my/file.xml" Is there a way at all to NOT change the legacy code?
March 8
sanjo cole added a discussion
Hi, I'm working on a data warehouse and am deciding whether to use Hadoop or MySQL. The dataset is currently likely to be no bigger than 40 GB for the first year, then perhaps 80 GB for the next year, and possibly 120 GB the year after. We want to be able…
March 8
Anand updated their profile
March 8
Anand and yigang chen joined Hadoop Professionals
March 8
Where can I find this code bundle? At present, I just want to run some simple examples that involve only a few internal classes; it's not necessary to rebuild the whole Hadoop source project. Thank you.
March 7
If you pass -archives mymeta.zip, there will be a symbolic link named mymeta.zip in the current working directory of the map or reduce task, which points to the directory that the archive was unpacked in. So if you use ./mymeta.zip/path_in_archive/file.xm…
March 7
In the code bundle for the book Pro Hadoop is a full Eclipse environment for running with either Hadoop 0.18.3, 0.19.0 or 0.19.1. At present I typically use that and Maven for building my production code.
March 7
yigang chen added a discussion
Hi, I'm trying to move my legacy data processing code to Hadoop. My issue is that the legacy code relies on the local file system: it both reads and writes metadata. When the code accesses local data it typically uses a relative path, like this: "meta-dir/g…
March 7
I've already used an awkward way to solve this problem: copy the source folders one by one manually into my Eclipse project, and then copy the jars into my Eclipse project. This is not a perfect solution because anytime I have made some modifica…
March 7
shout.
March 6
Based on my understanding and experience, the reducers start before the mappers finish, because there are four stages in "reduce": copy phase, append phase, sort phase, reduce phase. Strictly speaking, the append phase starts after all mappers finish, bu…
March 6

Help With Hadoop

A great place to learn Hadoop and to tune your MapReduce jobs.

Ask specific Hadoop questions here to get help from an expert :)

Forum

yigang chen

Legacy code running in mapper 3 Replies

Started by yigang chen. Last reply by Jason Venner Mar 10.

sanjo cole

hadoop or mysql? 1 Reply

Started by sanjo cole. Last reply by Jason Venner Mar 10.

dachuan huang

how could i setup a hadoop source project? 3 Replies

Started by dachuan huang. Last reply by dachuan huang Mar 8.

Events

Blog Posts

Marc Sturlese

datanode can not connect to the namenode in a small hadoop cluster

Hey there, I have a Hadoop cluster built on 2 servers (2 laptops). One node (A) contains the namenode, a datanode, the jobtracker and a tasktracker.

The other node (B) just has a datanode and a tasktracker.

I set up HDFS correctly with ./start-hdfs.sh

When I try to set up MapReduce with ./start-mapred.sh, the tasktracker of node (B) cannot connect to the namenode. The tasktracker log keeps throwing:

INFO org.apache.hadoop.ipc.Client: Retrying conne…

Posted by Marc Sturlese on February 15, 2010 at 7:00am — 3 Comments

Mark Cejas

seeking advice on word vectors

Hello all,


Hope all is well in the community. I am inquiring about how to apply Hadoop to retrieve information from various blogs, news feeds, etc., in a particular fashion.

I have identified three groups of word pairs that are valuable to me. I would like to explore the clustering patterns among particular URLs of these particular word pairs in their respective blog spaces, news feeds, etc.



So, given that I have an expec

Posted by Mark Cejas on February 13, 2010 at 10:41am — 2 Comments

Mark Cejas

.bashrc file error

Hello all,

I hope that the holidays are going well,
I finally have my graduate school work behind me and have more time to learn about this wonderful Hadoop tool. I work on a Fedora 11 distribution and, after setting my JAVA_HOME and HADOOP_HOME paths, I started to encounter the following error. The error is observed upon switching to the root user as follows:

[rasaan@rasaan ~]$ su
Password:
bash: /root/.bashrc: line 9: unexpected EOF while looking for matching `)'
bash: /root/.bashrc: line 14…

Posted by Mark Cejas on December 31, 2009 at 12:23pm — 1 Comment

Jason Venner

I am giving a talk at the HUG on Wed, scaling search with hadoop, katta and solr

Jason Rutherglen will be providing the in depth lucene/solr pieces.

Hope to see you there.

Posted by Jason Venner on November 17, 2009 at 12:57pm

Yahoo Hadoop Developer Blog

Hadoop Bay Area User Group - March 24th at Yahoo!

Hi Hadoopers,

I'm excited to invite you to the next Hadoop Bay Area User Group which will be held on March 24th, 6PM at the Yahoo! Sunnyvale Campus.

The Hadoop community is growing at an impressive pace, and we had more than 150 attendees at the last meetup.

We invite you to attend whether you are an active submitter, developing Hadoop-based applications or completely new to the Apache Hadoop world. In addition to interesting presentations you will enjoy food, beer and great networking.

We have a diverse plan for this event, comprised of 3 sessions:

Owen O'Malley from the Yahoo! Hadoop Team will provide an overview of the upcoming Hadoop Security release. Owen will describe the features and capabilities included as well as operational benefits. Yahoo! is very excited about adding security capabilities to Hadoop and views this as a major milestone in continuing to make Hadoop an enterprise-grade platform.

Tyson Condie, a Ph.D. student at the University of California, Berkeley, will discuss the innovative research around the Hadoop Online efforts led by Prof. Joseph M. Hellerstein. Tyson will describe a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization. Tyson will include examples from the HOP (Hadoop Online Prototype) project.

Bradford Cross from Flightcaster, a flight delay prediction service, will describe how they built a scalable machine learning platform. The system uses the Clojure dynamic programming language, wrapping Cascading and Hadoop. It is deployed on the Amazon EC2 Cloud. Bradford will describe how the use of Hadoop makes building scalable systems simple.

For those of you who are not able to attend in person, the session's slides and video recording will be posted on this blog after the event. Stay tuned!

Due to the growing demand, we are moving the meetup location to a larger facility (on the same campus): 701 First Av., Sunnyvale, Building C, 2nd floor, Classroom 5.

Please RSVP at the Hadoop Bay Area Meetup page

Looking forward to seeing you all soon.

Dekel Tankel
Director, Product Management
Cloud Computing at Yahoo!


Do you have what it takes to test Hadoop?

In my previous post, "Do you have what it takes to join Yahoo!'s US Hadoop Team?", I talked about a number of technical positions we have open in the Yahoo! Hadoop team. The post was a big success. We received a number of excellent resumes, and ended up hiring a few great engineers.

But today I want to talk about a particular challenge we have: testing Apache Hadoop. Testing distributed systems is a hard job. In addition to testing functionality, one has to deal with scalability, reliability, security, non-determinism, and a number of other tough challenges. The larger the distributed system, the harder it is to test. Yahoo! currently runs Hadoop on clusters of up to 4000 servers, and this is not the limit. These are some of the largest distributed systems in the world.

Also, the way Yahoo! uses Hadoop is changing. Previously, most Hadoop users at Yahoo! were researchers. Researchers are usually hungry for scalability and features, but they are fairly tolerant of failures. Few scientists even know what "SLA" means, and they are not in the habit of counting the number of nines in your uptime. Today, more and more of Yahoo!'s production applications have moved to Hadoop. These mission-critical applications control every aspect of Yahoo!'s operation, from personalizing user experience to optimizing ad placement. They run 24/7, processing many terabytes of data per day. They must not fail.

So we are looking for software engineers who want to help us make sure Hadoop works for Yahoo! and the numerous Hadoop users outside Yahoo!. We are not looking for regular testers. We are looking for outstanding software engineers and architects who are interested in quality and testing. We are looking for people who understand what testing distributed systems is all about, and who know Java and SQL cold. We are looking for people who understand performance, scalability and reliability testing. People who understand how to automate complex use cases and failure scenarios. People who can lead teams and mentor junior staff.

So do you have what it takes to test Hadoop? Please send your resume to hadoop-jobs-2009@yahoo-inc.com.

Mark Tsimelzon
Director of Engineering
Hadoop Team


Yahoo! Cloud Data Platform and Services team is hiring!

The volume of data in the Yahoo! Cloud is growing and so is the Cloud Data Platform and Services team. Do you have a passion to tackle the challenges of a computing and storage system that handles petabytes of data every day? Do you have the skills to build software and systems for monitoring and managing distributed processes on hundreds of thousands of CPU cores?

The Yahoo! Cloud Data Platform and Services team is building platforms and services to enhance the Hadoop distributed computing and storage system. This includes metadata management for business data, monitoring and system management, and platforms for efficient development of massively parallel data processing programs on top of Hadoop.

Here are some of the leading positions we are looking for:

Principal Software Systems Engineer, Cloud Data Platform and Services Grid Monitoring and Management Software
Software development of system management components for the Grid. The system management components and applications include a framework and a tool set to operate, monitor and maintain the health of all Hadoop Grid-based services, spanning Hadoop core components, ancillary services, and Grid applications. Developing this system requires a good understanding of the management of highly distributed and complex data processing systems. Sample components of the system are a high bandwidth transport for system and application events, a storage repository, alerting, correlation algorithms, inference and reporting capabilities using Java and related technologies.
Location: Sunnyvale, CA, USA
See detailed job description here.

Principal Software Systems Engineer
Challenges include designing and building high-performance, distributed and fault-tolerant implementations of data stores that can scale to handle massive amounts of data with reduced operational cost, and meet the requirements of various Yahoo! businesses. You must be a quick learner, have good communication skills, and be able to maintain ownership of large engineering projects through their lifecycle: architecture/design, implementation, testing, post-release maintenance and support. This engineer should be able to write code in Perl, C++/Java and Map-Reduce constructs in addition to excellent communication skills, both written and presentation, to evolve the vision of massively scalable data stores in the Cloud.
Location: Santa Clara, CA, USA
See detailed job description here.

About Cloud Data Platform and Services
Yahoo!’s Cloud Data Platform and Services team builds data systems infrastructure, applications and services using Hadoop and related Cloud technology components. The Cloud Data team handles a majority of data needs at Yahoo! across advertising systems and online properties that serve upwards of 500 million customers. The system is built for scale and low latency. It handles a majority of revenue generated at Yahoo! and integrates with a multitude of systems inside and outside Yahoo!. Data is very important to everything we do at Yahoo!. We are looking for top technical talent to help build our next generation data system by leveraging, innovating and building Cloud based technology platforms and frameworks that can form the Grid based data backbone at Yahoo!.

If you find the above description exciting and you have what it takes to build software, services and platforms for the Cloud, I would be happy to hear from you and explore the possibility of joining our growing team. Please send us details about yourself.

Amir Youssefi
Engineering Manager, Cloud Computing and Data Infrastructure, Yahoo!


Hadoop Bay Area User Group - Feb 17th at Yahoo! - RECAP

Hi Hadoopers,

Thanks everyone for joining us last night at Yahoo!’s Sunnyvale campus. There were more than 150 attendees; the community is growing! It was great to see many new faces and companies/solutions that are basing their business on Hadoop.

For those of you who were unable to attend in person, the session details and slides are posted below.

Owen O'Malley from the Yahoo! Hadoop Team provided an overview of the upcoming Hadoop release plans from Yahoo!.

Kevin Weil, who leads the analytics team at Twitter, provided an overview of Hadoop and Protocol Buffers at Twitter. Kevin outlined the challenges and requirements for storing and analyzing large amounts of data and how Protocol Buffers, Map Reduce and Pig LoadFuncs are used to address the problem.


Comparing Pig Latin and SQL for Constructing Data Processing Pipelines

I have been asked by users who are going to construct a data pipeline whether they should use Pig Latin or SQL.

For those of you who are not familiar with Pig, it is a platform for analyzing large data sets. It is built on Hadoop and provides ease of programming, optimization opportunities and extensibility. Pig Latin is the relational data-flow language and is one of the core aspects of Pig.

In this post, "data pipeline" refers to applications that take data from one or more sources, cleanse it, do some initial transformation on it that all the data readers will need, and then store it in a data warehouse. As SQL is known by almost everyone, it is often chosen as the language in which to write these data pipelines.

We are comparing Pig Latin over Hadoop to SQL over a relational database.

SQL's ubiquity is convenient. However, I believe that Pig Latin is a more natural choice for constructing data pipelines, for several reasons:

  1. Pig Latin is procedural, where SQL is declarative.
  2. Pig Latin allows pipeline developers to decide where to checkpoint data in the pipeline.
  3. Pig Latin allows the developer to select specific operator implementations directly rather than relying on the optimizer.
  4. Pig Latin supports splits in the pipeline.
  5. Pig Latin allows developers to insert their own code almost anywhere in the data pipeline.

I will consider each of these points in turn.

Pig Latin is Procedural

Since Pig Latin is procedural, it fits very naturally in the pipeline paradigm. SQL on the other hand is declarative. Consider, for example, a simple pipeline, where data from sources users and clicks is to be joined and filtered, and then joined to data from a third source geoinfo and aggregated and finally stored into a table ValuableClicksPerDMA. In SQL this could be written as:

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
                select name, ipaddr
                from users join clicks on (users.name = clicks.user)
                where value > 0
            ) as valuable_user_clicks using (ipaddr)
group by dma;

The Pig Latin for this will look like:

Users                = load 'users' as (name, age, ipaddr);
Clicks               = load 'clicks' as (user, url, value);
ValuableClicks       = filter Clicks by value > 0;
UserClicks           = join Users by name, ValuableClicks by user;
Geoinfo              = load 'geoinfo' as (ipaddr, dma);
UserGeo              = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA                = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

Notice how SQL forces the pipeline to be written inside-out, with operations that need to happen first happening in the from clause sub-query. Of course this can be resolved with the use of intermediate or temporary tables. Then the pipeline becomes a disjointed set of SQL queries where ordering is only apparent by looking at a master script (written in some other language) that sews all the SQL together. Also, depending on how the database handles temporary tables, there may be cleanup issues to deal with. In contrast, Pig Latin shows users exactly the data flow, without forcing them to either think inside out or construct a set of temporary tables and manage how those tables are used between different SQL queries.
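To make the data-flow reading concrete, here is a minimal Python sketch of the same pipeline over in-memory lists. It is only an illustration of the step-by-step style, not how Pig executes anything: the sample rows and the naive `join` helper are assumptions made up for this example, and only the field names come from the load statements above.

```python
from collections import Counter

# Hypothetical sample rows; field names follow the load statements above.
users = [
    {"name": "alice", "age": 30, "ipaddr": "10.0.0.1"},
    {"name": "bob", "age": 41, "ipaddr": "10.0.0.2"},
]
clicks = [
    {"user": "alice", "url": "/a", "value": 5},
    {"user": "alice", "url": "/b", "value": 0},
    {"user": "bob", "url": "/c", "value": 2},
]
geoinfo = [
    {"ipaddr": "10.0.0.1", "dma": "SF"},
    {"ipaddr": "10.0.0.2", "dma": "NY"},
]


def join(left, left_key, right, right_key):
    """Naive nested-loop inner join of two lists of dicts (toy data only)."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[left_key] == r[right_key]
    ]


# Each step names its output, in the order the data actually flows.
valuable_clicks = [c for c in clicks if c["value"] > 0]
user_clicks = join(users, "name", valuable_clicks, "user")
user_geo = join(user_clicks, "ipaddr", geoinfo, "ipaddr")
valuable_clicks_per_dma = Counter(row["dma"] for row in user_geo)
```

Read top to bottom, the statements appear in the order the operations happen, which is the property the Pig Latin script has and the nested SQL query lacks.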

The pipeline given above is obviously simple and contrived. It consists of only two very simple steps. In practice, data pipelines at large organizations are often quite complex; if each Pig Latin script spans ten steps, then the number of scripts to manage in source control, code maintenance, and the workflow specification drops by an order of magnitude.

Checkpointing Data

Experienced data pipeline developers will object to the point above about Pig Latin not needing temporary tables. They will note that storing data in between operations has the advantage of checkpointing data in the pipeline. That way, when a failure occurs, the whole pipeline does not have to be rerun. This is true. Pig Latin allows users to store data at any point in the pipeline without disrupting the pipeline execution. The advantage that Pig Latin provides is that pipeline developers decide where appropriate checkpoints are in their pipeline rather than being forced to checkpoint wherever the semantics of SQL imposes it. So, if for the above pipeline there was a need to store data after the second join (UserGeo) and before the group by (ByDMA), the script could be changed to:

Users                = load 'users' as (name, age, ipaddr);
Clicks               = load 'clicks' as (user, url, value);
ValuableClicks       = filter Clicks by value > 0;
UserClicks           = join Users by name, ValuableClicks by user;
Geoinfo              = load 'geoinfo' as (ipaddr, dma);
UserGeo              = join UserClicks by ipaddr, Geoinfo by ipaddr;
store UserGeo into 'UserGeoIntermediate';
ByDMA                = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

This would result in no additional Map Reduce jobs. Pig would store the intermediate data after the second join and continue with the grouping and aggregation as before.
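The idea of a developer-chosen checkpoint can be sketched in a few lines of Python. This is a toy analogy under stated assumptions, not Pig's mechanism: the `checkpoint` and `load_checkpoint` helpers, the JSON-lines file format and the sample rows are all hypothetical.

```python
import json
import os
import tempfile


def checkpoint(rows, path):
    """Write rows to path as JSON lines and hand the same rows back."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    return rows


def load_checkpoint(path):
    """Read rows back from a JSON-lines checkpoint file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


# Hypothetical joined rows standing in for UserGeo.
user_geo = [
    {"name": "alice", "dma": "SF"},
    {"name": "bob", "dma": "NY"},
    {"name": "carol", "dma": "SF"},
]

checkpoint_path = os.path.join(tempfile.mkdtemp(), "UserGeoIntermediate.jsonl")

# The developer decides the checkpoint goes here, between join and grouping.
user_geo = checkpoint(user_geo, checkpoint_path)

# The pipeline carries on with the same in-memory rows.
counts = {}
for row in user_geo:
    counts[row["dma"]] = counts.get(row["dma"], 0) + 1

# After a failure in the grouping step, a rerun could start from the file.
restored = load_checkpoint(checkpoint_path)
```

The point of the sketch is that storing is a side effect placed where the developer wants it; the downstream steps do not change.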

Faith in the Optimizer

By definition, a declarative language allows the developer to specify what must be done, not how it is done. Thus in SQL users can specify that data from two tables must be joined, but not what join implementation to use. Developers are forced to have faith that the optimizer will make the right choice for them. Some databases work around this by allowing hints to be given to the optimizer, but even then the implementation is not required to follow those hints.

While for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm, this is not usually the case for data pipelines. Data flowing through data pipelines does not tend to vary significantly from run to run, in terms either of volume or key distribution. In addition data pipeline developers are usually sophisticated enough to choose the correct algorithm. For these reasons allowing developers to explicitly choose an implementation, and be guaranteed that their choice will be honored, is quite useful in data pipelines.

Pig Latin allows users to specify an implementation or aspects of an implementation to be used in executing a script in several ways. For joins and grouping operations users can specify an implementation to use, and Pig guarantees that it will use that implementation. Currently Pig supports four different join implementations and two grouping implementations. It also allows users to specify parallelism of operations inside a Pig Latin script, and does not require that every operator in the script have the same parallelization factor. This is important because data sizes often grow and shrink as data flows through the pipeline.
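What "the developer picks the implementation and the system honors it" means can be illustrated with a small Python sketch. The two algorithms below are textbook hash and sort-merge joins used as stand-ins; they are not Pig's implementations, and the `join` function with its `using` parameter is a hypothetical interface invented for this example.

```python
def hash_join(left, right, key):
    """Build a lookup table on the right input, then probe it with the left."""
    table = {}
    for r in right:
        table.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


def merge_join(left, right, key):
    """Sort both inputs on the key and walk them in step."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out


JOIN_STRATEGIES = {"hash": hash_join, "merge": merge_join}


def join(left, right, key, using):
    """Run exactly the requested strategy; an unknown name raises KeyError."""
    return JOIN_STRATEGIES[using](left, right, key)


# Hypothetical sample rows.
users = [
    {"name": "alice", "ipaddr": "10.0.0.1"},
    {"name": "bob", "ipaddr": "10.0.0.2"},
    {"name": "carol", "ipaddr": "10.0.0.1"},
]
geoinfo = [
    {"ipaddr": "10.0.0.1", "dma": "SF"},
    {"ipaddr": "10.0.0.3", "dma": "LA"},
]

by_hash = join(users, geoinfo, "ipaddr", using="hash")
by_merge = join(users, geoinfo, "ipaddr", using="merge")
```

Both strategies return the same rows; what differs is cost, which depends on data size and key distribution. Because there is no silent fallback, the sketch mirrors the guarantee described above: the caller's choice is either used or rejected outright.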

Splits in Pipelines

Another common feature of data pipelines is that they are often graphs (DAGs) and not linear pipelines. SQL, however, is oriented around queries that produce a single result. Thus SQL handles trees (such as joins) naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A very common use case we have seen at Yahoo! is a desire to read one data set in a pipeline and group it by multiple different grouping keys and store each as separate output. Since disk reads and writes (both scan time and intermediate results) usually dominate processing of large data sets, reducing the number of times data must be written to and read from disk is crucial to good performance.

Take for example a user data set where there is a desire to analyze the data set both in geographic and demographic dimensions. The Pig Latin to do this analysis looks like:

Users         = load 'users' as (name, age, gender, zip);
Purchases     = load 'purchases' as (user, purchase_price);
UserPurchases = join Users by name, Purchases by user;
GeoGroup      = group UserPurchases by zip;
GeoPurchase   = foreach GeoGroup generate group, SUM(UserPurchases.purchase_price) as sum;
ValuableGeos  = filter GeoPurchase by sum > 1000000;
store ValuableGeos into 'byzip';
DemoGroup     = group UserPurchases by (age, gender);
DemoPurchases = foreach DemoGroup generate group, SUM(UserPurchases.purchase_price) as sum;
ValuableDemos = filter DemoPurchases by sum > 100000000;
store ValuableDemos into 'byagegender';

This Pig Latin script describes a DAG rather than a pipeline. It starts with two inputs which are brought into one stream (via join) which is then split into two streams. Pig will do this in two Map Reduce jobs (one for the join and one for both group bys and their filters) rather than requiring that the join be either run twice or materialized as an intermediate result as traditional SQL would.
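The saving from a split comes from scanning the joined data once and feeding both groupings from that single pass. A minimal Python sketch of that idea follows; the sample rows are hypothetical and the thresholds are scaled down to suit them, so only the shape of the computation mirrors the script above.

```python
from collections import defaultdict

# Hypothetical joined rows standing in for UserPurchases.
user_purchases = [
    {"name": "alice", "age": 30, "gender": "f", "zip": "94089", "purchase_price": 40},
    {"name": "bob", "age": 41, "gender": "m", "zip": "94089", "purchase_price": 70},
    {"name": "carol", "age": 30, "gender": "f", "zip": "10001", "purchase_price": 25},
]

by_zip = defaultdict(int)
by_age_gender = defaultdict(int)

# A single scan of the joined data feeds both branches of the split.
for row in user_purchases:
    by_zip[row["zip"]] += row["purchase_price"]
    by_age_gender[(row["age"], row["gender"])] += row["purchase_price"]

# Thresholds are scaled down to suit the toy data.
valuable_geos = {z: total for z, total in by_zip.items() if total > 100}
valuable_demos = {d: total for d, total in by_age_gender.items() if total > 60}
```

Running the two groupings as separate queries would read `user_purchases` twice, or would require materializing it first, which is the cost the article attributes to the SQL approach.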

Inserting Developer Code

Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. This is accomplished through user defined functions (UDFs) and streaming. UDFs allow users to specify how data is loaded, how it is stored, and how it is processed. Streaming allows users to include executables at any point in the data flow.

Allowing developers to specify how data is loaded is useful because in most data pipelines data sources are not database tables. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin. There are many ETL tools on the market to handle this import process for databases. Pig allows developers to write a function in Java to read data directly from the source. This eliminates the need for a second tool which must be purchased, learned, and used and allows the data pipeline to combine the loading and initial cleansing and transformation steps.
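As a rough illustration of combining loading with initial cleansing, the Python sketch below parses raw tab-separated click records and drops malformed ones in the same pass. It is an analogy only: Pig load functions are written in Java, and the `load_clicks` name, the record layout and the cleansing rules here are assumptions for the example.

```python
def load_clicks(lines):
    """Yield (user, url, value) tuples from raw tab-separated lines.

    Malformed lines are skipped, so loading and initial cleansing
    happen in one pass over the source.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # wrong number of columns
        user, url, raw_value = fields
        try:
            value = int(raw_value)
        except ValueError:
            continue  # value column is not an integer
        yield user.strip().lower(), url, value


# Hypothetical raw input, as it might arrive from a log file.
raw = [
    "Alice\t/a\t5\n",
    "broken line\n",
    "bob\t/c\tnot-a-number\n",
    "bob\t/c\t2\n",
]
clicks = list(load_clicks(raw))
```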

Pipelines also often include user defined column transformation functions and user defined aggregations. Pig Latin supports writing both of these types of functions in Java. We plan to extend that to a number of scripting languages in the near future, thus enabling users to easily write UDFs in the language of their choice.

If the user defined code will not fit well into a UDF, streaming allows pipelines to place an executable in the pipeline at any point. This can also be used to include legacy functionality that cannot be modified.
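A streamed executable is, in essence, a program that reads records from standard input and writes records to standard output. The Python sketch below shows that contract with a simple filter; the function name, the tab-separated record layout and the filtering rule are hypothetical, and in-memory streams stand in for stdin and stdout so the example runs on its own.

```python
import io


def stream_filter(source, sink):
    """Copy tab-separated records whose last field is a positive integer."""
    for line in source:
        fields = line.rstrip("\n").split("\t")
        if fields[-1].isdigit() and int(fields[-1]) > 0:
            sink.write("\t".join(fields) + "\n")


# In a real streaming step the arguments would be sys.stdin and sys.stdout;
# StringIO objects are used here so the sketch is self-contained.
source = io.StringIO("alice\t/a\t5\nalice\t/b\t0\nbob\t/c\t2\n")
sink = io.StringIO()
stream_filter(source, sink)
output = sink.getvalue()
```

Because the contract is only "lines in, lines out", the same approach works for a legacy binary that cannot be modified, as long as it reads and writes records in an agreed text format.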

To conclude, I hope you will agree with me that these advantages of an intuitive, procedural programming model, control of where data is checkpointed in the pipeline, the ability to completely control how data is processed, support for general DAGs, and the ability to include user code wherever necessary make Pig Latin a better choice for developing data pipelines on Hadoop.

Alan Gates, Architect
Pig Development Team, Yahoo!

Cloudera Hadoop Blog

Cloudera speaks VMware vCloud API, too.

We’ve announced, with VMware, the ability to use third-party vCloud Express service providers and the vCloud API to run Cloudera’s Distribution for Hadoop. We think this is interesting; as cloud services proliferate, it’s important to be able to move easily among public and private clouds. vCloud makes that easier and VMware is working hard to [...]

Hadoop World: Building Data Intensive Apps with Hadoop and EC2

Today’s Hadoop World Talk comes from Pete Skomoroch, and dives into detail about how he built TrendingTopics.org using Hadoop and EC2.

Hadoop World: Making Hadoop Easy on Amazon Web Services

Today’s Hadoop World talk comes from Peter Sirota, who leads Amazon Web Services’ Elastic MapReduce team. In this talk, Peter provides more detail on the platform, shares some new features, and shows how the AWS community, from customers to developers, is making things easier with Hadoop.

Hadoop World: Hadoop Applications at Yahoo!

Today’s Hadoop World talk comes from Eric Baldeschwieler, Yahoo!’s VP of Hadoop Development. In this talk, Eric highlights Yahoo’s contributions to development and testing of Hadoop at scale, and goes into detail about how Yahoo! uses Hadoop to deliver several popular services. A major thanks to Eric, and everyone else at Yahoo! for their [...]

7 Tips for Improving MapReduce Performance

One service that Cloudera provides for our customers is help with tuning and optimizing MapReduce jobs. Since MapReduce and HDFS are complex distributed systems that run arbitrary user code, there’s no hard and fast set of rules to achieve optimal performance; instead, I tend to think of tuning a cluster or job much like a [...]
 
 


© 2010   Created by Jason Venner on Ning.
