A Community for Hadoop Users
This network is a place to discuss and learn about Hadoop, Solr, Katta, and MapReduce, and to share Hadoop resources such as books.
hadoop. To log in as root, the password is root.hadoopYahoo1234. Last but not least, I would like to thank Devaraj Das and Jianyong Dai from the Yahoo! Hadoop & Pig Development team for their help in setting up and configuring Hadoop 0.20.S and Pig respectively.
Notice:
Yahoo! does not offer any support for the Hadoop Virtual Machine.
The software includes cryptographic software that is subject to U.S. export control laws and applicable export and import laws of other countries. BEFORE using any software made available from this site, it is your responsibility to understand and comply with these laws. This software is being exported in accordance with the Export Administration Regulations. As of June 2009, you are prohibited from exporting and re-exporting this software to Cuba, Iran, North Korea, Sudan, Syria and any other countries specified by regulatory update to the U.S. export control laws and regulations. Diversion contrary to U.S. law is prohibited.
Suhas Gogate
Technical Yahoo!, Cloud Solutions Team, Yahoo!
This is the beginning of an ongoing series of blog posts on “Managing Big Data”. This series will focus on techniques that Yahoo uses to process large volumes of data, ranging from initial collection of data to the end usage of that data.
Over the last several years, two important trends have emerged that require additional thought when putting together an architecture for a hosted service. At Yahoo!, the ability to analyze and process enormous amounts of data is increasingly important. It’s a foundational layer for improving our consumer experiences and for sharing audience insights with advertisers.
From a technology perspective, the two trends I'd like to focus on are:
1. Batch processing -- the increasing awareness of batch processing and the recent uptick in use of the map/reduce paradigm for that purpose.
2. NoSQL stores -- the rise of so-called "NoSQL" stores and their use to serve up data to online users (typically inside of the user's request/response cycle).
Both of these trends represent significant advances in the way that hosted systems are developed. But in order to derive the most value for an entire system, developers must think about how these two areas will work together in some holistic manner.
Let's look at a specific scenario to make this more concrete:
Let's assume for the moment that you're building a new e-commerce site. And let's assume that one of the significant features of this e-commerce site is to provide user recommendations for which items a user may be interested in purchasing. This overall feature will decompose into a batch component (to determine the recommendations to give to a user) and an online component (that will present the recommendations to the end user).
For the batch system we may choose to use a map/reduce framework like Hadoop. The batch recommendation component will rely on various types of raw event data. Some examples include the products the end user has viewed, the products they have purchased, and the types of searches they have performed. We will create a model using this data, perhaps based on a simple behavior model (e.g. if the user looks at a significant amount of sports equipment, then recommend products in the sports equipment category) or a collaborative filtering model (e.g. if other users purchased the same products as this end user, recommend other products that they purchased).
Once we have decided on the data inputs and the model for making recommendations, we'll likely produce the output of the batch processing as a set of recommendations for each user on a Hadoop cluster.
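To make the batch side more concrete, here is a minimal sketch of the simple behavior model in a map/reduce shape, written in plain Python rather than against the Hadoop API. The event tuples, the function names and the "most frequent category wins" rule are all assumptions made for illustration; they are not taken from a real system.

```python
from collections import Counter, defaultdict

# Hypothetical raw events: (user_id, event_type, product_category).
EVENTS = [
    ("u1", "view", "sports"),
    ("u1", "view", "sports"),
    ("u1", "purchase", "books"),
    ("u2", "search", "garden"),
    ("u2", "view", "garden"),
    ("u2", "view", "sports"),
]


def map_event(event):
    """Map step: emit a (user_id, category) pair for every raw event."""
    user_id, _event_type, category = event
    yield user_id, category


def reduce_user(user_id, categories, top_n=1):
    """Reduce step: recommend the categories the user touched most often."""
    counts = Counter(categories)
    # Sort by descending count, then by name, so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return user_id, [category for category, _ in ranked[:top_n]]


def run_batch(events, top_n=1):
    """Group the mapped pairs by user (the shuffle) and reduce each group."""
    grouped = defaultdict(list)
    for event in events:
        for user_id, category in map_event(event):
            grouped[user_id].append(category)
    return dict(reduce_user(user, cats, top_n) for user, cats in grouped.items())


recommendations = run_batch(EVENTS)
```

The result is one list of recommended categories per user, which is the shape of output the batch step hands to the online store.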
There are two approaches we can then take to make the batch data available online in a NoSQL store.
In the first approach, we recreate the complete set of recommendations for all users as part of the batch processing. It may also be possible to create a native version of the online store format in the batch system. This is generally only possible if there is no scenario where the data (user recommendations for products in this case) is updated in the online store. There must also exist a library to write into the online store's file format (not all NoSQL stores have such a library).
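One way to picture rebuilding everything in batch and swapping it in is the sketch below. A JSON file stands in for the online store's native file format, and publish_snapshot and load_snapshot are hypothetical names; a real NoSQL store would need its own writer library, as noted above.

```python
import json
import os
import tempfile


def publish_snapshot(recommendations, live_path):
    """Write a complete new snapshot, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(live_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(recommendations, handle, sort_keys=True)
    # os.replace is atomic when both paths are on the same filesystem, so a
    # reader sees either the old snapshot or the new one, never a partial file.
    os.replace(tmp_path, live_path)


def load_snapshot(live_path):
    """Read the snapshot that is currently live."""
    with open(live_path) as handle:
        return json.load(handle)
```

Because the online copy is never modified in place, this only works under the condition stated above: nothing else updates the data in the online store between batch runs.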
This approach is attractive for a couple of reasons:
Taking an incremental update approach between your offline and online store entails creating a new set of incremental data changes in your batch system, for instance a new set of user recommendations for newly joined users or for users that have had some recent activity (which would change the recommendations). This incremental data must then be pushed to the online store.
This approach is attractive for a couple of reasons:
One weakness of the incremental update approach is that it can have a performance impact on your online store. When you apply updates to the online store, you may therefore need to throttle them.
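A rough sketch of the incremental path with throttling follows. The online store is modeled as a plain dict, and incremental_changes, apply_throttled and the max_per_second parameter are assumptions for illustration; a production system would more likely throttle on observed store latency than on a fixed rate.

```python
import time


def incremental_changes(previous, current):
    """Return only the users whose recommendations are new or have changed."""
    return {user: recs for user, recs in current.items()
            if previous.get(user) != recs}


def apply_throttled(changes, store, max_per_second, sleep=time.sleep):
    """Write changes into the store, pausing after each write to cap the rate."""
    interval = 1.0 / max_per_second
    for user, recs in sorted(changes.items()):
        store[user] = recs
        # The sleep function is injectable so the pacing can be tested
        # without actually waiting.
        sleep(interval)
```

Only the changed users are written, so the load on the online store scales with the amount of recent activity rather than with the total number of users.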
In a future post I'll also address taking data that's available in an online NoSQL store and making that data available in a batch system like Hadoop.
Dirk Reinshagen
Cloud Architect
Cloud Computing at Yahoo!
At a recent Hadoop User Group meeting, I made a presentation on how we leverage hadoop for spam mitigation in Yahoo! Mail. A number of people followed up requesting additional details of our architecture and engineering strategy.
In this post, I am going to try and capture our antispam engineering story, how it came to be hadoop centric and how well the new architecture has worked. I will also highlight the results we have been able to achieve. Finally, I will provide an update on when we will be releasing these updates to wide production.
At the Hadoop User Group presentation, I delved into the details of two interesting antispam algorithms. The first was "frequent itemset mining"; the second was what we called the "connected components" algorithm. Both of these algorithms are implemented as part of our tools portfolio. They are used by engineers, product managers and operations analysts to get a compact summary of the major trends in spam. Both tools were implemented as part of a new engineering strategy we put in place in the second quarter of last year.
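For readers unfamiliar with the second algorithm, here is a generic single-machine sketch of connected components using union-find. It is not the Yahoo! Mail implementation, which runs on Hadoop and is not described in this post; the choice of sender addresses and IPs as graph nodes is purely an assumption for illustration.

```python
def connected_components(edges):
    """Group nodes into connected components with a union-find structure."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Union the two endpoints of every edge.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each component under its root.
    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in components.values())


# Hypothetical edges linking a sender address to an IP it was seen on.
EDGES = [
    ("a@x.com", "1.1.1.1"),
    ("b@y.com", "1.1.1.1"),
    ("c@z.com", "2.2.2.2"),
]
clusters = connected_components(EDGES)
```

Each resulting cluster groups entities that are linked directly or indirectly, which is one way to get a compact summary out of a large number of individual observations.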
Our new strategy called for pointed improvements in the ability of systems to digest massive amounts of data. The first portion of the strategy, implemented by the end of 2009, targeted our reporting systems, tool chains and existing abuse reputation algorithms. The proposal was to increase the granularity of the data being handled, shorten the response time to detect an attack and do more early detection of spam attacks. In our analysis, we quickly realized that even the small changes we were proposing to our reporting systems, tools and algorithms required us to scale our existing systems well beyond the limits they were designed for.
Also, we found that engineering, product and customer advocacy teams were all hungry for data and it would be great to support additional requirements around ad hoc joins across data streams and support a general "slice" and "dice" approach to data engineering. Our first revisions to the sender and content classifiers also made it abundantly clear that we needed massive storage and massive compute.
We took the simple approach of putting ALL the data that we would possibly want to query or develop algorithms on, on a hadoop grid and let the grid scale to the storage and compute requirements. To give you some idea of the scale involved here, let me provide some ball park metrics.
By the end of this quarter, we will be loading close to 4TB of antispam data on our hadoop grids every day and we will be querying several days of data for report generation and running automated classifiers and algorithms at a frequency of a few minutes. We have not run into any scalability problems so far. In general we have found that with proper data organization, hadoop is able to scale linearly with data and compute requirements.
I will complete this section by saying that this strategy has had a huge impact on spam complaints. See for yourself; I am enclosing the graph of our spam complaints from last year. The big dip corresponds to when we shifted our reports, algorithms and filters to the hadoop grid. Need I say more?

While making changes to how an existing system works is interesting and clearly the first step, the second and more interesting step is the development of brand new, distributed reputation algorithms using hadoop. Once again, our new engineering strategy called for the rewiring of all algorithms to run in parallel and for an increased level of feature engineering across heuristic, statistical and machine learning systems. We needed to do this across reputation algorithms for IPs, domains, from addresses (senders), receivers (users) and content. Once again, we realized that much of the complexity was in massive data engineering. We needed to ensure that we used every bit of data that would help us make a spam/notspam decision. We also had to choose an appropriate model that could interpret this vast amount of data without getting overwhelmed.
The term "massive feature engineering" should be familiar to people in the area of machine learning. In more common engineering terms, we needed to associate several pieces of metadata with every entity we needed to classify, and we needed to choose algorithms that would parallelize well. We have been hard at work for the last 5 months doing this new "hadoop engineering". By the end of this month, we hope to release our first hadoop based, massively feature engineered, distributed sender classification algorithm. Code named zeroB, it has proven in our initial tests to be a very compelling replacement for our current sender management system. It is 25% more accurate while being faster and cheaper to run and maintain than the current version.
Indeed, I have now come to believe that Hadoop has tremendous applicability to the abuse and security domains as a whole. Both of these domains have the proverbial problem of finding the needle in the haystack, and Hadoop is well equipped for this task. With the amount of spam that large mail systems like Yahoo! see, it is truly important to employ powerful frameworks like Hadoop to ensure the problem remains tractable.
Yahoo! Mail was recently voted by the renowned Fraunhofer Institute as the best free mail service for spam management. This recognition clearly demonstrates that our new hadoop based strategy is working and working very well. This is just the tip of the success though. In the next 3 months, we are rolling out many of these new systems to our wide install base in the United States and I am eagerly waiting to see the effect this has on spam. It has been fun and exciting building these systems on top of hadoop but it has been even more exciting to see us winning the war on spam.
Join us next week for the Yahoo! keynote at Hadoop Summit 2010 to hear more about Hadoop and Mail Antispam.
Vishwanath Ramarao
Director of Anti-Spam Engineering
Yahoo! Mail
At Yahoo!, the ability to analyze and process enormous amounts of data is increasingly important. It’s a foundational layer for improving our consumer experiences and for sharing audience insights with advertisers.
In the last few years, I have been a part of a project to design, build, and run a low-latency, large-scale, distributed event data collection system at Yahoo!. When we started off, the goal seemed relatively unambitious: to collect web-access event data across all of the web-servers across all the data centers and bring it to a central location for processing. This perception soon changed after we realized that this involved around 20,000 machines and over 20 data centers across the world, amounting to over 40 billion events per day that filled up over 10 TB of disk space. To add to the mix, the data had to be available within 15 minutes with an expected completeness of 99% across trans-oceanic fiber optic cable.
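A quick back-of-the-envelope calculation puts these figures in perspective. The constants below treat the quoted numbers as exact and take 1 TB as 10^12 bytes, which are both simplifying assumptions, since the post only gives approximate lower bounds.

```python
# Figures quoted in the post, treated as exact for a rough estimate.
EVENTS_PER_DAY = 40_000_000_000
BYTES_PER_DAY = 10 * 10**12       # 10 TB, assuming 1 TB = 10^12 bytes
MACHINES = 20_000
SECONDS_PER_DAY = 24 * 60 * 60

events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
bytes_per_event = BYTES_PER_DAY / EVENTS_PER_DAY
events_per_machine_per_second = events_per_second / MACHINES

print(f"{events_per_second:,.0f} events/s overall")
print(f"{bytes_per_event:.0f} bytes per event on average")
print(f"{events_per_machine_per_second:.1f} events/s per machine")
```

Under those assumptions the system averages roughly 460,000 events per second overall at about 250 bytes per event, or a little over 20 events per second per machine.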
We decided to collect the data in a streaming fashion. This enabled us to feed the data at very low latencies to stream processing applications. However, there was an existing batch processing application that required all the data for the entire day to be available with near 100% completeness.
In order to achieve this, the data was collected in a streaming fashion and put into files that contained events belonging to a particular time period, the default being a minute; hence they were called minute files. Once the data was collected for the minute, the minute files were closed and the data was made available to the consuming application using a queue. Now, this worked reasonably well when there was one application but had problems when the consuming application wanted to reprocess or perform partial updates. In addition to this, the queue essentially kept the consuming application's state, making the collection and processing systems tightly coupled. This made it increasingly hard to support multiple applications, because the state for each batch for each application needed to be kept.
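The bucketing of a stream into minute files can be sketched as follows. The events are modeled as (epoch seconds, payload) pairs and the files as in-memory lists; both are assumptions for illustration, as are the function names.

```python
from collections import defaultdict


def minute_bucket(epoch_seconds, period=60):
    """Return the start of the time period that contains the event."""
    return epoch_seconds - (epoch_seconds % period)


def group_into_minute_files(events, period=60):
    """Map each period start to the payloads of the events collected in it."""
    files = defaultdict(list)
    for timestamp, payload in events:
        files[minute_bucket(timestamp, period)].append(payload)
    return dict(files)


# Hypothetical stream of (epoch_seconds, payload) events.
STREAM = [(1234388520, "a"), (1234388559, "b"), (1234388580, "c")]
minute_files = group_into_minute_files(STREAM)
```

The first two events fall in the same minute and share a file, while the third starts the next one. The period parameter reflects that a minute was only the default.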
This became even more interesting when the Hadoop initiative at Yahoo! began. All batch processing applications were now running on the grid while the data was still consumed by legacy applications. How would we feed the old and the new systems with the same data without duplication?
What we needed was a loosely coupled or completely decoupled method of communicating the files to be processed to downstream batch processing applications. The solution we came up with was a simple but elegant one called a List of Files (LoF) repository.
The LoF repository contains an entry for each minute file collected and its associated attributes such as the start time, the end time, the size in bytes, the number of events, the collection pipeline instance name, and other relevant data. An API to access this data, called the LoF API, was provided to query the repository for a set of files that satisfied certain attribute constraints. For example, a query might request “all the files that belong to the period 12:00 to 12:05 collected from the sports web servers”. The repository did not need to keep state of which files an application had processed or maintain any queue. This allowed each application to maintain its own state, and supporting multiple applications was as simple as the single application case. To simplify usage, the API was made available in the form of a RESTful web-service.
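The idea of a stateless, attribute-based lookup can be sketched with an in-memory stand-in. The class and field names below are hypothetical and only mirror the attributes listed above; the real LoF API was a web-service whose interface is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinuteFile:
    path: str
    start_time: int    # epoch seconds, inclusive
    end_time: int      # epoch seconds, exclusive
    size_bytes: int
    event_count: int
    pipeline: str      # collection pipeline instance name


class LoFRepository:
    """Records files only; it keeps no per-consumer state and no queue."""

    def __init__(self):
        self._files = []

    def register(self, minute_file):
        self._files.append(minute_file)

    def query(self, start, end, pipeline=None):
        """Return files that lie inside [start, end), optionally for one pipeline."""
        return [f for f in self._files
                if f.start_time >= start and f.end_time <= end
                and (pipeline is None or f.pipeline == pipeline)]


repo = LoFRepository()
repo.register(MinuteFile("/sports/0000.gz", 0, 60, 1234, 10, "sports"))
repo.register(MinuteFile("/news/0000.gz", 0, 60, 3232, 25, "news"))
repo.register(MinuteFile("/sports/0005.gz", 300, 360, 999, 8, "sports"))

# "All files for the first five minutes collected from the sports servers."
sports_files = repo.query(0, 300, pipeline="sports")
```

Since the query is a pure function of the attributes, any number of applications can ask the same question, or repeat it to reprocess data, without the repository tracking who has read what.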

Different applications had different completeness requirements from the data collected. For example, a low-latency behavioral targeting application would typically be happy with 95% of the data within 1 minute of the data, while a revenue realization or tracking application would need 100% of the data within 15 minutes. In order to support this, the API returned a completeness metric along with the list of files returned to indicate the percentage of data the list represented. The application could use this information to commence processing based on its own completeness requirements.
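How an application might act on the completeness metric can be sketched as below. The post does not say how the metric was computed; the ratio of received to expected minute files used here is an assumption, as are the function names.

```python
def completeness(received_files, expected_files):
    """Percentage of the expected minute files that have arrived (assumed formula)."""
    if expected_files == 0:
        return 100.0
    return 100.0 * received_files / expected_files


def ready_to_process(received_files, expected_files, required_percent):
    """Let each application apply its own completeness threshold."""
    return completeness(received_files, expected_files) >= required_percent


# With 19 of 20 expected files present, a targeting job that accepts 95%
# can start, while a revenue job that needs 100% keeps waiting.
targeting_can_start = ready_to_process(19, 20, required_percent=95)
revenue_can_start = ready_to_process(19, 20, required_percent=100)
```

The decision stays with the consumer: the same response serves both applications, and each one applies its own threshold.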
Given the distributed nature of the web-servers, data was often delayed or unavailable due to network outages or temporary host unavailability. This meant that applications requiring higher levels of completeness were routinely delayed beyond their SLAs. To solve this we provided a simple timestamp based cursor facility to enable incremental processing. The cursor was returned with the list to indicate the timestamp at which the list was generated. The subsequent query would supply the previously returned cursor to indicate the time of the last fetch, and would return all the files later than that timestamp.
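The cursor mechanism can be sketched with a small in-memory model. The class name, the registered_at field and the explicit now argument are assumptions for illustration; the point is only that the caller, not the repository, holds the cursor.

```python
class CursorRepository:
    """In-memory stand-in for a file list that supports incremental fetches."""

    def __init__(self):
        self._entries = []  # (registered_at, path) pairs

    def register(self, path, registered_at):
        self._entries.append((registered_at, path))

    def fetch(self, now, cursor=None):
        """Return the paths registered after the cursor and up to now, plus a new cursor."""
        paths = [path for registered_at, path in self._entries
                 if (cursor is None or registered_at > cursor)
                 and registered_at <= now]
        return paths, now


repo = CursorRepository()
repo.register("/col1/1200.gz", registered_at=100)
repo.register("/col2/1200.gz", registered_at=110)
first_batch, cursor = repo.fetch(now=120)

# A delayed file shows up later; passing the cursor back returns only that file.
repo.register("/col3/1200.gz", registered_at=130)
second_batch, cursor = repo.fetch(now=140, cursor=cursor)
```

An application can therefore start work on what is available and pick up late arrivals in later fetches instead of waiting for the full set.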
This is what the web-service request looks like:
The response to this is of the form:
collector1.yahoo.com 1234388520 1234 /col1/1200.gz
collector2.yahoo.com 1234388520 3232 /col2/1200.gz
A subsequent request to get incremental data would use the cursor timestamp returned to fetch additional files as follows:
which would get a response similar to:
collector1.yahoo.com 1234388580 3232 /col1/1201.gz
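A consumer might parse response lines of this form as sketched below. The post does not label the columns, so the field names are assumptions: the first is taken to be the collector host, the second an epoch timestamp, the fourth the file path, and the third is kept as an opaque integer because its meaning is not stated.

```python
from typing import NamedTuple


class LoFEntry(NamedTuple):
    host: str
    timestamp: int
    third_field: int   # meaning not stated in the post; kept as an opaque integer
    path: str


def parse_response(text):
    """Parse whitespace-separated response lines into LoFEntry records."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        host, timestamp, third_field, path = line.split()
        entries.append(LoFEntry(host, int(timestamp), int(third_field), path))
    return entries


RESPONSE = """\
collector1.yahoo.com 1234388520 1234 /col1/1200.gz
collector2.yahoo.com 1234388520 3232 /col2/1200.gz
"""
entries = parse_response(RESPONSE)
```

From here the application can hand the paths to its batch job and keep the newest timestamp it has seen as its own state.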
Akon Dey
Architect, Event Data Collection System at Yahoo!
I’m happy to share the agenda for the upcoming Hadoop Summit – June 29th, Hyatt, Santa Clara.
We received over 70 great submissions for talks. It was a very impressive combination of development tool overviews, application case studies and innovative research.
We had the difficult task of selecting just a handful of presentations from this overwhelming collection of great quality abstracts and speakers. The variety of topics across numerous industry verticals served as clear evidence of how far this technology has evolved over the past year. Hadoop is really going mainstream!
Our goal was to create a diverse agenda that covers topics for experienced Hadoop users as well as people who recently began to explore this technology. We wanted to focus on the Hadoop eco-system of tools and solutions as well as “real life” user experience.
I want to thank all the people who submitted talks and encourage speakers that were not selected to submit their great presentations to our monthly Hadoop User Groups.
Detailed agenda and abstracts available at http://www.hadoopsummit.org/agenda.html
At Yahoo!, we are embracing Hadoop at the very core of our business. As you will hear at the summit, we are continuing to invest heavily in both the technology and the community to make it even better. We love being at the center of the discussion and debates around Hadoop and learning from others’ experiences.
We are planning to go bigger next year, with a broader event that will allow more opportunities for speakers as well as sponsors.
We hope you all join us at the summit. If you have not registered yet, please REGISTER TODAY! Space is limited and we don’t want you to miss the opportunity to see the great variety of talks outlined above first-hand.
Special thanks to our Sponsors:
Dekel Tankel
Director, Product Management
Cloud Computing at Yahoo!
© 2010 Created by Jason Venner.