A Community for Hadoop Users
Introduction
Hadoop is rapidly becoming the technology of choice for enterprises that need to effectively collect, store and process large amounts of structured and complex data.
The Pentaho BI project is open source application software that provides enterprise reporting, analysis, dashboards, data mining, workflow and ETL capabilities for business intelligence needs. Business Intelligence is the process of increasing the competitive advantage of a business through intelligent use of available data in decision making. The Pentaho Enterprise BI Suite delivers a unified visual design environment for ETL, report design, analytics and dashboards, providing an enterprise-friendly environment for using Apache Hadoop.
Advantages of Pentaho Data Integration with Hadoop
SQL querying through Hive, together with a rich graphical interface for moving data files into and out of HDFS.
Graphical design of new MapReduce (MR) jobs, taking advantage of a vast library of pre-built mapping and data transformation steps.
Extraction of data from HDFS and loading into an external database through Hive.
Steps involved in Integration
On the technical side of the integration process, Pentaho assumes the data is stored in a data lake, which can be queried through Hive or accessed ad hoc.
A file is extracted from the data lake through a secured FTP site.
Make sure that this file does not already exist in HDFS (Hadoop Distributed File System), then copy the file into HDFS.
Hadoop maintains several copies of the file through its replication feature, which protects against data loss and supports efficient processing and analysis.
The file is then processed or analyzed using the MR (MapReduce) technique.
Hive, a data warehouse system, provides tools for easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files.
Queries written in QL, the query language provided by Hive, are executed as MR jobs and can be controlled by Hadoop configuration variables.
The processed data is passed through Pentaho Data Integration (PDI) to perform BI that meets business needs. Metadata is maintained at all stages of integration to avoid data loss.
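The check-then-copy step described above can be sketched in Python roughly as follows. This is a minimal sketch, not Pentaho's own implementation: it assumes the `hadoop` command-line client is on the PATH, and the function name `copy_if_absent` and the example paths are hypothetical.

```python
import subprocess


def copy_if_absent(local_path, hdfs_path, run=subprocess.call):
    """Copy local_path into HDFS unless hdfs_path already exists.

    `run` takes an argument list and returns the process exit code. It is
    a parameter so the logic can be exercised without a real cluster.
    Returns True if the file was copied, False if it was already present.
    """
    # `hadoop fs -test -e PATH` exits with 0 when PATH exists in HDFS.
    if run(["hadoop", "fs", "-test", "-e", hdfs_path]) == 0:
        return False
    # `hadoop fs -put SRC DST` uploads the local file into HDFS.
    if run(["hadoop", "fs", "-put", local_path, hdfs_path]) != 0:
        raise RuntimeError("upload to HDFS failed: " + hdfs_path)
    return True
```

Called as `copy_if_absent("/tmp/sales.csv", "/data/sales.csv")` (hypothetical paths), it returns True on the first upload and False once the file is already in HDFS. Replication of the copied file is then handled by HDFS itself.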
Data Movement
A few aspects deserve a closer look, such as moving structured data between HDFS and a database when MR needs to run against database data. Data volumes are growing exponentially, integration with the database has been a tremendous improvement for Business Intelligence, and Hadoop plays a vital role in it. Still, a few points need to be highlighted.
External tables present data stored in a file system in a table format and can be used in SQL queries transparently. External tables can be used to access data stored in HDFS from inside the Oracle database. However, HDFS is not directly accessible through normal operating system requests, so an intermediate layer is needed to ease data movement in and out of HDFS. Although Apache Hive helps in retrieving the data as and when needed, let us look at the following two concepts:
Use of Coherence and Distributed Cache
Making a large flat file of the database.
1. Coherence & Distributed Cache
Coherence provides replicated and distributed (partitioned) data management and caching services on top of a reliable, highly scalable peer-to-peer clustering protocol. Coherence has no single point of failure, and it automatically redistributes its clustered data management services when a server becomes inoperative or is disconnected from the network. When a new server is added, or when a failed server is restarted, it automatically joins the cluster and Coherence redistributes the cluster load to it. Coherence also includes network-level fault tolerance features.
Hadoop provides a Distributed Cache mechanism that applies a similar idea of distributing data across the cluster to support the tasks submitted by the client. It provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node only once per job.
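The "once per job, per node" behaviour can be pictured with a small toy model in Python. This is only an illustration of the idea, not Hadoop's Distributed Cache API: the class `NodeCache`, its methods and the local path layout are all hypothetical.

```python
class NodeCache:
    """Toy model of the cached files held by a single task node."""

    def __init__(self):
        self._copied = set()  # (job_id, file_name) pairs already on this node
        self.copies = 0       # number of simulated network transfers

    def localize(self, job_id, file_name):
        """Return the local path of a cached file, copying it if needed."""
        key = (job_id, file_name)
        if key not in self._copied:
            # First task of this job on this node: pay for one transfer.
            self._copied.add(key)
            self.copies += 1
        # Later tasks of the same job reuse the local copy.
        return "/local/cache/%s/%s" % (job_id, file_name)


node = NodeCache()
for _ in range(3):  # three tasks of the same job run on this node
    node.localize("job_1", "lookup.txt")
print(node.copies)  # 1
```

Three tasks of the same job trigger a single transfer; a different job asking for the same file would trigger a new one in this model.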
There are a few points to consider, though. Hadoop is a batch processing system, so Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. As a result, latency for Hive queries is generally very high, even when the data sets involved are very small. Hive therefore cannot be compared with a system such as Oracle, where analyses are conducted on a significantly smaller amount of data but proceed much more iteratively, with response times between iterations of less than a few minutes.
2. Flat File of Database
The next scenario is the flat file database, which is a database designed around a single table. The flat file design puts all database information in one table, or list, with fields to represent all parameters. A flat file may contain many fields, often with duplicate data. Flat files offer the functionality to store information, manipulate fields, print or display formatted information and exchange information. Building a relational database, by contrast, depends on the user's ability to establish a relational model. The model must fully describe how the data is organized in terms of data structure, integrity, querying, manipulation and storage. Relational databases allow you to define certain record fields as keys or indexes, to perform search queries, join table records and establish integrity constraints. Search queries are faster and more accurate when based on indexed values, and table records can be easily joined on the indexed values.
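The difference between the two designs can be seen in a short Python example using the standard `sqlite3` module. It is a sketch for illustration only: the table names, column names and sample rows are invented and do not come from any Pentaho or Hadoop data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Flat design: one table, customer details repeated on every order row.
cur.execute(
    "CREATE TABLE orders_flat "
    "(order_id INTEGER, customer_name TEXT, customer_city TEXT, amount REAL)"
)
cur.executemany(
    "INSERT INTO orders_flat VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Pune", 120.0), (2, "Asha", "Pune", 80.0), (3, "Ravi", "Delhi", 50.0)],
)

# Relational design: customer details stored once, orders refer to them by key.
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(customer_id), amount REAL)"
)
# Index on the join column so lookups and joins can use indexed values.
cur.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)", [(1, "Asha", "Pune"), (2, "Ravi", "Delhi")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0)])

flat_totals = cur.execute(
    "SELECT customer_name, SUM(amount) FROM orders_flat "
    "GROUP BY customer_name ORDER BY customer_name"
).fetchall()
joined_totals = cur.execute(
    "SELECT c.name, SUM(o.amount) FROM orders o "
    "JOIN customers c ON c.customer_id = o.customer_id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()

print(flat_totals)    # [('Asha', 200.0), ('Ravi', 50.0)]
print(joined_totals)  # [('Asha', 200.0), ('Ravi', 50.0)]
```

Both designs answer the same question, but the flat table stores "Asha, Pune" twice, while the relational version stores it once and joins on the indexed key.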
This is only an introduction to a wide topic. Hadoop needs to be explored much further in ETL terms so that ever-growing business data can be moved and managed in an orderly way.
© 2012 Created by Jason Venner.