Cloudera Hadoop Offers:
§ HDFS – Self-healing distributed file system
§ MapReduce – Powerful, parallel data processing framework (see the word-count sketch after this list)
§ Hadoop Common – a set of utilities that support the Hadoop subprojects
§ HBase – Hadoop database for random read/write access
§ Hive – SQL-like queries and tables on large datasets
§ Pig – Dataflow language and compiler
§ Oozie – Workflow for interdependent Hadoop jobs
§ Sqoop – Integrate databases and data warehouses with Hadoop
§ Flume – Highly reliable, configurable streaming data collection
§ ZooKeeper – Coordination service for distributed applications
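Several of these components come up again below. As a concrete taste of the MapReduce programming model, here is the classic word-count job written against the org.apache.hadoop.mapreduce API; the input and output paths are supplied on the command line, so nothing here is specific to any one cluster.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word; also reused as a combiner.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, this runs with the usual hadoop jar wordcount.jar WordCount <input> <output>.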
The Hadoop Distributed File System (HDFS) is a distributed file
system designed to run on commodity hardware. It has many similarities with
existing distributed file systems. However, the differences from other
distributed file systems are significant. HDFS is highly fault-tolerant and is
designed to be deployed on low-cost hardware. HDFS provides high throughput
access to application data and is suitable for applications that have large
data sets. HDFS relaxes a few POSIX requirements to enable streaming access to
file system data.
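As a minimal sketch of this streaming-style access, the snippet below writes a file and reads it back through the HDFS Java API (org.apache.hadoop.fs). The NameNode address and file path are illustrative assumptions, not values the text prescribes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:8020"); // assumed NameNode address
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/hello.txt"); // hypothetical path
            // Write as a stream: HDFS supports sequential writes, not random
            // in-place updates -- one of the relaxed POSIX requirements.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }
            // Read the file back as a stream.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }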
NameNode:
The NameNode manages the namespace, file system metadata, and access control. There is exactly one NameNode in each cluster; it is the master and the DataNodes are its slaves. It holds all the information about the data (i.e. the metadata).
DataNode:
A DataNode holds the actual file system data. Each DataNode manages its own locally attached storage (i.e. the node's hard disk) and stores a copy of some or all blocks in the file system. There are one or more DataNodes in each cluster.
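To make this master/slave split concrete, the sketch below asks the NameNode where the blocks of a file live; the hosts it returns are the DataNodes storing each replica. The file path is a hypothetical placeholder, and the cluster address comes from whatever the default configuration points at.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/hello.txt")); // hypothetical path
            // The NameNode answers from its metadata: block offsets, lengths,
            // and the DataNodes holding each replica.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " on DataNodes " + Arrays.toString(block.getHosts()));
            }
        }
    }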
Install / Deploy Hadoop:
Hadoop can be installed in three modes (a minimal configuration sketch follows the list):
1. Standalone mode:
To deploy Hadoop in standalone mode, we just need to set JAVA_HOME. In this mode there is no need to start the daemons and no need to format the NameNode, since data is saved on the local disk.
2. Pseudo Distributed mode:
In this mode all the daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker) run on a single machine.
3. Fully-distributed mode:
In this mode the daemons are spread across a cluster of machines: the master daemons (NameNode, JobTracker) run on dedicated nodes, while each slave node runs a DataNode and a TaskTracker.
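From a client's point of view, the three modes differ mainly in where fs.defaultFS points and where the daemons run. A minimal sketch of that difference (the addresses, port, and hostname are illustrative assumptions, not required values):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ModeProbe {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Standalone: the default, file:/// -- everything runs in one JVM against
            //   the local disk, which is why no daemons or NameNode format are needed.
            // Pseudo-distributed: all daemons on one machine, e.g.
            //   conf.set("fs.defaultFS", "hdfs://localhost:8020");
            // Fully distributed: point clients at the master's NameNode, e.g.
            //   conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical host
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Default file system: " + fs.getUri());
        }
    }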
Besides Apache Hadoop, it's more or less a three-horse race for Hadoop distributions between Hortonworks, Cloudera, and MapR. Then there are Greenplum HD and IBM InfoSphere BigInsights.
In Apache, all the projects (Pig, Hive, etc.) are independent. Cloudera makes sure all these frameworks work properly with each other and packages them as CDH. With CDH there are regular releases, which I haven't seen with Apache. Another thing is that it's difficult to get support for Apache Hadoop, while Cloudera and the others provide commercial support for their own versions of Hadoop.
CDH Versions:
§ CDH4.2.0