Big Data Basics - Part 7 - Hadoop Distributions and Resources to Get Started


Problem

I have read the previous tips in the Big Data Basics series and I want to start working with Hadoop. What are some of the Hadoop distributions (different versions) and do you have any pointers on how to get started?

Solution

In this tip we will take a look at some of the major Hadoop distributions available in the market and also resources on getting started.

Introduction

Various providers in the market offer their own versions of Hadoop. In this tip we will cover the following questions:

  • What are the Hadoop distributions?
  • How are the Hadoop distributions different from one another?
  • Which distribution should I choose?

What are the Hadoop distributions?

Hadoop distributions available in the market are built on top of the open source Hadoop framework. The core component in each of these distributions is the same open source Hadoop framework built by the Apache Software Foundation, and it is still distributed as open source.

Hadoop contains various components/sub-projects, each with its own release cycle, which makes for a fairly complex ecosystem. A distribution manages and integrates the required versions and dependencies of these projects so that enterprises can focus on the real problem at hand. Each distribution aims to ship a stable version of the Hadoop project with all the necessary patches, along with the provider's own proprietary components.
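To make the idea of a distribution "pinning" compatible versions concrete, here is a minimal sketch in Python. The component names are real Hadoop ecosystem projects, but the version numbers, the `DISTRIBUTION_MANIFEST` dictionary, and the `find_mismatches` helper are all hypothetical and are not taken from any actual distribution.

```python
# Hypothetical manifest: the version numbers are placeholders chosen for
# illustration and do not describe any real distribution.
DISTRIBUTION_MANIFEST = {
    "hadoop": "2.4.0",
    "hive": "0.13.0",
    "pig": "0.12.1",
}


def find_mismatches(installed, manifest):
    """Return {component: (installed_version, expected_version)} for every
    component that is missing or whose version differs from the manifest."""
    mismatches = {}
    for component, expected in manifest.items():
        actual = installed.get(component)  # None when the component is absent
        if actual != expected:
            mismatches[component] = (actual, expected)
    return mismatches


# A hand-assembled cluster that has drifted from the tested combination.
installed = {"hadoop": "2.4.0", "hive": "0.12.0"}
print(find_mismatches(installed, DISTRIBUTION_MANIFEST))
```

A distribution does this kind of bookkeeping (and the compatibility testing behind it) for you, which is a large part of its value compared with assembling each project by hand.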

How are the Hadoop distributions different from one another?

The distributions differ in the components they build on top of the core Hadoop engine to make deployment, management, and maintenance of Hadoop (and other projects within the Hadoop ecosystem) simpler, faster, and more efficient. In most cases, the providers also build additional components which are proprietary to the respective distributions and are provided at a cost.

Which distribution should I choose?

Every distribution has its own pros and cons, so the answer is: it depends. The aspects to consider include cost, simplicity, performance, support, documentation, and the availability of cloud/on-premises versions. A detailed comparison of the various distributions is out of scope for this tip.
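One simple way to structure an "it depends" decision is a weighted scoring matrix. The sketch below is only an illustration: the criteria come from the paragraph above, but the weights, the 1 to 5 scores, and the names `distro_a` and `distro_b` are made-up placeholders, not an assessment of any real distribution.

```python
# Criteria are taken from the text; the weights are hypothetical and should
# reflect your own priorities (here they sum to 1.0).
WEIGHTS = {
    "cost": 0.3,
    "simplicity": 0.2,
    "performance": 0.2,
    "support": 0.2,
    "documentation": 0.1,
}

# Placeholder scores (1 = poor, 5 = excellent) for two fictional candidates.
SCORES = {
    "distro_a": {"cost": 4, "simplicity": 3, "performance": 4,
                 "support": 2, "documentation": 3},
    "distro_b": {"cost": 2, "simplicity": 5, "performance": 3,
                 "support": 5, "documentation": 4},
}


def weighted_score(scores, weights):
    """Sum of each criterion's score multiplied by its weight."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank(candidates, weights):
    """Return candidate names ordered from highest to lowest weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


for name in rank(SCORES, WEIGHTS):
    print(name, round(weighted_score(SCORES[name], WEIGHTS), 2))
```

Changing the weights (for example, raising `cost` for a small team) can reorder the result, which is exactly why no single distribution is the right answer for everyone.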

Let's take a look at some of the popular Hadoop distributions available in the market to help guide your decision.


Azure HDInsight

Azure HDInsight is Microsoft's distribution of Hadoop. The Azure HDInsight ecosystem includes the following features/components: Pig, Hive, HBase, Sqoop, Oozie, Ambari, Microsoft Avro Library, YARN, Cluster Dashboard and Tez.

Apart from the components listed above, a few other components enable reporting and analytics on top of the data stored in Azure HDInsight.

More information: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-introduction

Here are a few highlights of Azure HDInsight:

  • Azure HDInsight is based on Hortonworks Data Platform.
  • Azure HDInsight offers Apache Hadoop as a service in the Microsoft Azure cloud, thereby leveraging the benefits of cloud computing.
  • Azure HDInsight offers strong support for PowerShell via HDInsight PowerShell Cmdlets.
  • Windows Azure and HDInsight PowerShell Cmdlets can be used to upload, download, and move data between Azure Blob Storage and on-premises file systems, to configure, execute, and post-process jobs on HDInsight, and to perform other related activities.
  • Because Azure HDInsight is a Hadoop service in the cloud, you can provision a cluster, process the data, destroy the cluster, and pay only for the resources used.
  • Microsoft also offers an HDInsight Emulator which allows developers to explore HDInsight on-premises without requiring an Azure Account.
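The "provision, process, destroy" point is easiest to appreciate with some arithmetic. The sketch below compares a cluster that exists only while a nightly job runs with one that is left running all month. The node count, job duration, hourly rate, and the `cluster_cost` helper are hypothetical; this is not real Azure pricing, and storage and data transfer charges are ignored.

```python
def cluster_cost(nodes, hours, rate_per_node_hour):
    """Compute-only cost of a cluster: nodes x hours x hourly rate per node."""
    return nodes * hours * rate_per_node_hour


HOURS_PER_MONTH = 730   # approximate number of hours in a month
RATE = 0.50             # hypothetical price per node-hour, not a real quote
NODES = 4               # hypothetical cluster size

# Transient cluster: created for a 2-hour job each night, 30 nights a month.
transient = cluster_cost(NODES, hours=2 * 30, rate_per_node_hour=RATE)

# Always-on cluster: the same nodes kept running for the whole month.
always_on = cluster_cost(NODES, hours=HOURS_PER_MONTH, rate_per_node_hour=RATE)

print(f"transient: {transient:.2f}, always-on: {always_on:.2f}")
```

Under these assumed numbers the transient pattern costs a small fraction of the always-on one, since data kept in Azure Blob Storage outlives the cluster and does not require the compute nodes to stay up.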



Cloudera

Cloudera was the first company formed to build enterprise solutions based on Hadoop. Its Hadoop distribution is known as Cloudera's Distribution Including Apache Hadoop (CDH). Here is a simplified representation of Cloudera's Hadoop Ecosystem.

Cloudera's Hadoop Distribution

Source: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html

Cloudera's Hadoop Ecosystem includes the following features/components: Apache Avro, Apache Crunch, Apache DataFu, Apache Flume, Apache Hadoop, Apache HBase, Apache Hive, Hue, Cloudera Impala, Kite SDK (formerly CDK), LLAMA, Apache Mahout, Apache Oozie, Parquet, Apache Pig, Cloudera Search, Apache Sentry, Apache Spark, Apache Sqoop and Apache ZooKeeper.

More Information: http://www.cloudera.com/content/dev-center/en/home/developer-admin-resources/cdh-components.html

Here are a few highlights of CDH:

  • CDH can be deployed on-premises as well as in the cloud.
  • Cloudera Manager simplifies the deployment and management of Hadoop and other components in Cloudera's Hadoop Ecosystem.
  • Cloudera's enterprise edition, Cloudera Enterprise, is proprietary. There are three variations of it: Basic, Flex, and Data Hub.
  • The Express edition is available as a free download.
  • The Cloudera Enterprise Data Hub edition is supported on the AWS cloud.



Hortonworks

Hortonworks has a Hadoop distribution known as Hortonworks Data Platform (HDP). Here is a simplified representation of the Hortonworks Data Platform.

Hortonworks Data Platform

Source: http://hortonworks.com/hdp/

Hortonworks Data Platform includes the following features/components: Apache Hadoop, Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper, Apache Oozie, Apache Sqoop, Apache Flume, Apache Ambari, Hue, Apache Mahout, Apache Knox, Apache Storm, Apache Tez, Apache Phoenix, Apache Accumulo and Apache Falcon.

More Information: http://hortonworks.com/hadoop/

Here are a few highlights of Hortonworks Data Platform:

  • Can be deployed on-premises as well as in the cloud.
  • Supports deployment on Linux as well as Windows platforms.
  • HDP is built in the open through Apache projects.



Amazon Elastic MapReduce (EMR)

Amazon Web Services (AWS) Elastic MapReduce (EMR) was among the first Hadoop offerings available in the market. Here is a high-level architecture/job flow of Amazon EMR.

Amazon's Elastic MapReduce (EMR)

Source: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html

Amazon EMR contains most of the popular features/components like Hive, Pig, HBase, DistCp, Ganglia, etc. integrated into it.

Here are a few highlights of Amazon EMR:

  • EMR is a Hadoop distribution in the cloud.
  • Leverages AWS's Elastic Compute Cloud (EC2) for computation.
  • Leverages AWS's Simple Storage Service (S3) for storage.
  • Is tightly integrated with other AWS services.
  • Deployment and management are simplified using the AWS Management Console and AWS Toolkit.



MapR

MapR is another major distribution available in the market. Below is a simplified architecture of the MapR Data Platform.

MapR Data Platform

Source: http://www.mapr.com/products/product-overview/overview

Here are a few highlights of MapR:

  • MapR is available in the cloud through some of the leading cloud providers: Amazon Web Services (AWS), Google Compute Engine, CenturyLink Technology Solutions, and OpenStack.
  • MapR integrates/supports more than 20 open source projects.
  • MapR supports multiple versions of the individual projects it integrates into its data platform. This gives users the flexibility to migrate to newer versions at their own pace.


Apart from the distributions listed above, there are various other distributions available in the market from leading providers like Intel, Oracle, HP, and many others.

About the author
Dattatrey Sindol has 8+ years of experience working with SQL Server BI, Power BI, Microsoft Azure, Azure HDInsight and more.

This author pledges the content of this article is based on professional experience and not AI generated.




Comments For This Article




Sunday, September 21, 2014 - 9:56:03 PM - Sam

Hello,

Thank you for these wonderful articles on Big Data Basics. I enjoyed reading them.

BTW, do you have a PDF version of all the articles that you could share with me? I would be much obliged.

Thank you. Have a nice day.

Regards,

SD

 














