Getting Started with HDInsight - Part 1 - Introduction to HDInsight


By: Dattatrey Sindol (Datta)   |   Last Updated: 2014-11-26   |   Comments (7)   |   Related Tips: More > Big Data

Problem

I have read the previous tips on Big Data Basics series of tips and have an overview of Big Data, Hadoop, and the related concepts. My current employer is more of a Microsoft shop and we use Microsoft technologies day-in and day-out. My employer is planning to implement one of our upcoming projects using Hadoop. So I would like to know about Microsoft's Hadoop offering and get prepared before the start of the project.

Solution

In this tip, we will take a look at Microsoft's Hadoop offering and get an overview of Microsoft's Hadoop Distribution.

Microsoft has a Hadoop Distribution known as HDInsight. Like other distributions available in the market, HDInsight contains Apache Hadoop as its core engine.

Hadoop vs. HDInsight Architecture

Typical Hadoop Architecture

As we have seen in the Big Data Basics Tip Series, here is a typical architecture of Apache Hadoop.

[Figure: Typical architecture of Apache Hadoop]

Source - Big Data Basics - Part 3 - Overview of Hadoop

Here are a few highlights of the Apache Hadoop architecture:

  • Hadoop works in a master-worker / master-slave fashion.
  • HDFS (Hadoop Distributed File System) offers highly reliable storage, even on commodity hardware, by replicating the data across multiple nodes.
  • MapReduce offers an analysis system which can perform complex computation on large datasets.
  • Master contains the Namenode and Job Tracker components.
    • Namenode holds the information about all the other nodes in the Hadoop Cluster, files present in the cluster, constituent blocks of files and their locations in the cluster, and other information useful for the operation of Hadoop Cluster.
    • Job Tracker keeps track of the individual tasks/jobs assigned to each of the nodes and coordinates the exchange of information and results.
  • Each Worker / Slave node contains the Task Tracker and Datanode components.
    • Task Tracker is responsible for running the task assigned to it.
    • Datanode is responsible for holding the data.
  • The nodes in the cluster can be located anywhere; Hadoop does not depend on the physical location of a node.
  • Apache Hadoop is at the core of various Hadoop Distributions available on the market.
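The map/shuffle/reduce flow described above can be sketched in a few lines of Python. This is a toy, in-process simulation of the MapReduce programming model, not the actual Hadoop API; the function names are illustrative only.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: sum all counts emitted for the same word."""
    return (word, sum(counts))

def word_count(lines):
    # Map step (in Hadoop, spread across Task Trackers on worker nodes)
    pairs = [kv for line in lines for kv in map_phase(line)]
    # Shuffle/sort step: group intermediate pairs by key
    pairs.sort(key=itemgetter(0))
    # Reduce step: one reduce call per distinct key
    return dict(
        reduce_phase(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

print(word_count(["big data basics", "big data"]))
```

In a real cluster the map and reduce calls run in parallel on the worker nodes and the shuffle moves data between them; the logical flow, however, is exactly this.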


Typical HDInsight Architecture

Here is a typical architecture of Microsoft HDInsight.

[Figure: Typical architecture of Microsoft HDInsight]

Here are the highlights of Microsoft HDInsight Architecture:

  • HDInsight is built on top of Hortonworks Data Platform (HDP).
  • HDInsight is 100% compliant with Apache Hadoop.
  • HDInsight is offered in the cloud as Azure HDInsight on Microsoft's Azure Cloud.
  • HDInsight is tightly integrated with Azure Cloud and various other Microsoft Technologies.
  • HDInsight can be installed on the Windows operating system, unlike the majority of distributions, which are based on the Linux operating system.
  • HDInsight also works in a master-slave fashion. Master / Control Node (Head node) controls the overall operation of the Cluster. A Secondary Head node is also included as part of Azure HDInsight deployments.
  • HDInsight can be configured to store the data either on HDFS within HDInsight cluster nodes or on Azure Blob Storage. The most common approach is to use Azure Storage to store the data, intermediate results, and the output and not store data on individual nodes.
  • User data (Data to be processed) and job metadata resides in Windows Azure Storage - Blob (WASB). WASB is an implementation of HDFS on Azure Blob Storage.
  • Individual nodes of the cluster offer MapReduce functionality.
  • The Master node reads the job metadata and the user data from Blob Storage and uses it to do the processing. The intermediate and the final results are stored in the Blob Storage.
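Data in WASB is addressed with a `wasb://` URI (or `wasbs://` over SSL) of the form `wasb://<container>@<account>.blob.core.windows.net/<path>`. A small helper can illustrate the scheme; the regex and function name below are assumptions for this sketch, not part of any Azure SDK.

```python
import re

# Pattern for the WASB addressing scheme (illustrative, not from any SDK):
# wasb[s]://<container>@<account>.blob.core.windows.net/<path>
WASB_PATTERN = re.compile(
    r"^wasbs?://(?P<container>[^@/]+)@(?P<account>[^./]+)"
    r"\.blob\.core\.windows\.net/(?P<path>.*)$"
)

def parse_wasb_uri(uri):
    """Split a WASB URI into its container, storage account, and blob path."""
    match = WASB_PATTERN.match(uri)
    if not match:
        raise ValueError("not a WASB URI: " + uri)
    return match.groupdict()

print(parse_wasb_uri("wasb://data@mystore.blob.core.windows.net/input/logs.txt"))
```

A job submitted to the cluster can refer to its input and output with such URIs, so the same blob is reachable from every node without being copied onto local disks first.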

Storing Data on Azure Blob Storage - A Few Highlights

This unique architecture of HDInsight, which stores the data on Azure Blob Storage, offers various advantages over other Hadoop distributions. Here are a few highlights:

  • Cluster can be provisioned and destroyed as and when required and the data is still available on the Blob Storage even after the cluster is destroyed. This is highly cost effective as we don't need to keep the cluster active to access the data.
  • Storing the data on Blob Storage, which is a common storage, allows other tools/processes to access/use this data for other processing, reporting, etc.
  • Azure Storage Account (associated with the Cluster) and the HDInsight Cluster (to be created) should be located in the same data center.
  • Microsoft implements Azure Flat Network Storage technology to offer a high speed connectivity between WASB and compute nodes.

Different ways of deploying Hadoop on Microsoft's Azure Cloud

There are a couple of different ways to deploy HDInsight in the Microsoft Azure Cloud.

Manual Deployment

We can manually deploy and configure Hadoop on Microsoft's Azure Cloud. Here are a few highlights of this approach:

  • Provision Virtual Machines on Azure (IaaS). VMs with any kind of operating system, depending upon the Hadoop distribution, can be provisioned.
  • Install Hadoop on those virtual machines. Core Hadoop or other Hadoop distributions can be installed.
  • Configure those VMs as a Hadoop cluster.
  • This manual process of deploying Hadoop involves a lot of time and effort and is prone to errors unless worked on by experienced Hadoop Administrators.

Deployment using Azure HDInsight

We can deploy Hadoop as Azure HDInsight in a few simple, proven, and tested steps defined by Microsoft, which take care of all the necessary configuration, Hadoop version management, etc. Here are a few highlights of this approach:

  • Easy to Deploy - We can deploy the cluster in less time with tested and proven methods / steps defined by Microsoft (via the Azure portal or through Windows PowerShell).
  • Provision a cluster with the required number of nodes, depending upon processing needs, and destroy the cluster when you are done. This way, you pay only for what you use.
  • By using Azure Storage to store the data, one can retain the data even after the cluster is destroyed, unlike other distributions, where the data resides on the data nodes and the cluster must be available to access that data.
  • The data stored in Azure storage can be accessed / used by other processes / applications as appropriate.
  • With Azure's flexibility, one can easily scale up / down the resources depending upon the usage / need.
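The compute/storage separation behind these points can be illustrated with a toy in-memory model (purely illustrative; the class and names are invented for this sketch): a cluster is provisioned, runs a job against shared storage, and is destroyed, yet both the input and the output outlive it.

```python
blob_storage = {}  # stands in for an Azure Storage account shared by jobs

class ToyCluster:
    """A stand-in for an HDInsight cluster: compute only, no durable state."""
    def __init__(self, storage, nodes):
        self.storage = storage
        self.nodes = nodes
        self.alive = True

    def run_job(self, in_key, out_key):
        # Read input from shared storage, process it, write results back.
        words = self.storage[in_key].split()
        self.storage[out_key] = str(len(words))

    def destroy(self):
        # Tearing down the cluster releases compute but not the storage.
        self.alive = False

blob_storage["input/doc.txt"] = "hdinsight stores data in blob storage"
cluster = ToyCluster(blob_storage, nodes=4)
cluster.run_job("input/doc.txt", "output/count.txt")
cluster.destroy()

# The cluster is gone, but the input and the output are still in storage.
print(blob_storage["output/count.txt"])  # prints "6"
```

This is exactly the pay-for-what-you-use pattern described above: provision for the job, compute, destroy, and come back later for the results.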

Now that we have built the foundation by understanding the basics of Microsoft's HDInsight offering, we will start exploring other aspects of this distribution in the next tips in this series. Stay tuned!


About the author
MSSQLTips author Dattatrey Sindol (Datta) has 8+ years of experience working with SQL Server BI, Power BI, Microsoft Azure, Azure HDInsight and more.


Monday, December 08, 2014 - 1:51:36 AM - Dattatrey Sindol (Datta)

Hi Sridevi,

If you can provide more details on your scenario, I would be able to provide some recommendations/inputs.

Best Regards,

Dattatrey Sindol (Datta)


Saturday, December 06, 2014 - 5:36:10 AM - Sridevi

Hi Datta,

 

We are trying to create an SSAS data cube using the HDInsight environment. Have you come across a scenario like this? If yes, your suggestions would really help us. Please share them.

 

Regards,

Sridevi.


Friday, December 05, 2014 - 11:55:06 PM - Sridevi

Hi Datta,

 

Yes. Thank you..

 

Regards,

Sridevi.


Wednesday, December 03, 2014 - 1:46:07 PM - Dattatrey Sindol (Datta)

Hi Sridevi,

In an HDInsight cluster, when we store the data on Blob Storage, the Storage Account's redundancy options are used by the cluster as well.

In a typical HDInsight implementation, wherein data is stored on Blob Storage, all your data nodes access the data from the same location - Windows Azure Storage Blob (WASB). However, if you deploy HDInsight to use local storage, the individual compute nodes will store the data themselves and the regular Hadoop principles apply.

Since the data is stored in WASB in a typical HDInsight implementation, data nodes access the data based on the location specified by the user and/or the metadata maintained by HDInsight.

Hope that answers your questions. Stay tuned!

Best Regards,

Dattatrey Sindol (Datta)


Monday, December 01, 2014 - 11:56:58 PM - Sridevi

Hi Datta,

Thanks for the clarification.

In a Hadoop cluster, data is replicated and stored across the cluster nodes for fault tolerance.

In an HDInsight cluster, what actually happens?

Storage accounts have redundancy options. Are those used by the cluster as well?

If I am using an HDInsight cluster with 1 namenode and 2 datanodes, and one storage account with a container in it, will that one container be accessed by all the datanodes while processing with Pig/Hive? Don't the datanodes have separate storage?

What about the block reports and everything else that the Hadoop namenode maintains - how does that work in an HDInsight cluster?

Thanks,

Sridevi.

 


Monday, December 01, 2014 - 12:37:20 PM - Dattatrey Sindol (Datta)

Hi Sridevi,

 

HDInsight requires one container designated as its root container, where it can store its system files. By default, it uses this root container to store the data as well. However, if you want to store the data in different containers and access that data, you can do that too. HDInsight allows us to access data present not only in different containers but also in different storage accounts.

Thank you and stay tuned for future tips.

 

Best Regards,

Dattatrey Sindol (Datta)


Monday, December 01, 2014 - 5:07:51 AM - Sridevi

Hi

I have a doubt. While creating an HDInsight cluster, only one container is created. Will that container be accessed by all the datanodes in the cluster?

In the "typical architecture of HDInsight" you mentioned above, datanodes access data from different containers. Can you please explain that to me?

Thanks,

Sridevi

