I have read the earlier tip, Introduction to Microsoft Cortana Intelligence Suite, and gained a fair understanding of the offering. I would like to know more about the Information Management components and what they offer.
Information Management is the first of several steps involved in transforming data into intelligent actions using the Cortana Intelligence Suite. Let's look at its offerings.
Overview of Information Management in Microsoft Cortana Intelligence Suite
Information Management is the Data Ingestion step in the process of Transforming Data into Intelligent Actions. This step is similar to the Data Ingestion step in a typical Data Warehouse and BI Architecture.
Information Management comprises the following three Azure offerings:
- Azure Event Hubs
- Azure Data Factory
- Azure Data Catalog
Azure Event Hubs in Microsoft Cortana Intelligence Suite
Azure Event Hubs is a data ingestion service in the Azure cloud that can ingest millions of events per second from a variety of sources. It works in a publish-subscribe (pub-sub) fashion and is highly scalable, enabling the collection of massive amounts of event and telemetry data. The following block diagram shows, at a high level, how a typical Event Hub works and how it fits between Event Generators and Event Consumers.
Here are the highlights of Event Hubs:
- An Event Hub is part of a Service Bus namespace in Azure
- Each Event Hub contains a minimum of 2 and a maximum of 32 partitions, which are independent of each other
- Data in partitions expires after a specified amount of time, which is configurable at the Event Hub level; the same setting applies to all partitions within the Event Hub
- Expired events are automatically deleted from their partitions, so manual deletion of events from an Event Hub is not required
- An Event Hub can also be used in a Lambda Architecture to feed two different processing engines: a real-time processing engine like Azure Stream Analytics and a batch processing engine like Azure HDInsight
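To make the partition and retention behavior described above more concrete, here is a minimal in-memory sketch in Python. It does not use the Azure SDK and is not how Event Hubs is implemented. The class `ToyEventHub`, its methods, the hash-based partition choice, and the per-reader offsets are all assumptions made for illustration. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
import hashlib
from collections import defaultdict


class ToyEventHub:
    """Hypothetical in-memory model: independent partitions plus time-based expiry."""

    def __init__(self, partition_count=4, retention_seconds=60.0):
        # The article cites a range of 2 to 32 partitions per Event Hub.
        if not 2 <= partition_count <= 32:
            raise ValueError("partition_count must be between 2 and 32")
        # One retention setting applies to every partition in the hub.
        self.retention_seconds = retention_seconds
        # Each partition is an independent list of (sequence, timestamp, body).
        self.partitions = [[] for _ in range(partition_count)]
        self.next_seq = [0] * partition_count
        # Each reader keeps its own position per partition (an assumption here).
        self.offsets = defaultdict(lambda: [0] * partition_count)

    def _partition_for(self, key):
        # Stable hash so the same key always lands in the same partition.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partitions)

    def send(self, key, body, now):
        """Append an event and return the index of the partition it went to."""
        index = self._partition_for(key)
        self.partitions[index].append((self.next_seq[index], now, body))
        self.next_seq[index] += 1
        return index

    def expire(self, now):
        """Drop events older than the retention window from every partition."""
        cutoff = now - self.retention_seconds
        for i, events in enumerate(self.partitions):
            self.partitions[i] = [e for e in events if e[1] > cutoff]

    def receive(self, reader, index):
        """Return events the given reader has not yet seen in one partition."""
        start = self.offsets[reader][index]
        fresh = [(seq, body) for seq, _, body in self.partitions[index] if seq >= start]
        if fresh:
            self.offsets[reader][index] = fresh[-1][0] + 1
        return [body for _, body in fresh]


if __name__ == "__main__":
    hub = ToyEventHub(partition_count=2, retention_seconds=10)
    p = hub.send("device-1", "temp=20", now=0)
    hub.send("device-1", "temp=21", now=5)
    # Two independent readers, as in a Lambda Architecture.
    print(hub.receive("realtime-engine", p))
    print(hub.receive("batch-engine", p))
    hub.expire(now=12)  # the event sent at time 0 is now past retention
    print(hub.partitions[p])
```

Because each reader tracks its own position, the real-time reader and the batch reader both see the full stream without affecting each other, which is the property the Lambda Architecture bullet relies on.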
Refer to the following resources to learn more about Azure Event Hubs:
Azure Data Factory in Microsoft Cortana Intelligence Suite
Azure Data Factory is a fully managed, cloud-based data integration and orchestration service in the Azure cloud. It is predominantly used for data movement and transformation, similar to an on-premises ETL tool. It enables businesses to perform and automate large-scale data movement and transformation in the cloud.
Below are the components/activities/processes involved in a typical data analytics process/project. Azure Data Factory enables us to perform various activities in this process and orchestrate the process.
Azure Data Factory is built around the following four key components/concepts:
- Activity defines the task/action to be performed. Activities enable functionality such as data extraction, data movement, data transformation, and the orchestration of these. For those of you coming from the SSIS world, think of an Activity as something analogous to a Task in SSIS. Activities come in different types, such as Data Movement Activity and Data Transformation Activity, and depending on its type an Activity can take zero or more datasets as input and produce datasets as output. Common examples include a Copy Activity to copy data, a Stored Procedure Activity to execute a stored procedure, and a Hive Activity to execute a Hive query against an existing or new (on-demand) HDInsight cluster.
- Dataset is a reference to the data that will be consumed or produced as part of an Activity. Common examples of a Dataset include tables and files. For folks from the SSIS world, think of a Dataset as something analogous to the dataset/metadata info contained in a Source or Destination component in a Data Flow Task.
- Pipeline is a logical grouping of one or more Activities. The Activities are grouped together and interconnected, based on their dependencies, to perform a prescribed task and achieve a common goal. For folks from the SSIS world, think of a Pipeline as something analogous to a Control Flow containing various tasks interconnected with Precedence Constraints.
- Linked Service defines the information required to connect to external resources like data stores, HDInsight clusters, etc. Depending on how it is used, a Linked Service can be one of two types:
  - A Linked Service that contains the information necessary to connect to a Data Store. One or more Datasets can be defined by referencing data within a Linked Service that holds the connection information of a Data Store. For folks from the SSIS world, think of it as something analogous to a Connection Manager for a Data Source/Destination.
  - A Linked Service that contains the information necessary to connect to a Compute Resource that hosts the execution of an Activity, such as the connection information of an HDInsight cluster used for executing Hive/Pig queries or scripts.
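The relationship between these four concepts can be sketched in plain Python. This is only a conceptual model, not the Data Factory object model or SDK. The class names mirror the terms above, but every field, the `execution_order` method, and the sample names (`BlobStore`, `CleanWithHive`, and so on) are hypothetical. The sketch assumes that one Activity depends on another when it consumes a Dataset the other produces.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class LinkedService:
    name: str
    kind: str  # "data_store" or "compute" (the two types described above)


@dataclass(frozen=True)
class Dataset:
    name: str
    linked_service: LinkedService  # where the data lives


@dataclass
class Activity:
    name: str
    inputs: List[Dataset] = field(default_factory=list)
    outputs: List[Dataset] = field(default_factory=list)
    compute: Optional[LinkedService] = None  # e.g. a cluster that runs a Hive query


@dataclass
class Pipeline:
    name: str
    activities: List[Activity]

    def execution_order(self):
        """Order activities so each runs after the ones that produce its inputs."""
        # Map each dataset name to the activity that produces it.
        producer = {ds.name: act.name for act in self.activities for ds in act.outputs}
        # For every activity, the set of upstream activities it still waits on.
        pending = {
            act.name: {producer[ds.name] for ds in act.inputs if ds.name in producer}
            - {act.name}
            for act in self.activities
        }
        order = []
        while pending:
            ready = sorted(name for name, deps in pending.items() if not deps)
            if not ready:
                raise ValueError("cyclic dependency between activities")
            for name in ready:
                order.append(name)
                del pending[name]
            for deps in pending.values():
                deps.difference_update(ready)
        return order


# Hypothetical example: clean raw logs with Hive, then copy the result to SQL.
blob = LinkedService("BlobStore", "data_store")
sql = LinkedService("SqlDatabase", "data_store")
cluster = LinkedService("HDInsightCluster", "compute")

raw = Dataset("RawLogs", blob)
cleaned = Dataset("CleanedLogs", blob)
report = Dataset("ReportTable", sql)

hive = Activity("CleanWithHive", inputs=[raw], outputs=[cleaned], compute=cluster)
copy = Activity("CopyToSql", inputs=[cleaned], outputs=[report])

pipeline = Pipeline("LogPipeline", [copy, hive])
print(pipeline.execution_order())
```

Even though `CopyToSql` is listed first, it is ordered after `CleanWithHive` because it consumes the `CleanedLogs` Dataset that the Hive Activity produces. This plays the role that Precedence Constraints play in an SSIS Control Flow.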
Here are the highlights of Azure Data Factory:
- Can connect and work across various sources/systems including the ones in the cloud, SaaS applications, and on-premises systems
- Enables big data processing and predictive analytics by connecting to HDInsight cluster(s) and Azure Machine Learning respectively
- Offers a dedicated monitoring and management user interface to monitor jobs, set up alerts, view lineage, re-run failed jobs, and a lot more
- Uses a JSON syntax for authoring, and can be authored in different ways including Visual Studio, the Azure Portal, and Azure PowerShell
- Available as a service in select Azure regions
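To give a feel for JSON-based authoring, the sketch below builds a small pipeline definition as a Python dictionary and serializes it to JSON. The layout (`name`, `properties`, `activities`, `inputs`, `outputs`) is an illustrative assumption and should not be taken as the exact Data Factory schema. The helper `referenced_datasets` and all sample names are hypothetical as well. Consult the official documentation for the real format.

```python
import json

# Illustrative structure only; field names are assumptions, not the official schema.
pipeline_definition = {
    "name": "LogPipeline",
    "properties": {
        "description": "Copy cleaned logs into a reporting table",
        "activities": [
            {
                "name": "CopyToSql",
                "type": "Copy",
                "inputs": [{"name": "CleanedLogs"}],
                "outputs": [{"name": "ReportTable"}],
            }
        ],
    },
}


def referenced_datasets(definition):
    """Collect the names of all datasets used as input or output by any activity."""
    names = set()
    for activity in definition["properties"]["activities"]:
        for ref in activity.get("inputs", []) + activity.get("outputs", []):
            names.add(ref["name"])
    return names


if __name__ == "__main__":
    # The JSON text is what would be kept in source control or pasted into an editor.
    print(json.dumps(pipeline_definition, indent=2))
    print(sorted(referenced_datasets(pipeline_definition)))
```

Keeping the definition as plain JSON text means the same file can be edited in any of the authoring tools listed above.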
Refer to the following resources to learn more about Azure Data Factory:
- Azure Data Factory Pricing
- Getting Started with Azure Data Factory
- Azure Data Factory Learning Path
- Azure Data Factory Frequently Asked Questions
- Azure Data Factory Customer Case Studies
Azure Data Catalog in Microsoft Cortana Intelligence Suite
Azure Data Catalog is a fully managed cloud service which enables businesses to manage and discover the data sources relevant to the enterprise and all the related information about the data sources.
Azure Data Catalog enables us to maintain a lot of useful information about data sources, including, but not limited to, the following:
- List of Data Sources
- Metadata Information
- Sample Data for Preview
- Connection Information
- Owner/SME Contact Information
- Other relevant information/documentation about data source
Below is a block diagram showing what Azure Data Catalog consists of and how users interact with it.
Here are a few highlights of Azure Data Catalog:
- Enables users to manage (publish, update, or delete) and discover data sources, reducing the time otherwise spent finding a data source and its related information through the traditional approach of reaching out to different people within the organization
- Helps users understand a data source, equipping them with the necessary information before they decide to consume the data
- Enables users to decide whether to consume a data source and then to consume the data through the appropriate channels and connection information, based on the privileges they are entitled to
- Enables users to contribute to the catalog by adding missing information and correcting or enhancing existing information about data sources, making the information more relevant, more accurate, and more useful to a broader audience within the enterprise
- Can be useful for users in different roles in an organization including Data Stewards, Business Analysts, Data Developers, and any other users who need to work with the data
- Provides necessary security controls to ensure that the information is viewed and data is accessed by only those users who have the necessary privileges
- Offers REST APIs for integrating with existing tools and applications
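The publish, annotate, and discover workflow described above can be illustrated with a small in-memory sketch. It does not call the Azure Data Catalog REST APIs or reproduce its security model. The classes `CatalogEntry` and `ToyDataCatalog`, their fields, and the simple per-entry user allow-list are assumptions chosen only to show the idea.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class CatalogEntry:
    name: str
    connection_info: str
    owner: str  # owner/SME contact
    description: str = ""
    tags: Set[str] = field(default_factory=set)
    allowed_users: Set[str] = field(default_factory=set)


class ToyDataCatalog:
    """Hypothetical in-memory catalog: publish, annotate, delete, discover."""

    def __init__(self):
        self._entries = {}

    def publish(self, entry):
        # Publishing an existing name replaces (updates) the entry.
        self._entries[entry.name] = entry

    def delete(self, name):
        self._entries.pop(name, None)

    def annotate(self, name, tag=None, description=None):
        """Let users enrich an entry with tags or a better description."""
        entry = self._entries[name]
        if tag:
            entry.tags.add(tag.lower())
        if description:
            entry.description = description

    def discover(self, term, user):
        """Return names of entries matching the term that the user may see."""
        term = term.lower()
        matches = []
        for entry in self._entries.values():
            if user not in entry.allowed_users:
                continue  # hide entries the user has no privileges for
            searchable = " ".join([entry.name, entry.description, *entry.tags]).lower()
            if term in searchable:
                matches.append(entry.name)
        return sorted(matches)


if __name__ == "__main__":
    catalog = ToyDataCatalog()
    catalog.publish(
        CatalogEntry(
            name="SalesDW",
            connection_info="hypothetical-server/SalesDW",
            owner="alice",
            allowed_users={"alice", "bob"},
        )
    )
    catalog.annotate("SalesDW", tag="Revenue", description="Daily sales facts")
    print(catalog.discover("revenue", user="bob"))
    print(catalog.discover("revenue", user="carol"))
```

The annotation step is what makes the entry discoverable by a business term like "revenue", which is the crowdsourced enrichment the highlights describe.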
Refer to the following resources to learn more about Azure Data Catalog:
- Azure Data Catalog Pricing
- Getting Started with Azure Data Catalog
- Azure Data Catalog Frequently Asked Questions
- Azure Data Catalog Common Scenarios
- Azure Data Catalog Supported Data Sources
- Sign up for a free trial Azure subscription, if you don't have one already, and start giving the above services a try.
- Stay tuned to learn more about the other major components of the Cortana Intelligence Suite
Last Update: 8/24/2016