Azure Data Catalog Tutorial and Overview

Problem

As a Business Intelligence (BI) developer, you have just deployed a data model on Azure Analysis Services, with an Azure SQL Database as the source of your tabular data model. Your organization has now asked you to help with data discovery. Many more data models are due to be deployed soon, which will make it difficult to know what data lies in which model.

This is particularly important when your organization has invested heavily in a business intelligence solution and wants quick discovery of its enterprise data assets in addition to BI-powered analysis and reporting.

Solution

The solution is to register your Azure Analysis Services databases (data models) with Azure Data Catalog to make the data understandable and discoverable.

Azure Data Catalog Overview

Let’s do an overview of Azure Data Catalog and some of the key terms directly or indirectly used to describe it.

About Azure Data Catalog

According to Microsoft documentation, Azure Data Catalog is a fully managed cloud service that helps with the following:

Data Discovery
Data Understanding
Data Consumption

In other words, Azure Data Catalog is a central repository that holds information about the data sources registered with it, much like a phone directory.

Azure Data Catalog facilitates quick discovery of data sources as well as understanding the use of those data sources.

Azure Data Catalog also serves as a single, central location where everyone in the organization can contribute metadata about data sources.

Let’s go through some key terms used with Azure Data Catalog:

Data Discovery

Data discovery, in the context of Azure Data Catalog, means ensuring that data sources can be found by every data user who is permitted to discover them.

Data Understanding

Data understanding means that the users who discover a data source also understand it. Simply finding a data source is not enough; one also needs to know what it contains and what it is for.

Data Consumption

The purpose of discovering and understanding data is not complete if it cannot be consumed. So, Azure Data Catalog helps the data users to consume data sources in different ways.
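The three terms above can be sketched with a small in-memory model. This is only an illustration of the idea of a central metadata repository and not the Azure Data Catalog API: the class names, the method names, the user names and the placeholder connection string are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DataSourceEntry:
    """Metadata about one registered data source (hypothetical model)."""
    name: str
    description: str
    connection_info: str
    allowed_users: set = field(default_factory=set)


class InMemoryCatalog:
    """Toy stand-in for a central metadata repository; not the Azure service."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Registration stores metadata only; the data stays in its source.
        self._entries[entry.name] = entry

    def discover(self, user, keyword):
        # Data discovery: names of matching sources the user is permitted to see.
        keyword = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if user in e.allowed_users
            and (keyword in e.name.lower() or keyword in e.description.lower())
        )

    def understand(self, name):
        # Data understanding: the description explains what the source is for.
        return self._entries[name].description

    def consume(self, name):
        # Data consumption: return what a client tool needs in order to connect.
        return self._entries[name].connection_info


catalog = InMemoryCatalog()
catalog.register(DataSourceEntry(
    name="SalesModel",
    description="Tabular model of monthly sales",
    connection_info="asazure://example-region/example-server",  # placeholder value
    allowed_users={"analyst"},
))
print(catalog.discover("analyst", "sales"))  # permitted user finds the model
print(catalog.discover("guest", "sales"))    # unpermitted user finds nothing
```

Note that the sketch stores only metadata about the data source, which matches the phone directory analogy: the catalog tells you where the data is and how to reach it, while the data itself stays where it is.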

Data Users (Producers and Consumers)

A data user can be any person, such as a business intelligence developer, data analyst or data scientist, who is interested in discovering and understanding the organization’s data assets to meet business needs.

A data user either produces data, consumes data, or does both.

We can group data users into the following two types:

Data Producers – Data producers are responsible for producing data by creating and managing processes to maintain data (data sources).
Data Consumers – Data consumers are those who would like to consume the data (which has been made available by data producers) using the reporting or analysis tool of their choice.

It is possible for someone to be both a data producer and consumer at the same time.

Tribal Knowledge vs. Azure Data Catalog

Let us now compare a less common term, "tribal knowledge", with Azure Data Catalog in the light of Microsoft documentation.

We cannot compare tribal knowledge with Azure Data Catalog unless we are familiar with both.

Tribal Knowledge

A good way to understand tribal knowledge is to consider what happens when an employee joins a new organization.

Despite having the relevant skills, experience and education, the new employee needs a lot of time to understand the business domain, which includes the enterprise data assets and their data sources.

In other words, the new employee lacks tribal knowledge in the beginning, while experienced colleagues already have it.

Tribal knowledge is the knowledge of a particular (business) domain, including its data assets, such as data sources, and how they are discovered, understood and consumed.

Obtaining tribal knowledge is typically a slow process and often requires gathering and analyzing information from different departments before one gets the know-how of the domain, including its data assets.

According to Microsoft documentation, getting tribal knowledge about data assets so that they can be discovered, understood and consumed is a very challenging process from both the data consumer and the data producer perspective.

Learning Curve for Data Consumer (Tribal Knowledge)

A first-time data consumer needs to know the following before data can be consumed:

Existence of the Data Source

For example, the required data source might not exist at all, or several candidate data sources might exist.

Location of the Data Source

Even if a first-time data consumer somehow discovers that the required data source exists, the next challenge is to find its location and to work out how to connect to it with the desired client tool.

Intended use of the Data Source

Knowing the location is not enough; the data consumer must also understand the intended use of the data source.

Locating Documentation

The data user cannot understand the intended use of the data source completely unless its documentation is found.

Locating Data Expert

A first-time data consumer needs the help of a data expert to become familiar with the information asset.

Process to Access Data Source

Most importantly, the data consumer has to understand the process to get access to the data source, because if access to the required data source is restricted then this data source is of no use to the data user.

Data Assets Documentation Demand for Data Producers

Getting tribal knowledge is not an issue for experienced data producers, but they face a very high demand to document enterprise data assets and have their own concerns when it comes to documenting data sources.

Data producers face the following issues:

Documentation Synching

One of the biggest challenges is keeping the data source documentation in sync with how the data source is actually used. This requires consistent review and a sophisticated documentation management system, which is often left out.

Data Description embedded with Data Sources

Embedding a data description in the data source itself is possible, but client applications that consume the data source easily ignore it, and such descriptions are very hard to standardize.

Data Source Access

Data producers are also responsible for determining who can access enterprise data assets and for defining the standard procedure to request access to a data source. This procedure is very difficult to document and is often not found by data consumers.

Azure Data Catalog is a central repository for managing data assets, including their descriptions, other forms of documentation and data source access information. As such, it addresses the above-mentioned concerns of both data consumers and data producers as part of database lifecycle management.
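To make the consumer learning curve concrete, here is a minimal sketch of a completeness check for a catalog entry. It assumes a hypothetical entry represented as a plain dictionary; the field names are invented for this example and simply mirror the questions a first-time data consumer has to answer (location, intended use, documentation, data expert and access process).

```python
# Hypothetical field names that mirror the consumer questions in this section.
REQUIRED_FIELDS = (
    "location",
    "intended_use",
    "documentation",
    "data_expert",
    "access_process",
)


def missing_fields(entry):
    """Return the required fields that are absent or blank in a catalog entry."""
    return [
        name for name in REQUIRED_FIELDS
        if not str(entry.get(name) or "").strip()
    ]


# Example entry that answers only two of the five questions.
entry = {
    "location": "asazure://example-region/example-server",  # placeholder value
    "intended_use": "Monthly sales reporting",
    "documentation": "",
}
print(missing_fields(entry))
```

An entry with an empty result from this check answers every question on the list, which is exactly the information a new team member would otherwise have to collect through tribal knowledge.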

Data Assets and Data Annotation in Azure Data Catalog

Let’s take a closer look at data assets and data annotation in Azure Data Catalog.

About Enterprise data assets

As already mentioned, the term data assets typically means data sources.

The data assets (data sources) are not restricted to Analysis Services databases; they can belong to any database system that is part of the organization.

The data assets registered with Azure Data Catalog can be from any database system including the following:

Line-of-Business data sources
Online Transaction Processing (OLTP) data sources
Online Analytical Processing (OLAP) data sources
Business Intelligence/Analytics data sources

Although users mostly need data sources for analysis and reporting, they may also need them for other tasks such as application development, data science and even database development.

Data Annotation and Data Access

As per Microsoft documentation, data sources can be annotated in Azure Data Catalog, which means you can add the following:

Tags
Description
Documentation
Information about the data controller
Information to access the data source

Anyone can enrich the metadata of a data source or even register a data source to be discovered, annotated and consumed.
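The annotation types listed above can be pictured as fields on a registered asset. The following is a minimal sketch under stated assumptions: the class, the field names and the helper functions are hypothetical and are not part of any Azure Data Catalog SDK. It shows how contributions from several users could be merged onto one asset.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedAsset:
    """One registered data source plus its annotations (illustrative names)."""
    name: str
    tags: set = field(default_factory=set)
    descriptions: dict = field(default_factory=dict)  # contributor -> description
    documentation: str = ""
    data_controller: str = ""
    access_info: str = ""


def add_tag(asset, tag):
    # Normalise case and whitespace so "Sales" and " sales " become one tag.
    asset.tags.add(tag.strip().lower())


def add_description(asset, contributor, text):
    # Each contributor keeps their own description of the same asset.
    asset.descriptions[contributor] = text


asset = AnnotatedAsset(name="SalesModel")
add_tag(asset, "Sales")
add_tag(asset, " sales ")
add_description(asset, "bi_developer", "Monthly sales by region")
add_description(asset, "data_analyst", "Use for revenue dashboards")
asset.data_controller = "sales-data-team"
asset.access_info = "Request access through the sales data team"
print(sorted(asset.tags))
print(len(asset.descriptions))
```

Keeping one description per contributor is a design choice made for this sketch; it reflects the idea that different data users can each enrich the metadata of the same data source.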

Azure Data Catalog in Database Lifecycle Management (DLM)

According to Microsoft documentation, Database Lifecycle Management (DLM) is a policy-based approach to managing databases and data assets, and this approach is applied right from the beginning of the database development process.

Database Lifecycle Management (DLM) defines the standards to be followed throughout the life of a database and data assets.

The basic Database Lifecycle Management (DLM) approach, excluding the post-deployment scenario, is applied in the following stages:

Database Design
Database Development
Database Testing
Database Code Build
Database Deployment

Now that we are familiar with the benefits of registering data sources with Azure Data Catalog, it is highly advisable to add one more stage after Database Deployment:

Data Assets Registration with Azure Data Catalog

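This is illustrated as follows:

[Figure: the DLM stages listed above, with Data Assets Registration with Azure Data Catalog added after Database Deployment]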

This ensures that every time a database is deployed it is also registered with Azure Data Catalog, so that it instantly becomes discoverable, understandable and consumable for both data consumers and data producers.
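The extended lifecycle can be sketched as an ordered list of stages with the registration stage placed right after deployment. This is a simplified illustration and not a real deployment pipeline: the function names are hypothetical, and the registration action only appends a name to a list, standing in for whatever registration mechanism your team actually uses.

```python
DLM_STAGES = [
    "Database Design",
    "Database Development",
    "Database Testing",
    "Database Code Build",
    "Database Deployment",
]

REGISTRATION_STAGE = "Data Assets Registration with Azure Data Catalog"


def with_registration(stages):
    """Return a new stage list with registration placed right after deployment."""
    extended = list(stages)
    position = extended.index("Database Deployment") + 1
    extended.insert(position, REGISTRATION_STAGE)
    return extended


def run_pipeline(stages, actions):
    """Run each stage in order; stages without an action are only logged."""
    log = []
    for stage in stages:
        action = actions.get(stage)
        if action is not None:
            action()
        log.append(stage)
    return log


registered = []
# Stand-in action: a real pipeline would call the registration tooling here.
actions = {REGISTRATION_STAGE: lambda: registered.append("SalesModel")}
log = run_pipeline(with_registration(DLM_STAGES), actions)
print(log[-1])
print(registered)
```

Because registration is a stage of the pipeline rather than a manual follow-up task, a deployment cannot complete without the data asset also being recorded.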

In the next part of this article (to be published soon), readers will learn the standard steps for registering a data asset with Azure Data Catalog and gain a practical understanding of the concepts discussed here.

Next Steps

This article covers the conceptual understanding of Azure Data Catalog. Before the next part arrives, consider a scenario where you have to convince your top management that Azure Data Catalog is a must for your organization. Here are some questions to answer:

Why do we need Azure Data Catalog when we only have a few databases?
Our databases are already well documented, so what is the point of getting them registered with Azure Data Catalog?
We have a dedicated reporting team. Why would we need to know about the location of the data sources if our reporting team can instantly create reports for us on demand?
How are you sure that Azure Data Catalog can successfully replace the understanding one gets with time through tribal knowledge?
We are a small team and we can quickly chat with any team member to ask about any data asset, so why should we spend time and effort on registering data assets with Azure Data Catalog?

You are welcome to answer any or all of the above questions by posting your comments in the comments section below.

Haroon Ashraf

Haroon Ashraf’s deep interest in logic and reasoning at an early age of his academic career paved his path to become a data professional.

He holds BSc (Gold Medal) and MSc Degrees in Computer Science and also received the OPF merit award.

His programming career began in 2006 working on his first data venture to migrate and rewrite a public sector database driven examination system from IBM AS400 (DB2) to SQL Server 2000 using VB 6.0 and Classic ASP along with developing reports and archiving many years of data.

His work and interests revolve around Business Intelligence and Database Centric Architectures, and his expertise includes database and report design, development, testing, implementation and migration.

He has recently earned a “Knowledge Management and Big Data in Business” certificate from The Polytechnic University of Hong Kong.

MSSQLTips Awards: Trendsetter (25+ tips) – 2020 | Author of the Year Contender – 2018-2020
