Designing a Data Lake Management and Security Strategy



Problem

Organizations that own established legacy data platforms undergo a mindset shift as they adopt modern cloud Data Lakehouse platforms. Traditional data warehouse platforms are highly mature big data storage technologies that deliver numerous benefits, including the capability to record data in an ACID-compliant manner and ensure high levels of integrity. While the traditional data warehouse has served the industry well for data analytics and business intelligence use cases, it may lack the flexibility to process semi-structured and unstructured data, and it can be expensive to implement and maintain. Data providers and consumers of the traditional data warehouse also face a number of disadvantages: data scientists struggle to conduct advanced analytics use cases, DBAs find these platforms challenging and time-intensive to maintain, and business stakeholders cringe at the high monthly bills.

The Data Lakehouse platform solves a number of these challenges. By decoupling storage from compute, the Lakehouse platform achieves cost efficiency. With object storage capabilities for seamlessly managing semi-structured and unstructured data files, it is well suited for advanced analytics use cases such as AI and ML. Built on distributed cloud computing platforms, it reaps the benefits of pay-as-you-go and pay-per-query pricing models. In addition to solving the age-old challenges of traditional data warehousing systems, the modern Data Lakehouse brings with it the mature capabilities of those systems, such as ACID compliance, indexing, partitioning, optimization techniques, metadata management, and querying with standard ANSI SQL. However, because the Data Lake, the underlying storage technology for the Lakehouse platform, takes a file-based rather than relational approach to recording data, it calls for different data management and security strategies. As an organization's Lakehouse implementation takes flight, it is frequently interested in deepening its understanding of how to design and implement a Data Lake management and security strategy.

Solution

The Data Lakehouse paradigm continues to win over organizations as they embark on digital innovation and transformation journeys to modernize their big data and advanced analytics platforms and enable growth and value-added business outcomes. Components of the modern Lakehouse, such as Apache Spark, are winning prestigious industry awards, including the recent Systems Award from the Association for Computing Machinery's Special Interest Group on Management of Data (ACM SIGMOD), which is presented to systems whose technical contributions have significantly impacted the theory or practice of large-scale data management. As a result of this continued uptick in Lakehouse innovation, numerous cloud providers, vendors, and solution integrators are supporting the Lakehouse-driven cloud adoption trajectory by building products and providing service solutions to design and implement the Modern Data Lakehouse Platform through well-architected frameworks.

Data Lake

The Data Lake plays a critical role in the Modern Data Lakehouse Platform, primarily because all enterprise data is stored within the lake in various formats. The lake can support structured, semi-structured, and unstructured data. Since storage within the lake is cheaper than OLTP databases or OLAP data warehouses, several big data and advanced analytics use cases with differing data velocities, such as batch and real-time, can easily be accommodated within the data lake. Also, most modern Data Lakes across cloud providers can enable a Hierarchical Namespace (HNS) feature, which allows the objects and files within the lake containers to be organized into pre-defined hierarchies of directories and nested subdirectories. A storage account with the hierarchical namespace enabled combines the scalability and cost-effectiveness of object storage with file system semantics that are familiar to analytics engines and frameworks. This HNS feature is also the key differentiator between a Data Lake and a general Blob Storage account. For example, within the Azure Cloud ecosystem, this HNS-enabled storage account is called Azure Data Lake Storage Gen2.
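To make these file system semantics concrete, here is a minimal sketch using the azure-storage-file-datalake Python SDK to create a nested directory hierarchy in an HNS-enabled account; the account and container names echo the examples later in this tip and should be treated as placeholders.

Example (Python):

# A minimal sketch, assuming an HNS-enabled (ADLS Gen2) storage account and the
# azure-storage-file-datalake and azure-identity packages. Names are illustrative.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://stglakedatadeu2.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# With HNS enabled, directories are first-class objects with file system
# semantics, rather than being simulated by name prefixes on flat blobs.
file_system = service.get_file_system_client("medallion")
file_system.create_directory("BRONZE/CUSTOMER/ABC_ELECTRONIC/Landing/2022/05/06")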

From a file format perspective, snappy-compressed Parquet files are typically the most optimal format for highly performant analytics workloads due to their columnar layout and compression ratios that can approach ~97.5%. An advancement on the Azure Data Lake Storage Gen2 account is the addition of support for Delta, which builds on the columnar Parquet file format best suited for analytics workloads. Delta transaction log files provide ACID transactions and isolation levels to Spark for processing data within your Lakehouse. Delta supports caching, time travel, merging datasets, optimization, and schema evolution, among other features, making it a prime candidate for data storage within the Modern Data Lakehouse Platform. The figure below of an Azure Modern Data Lakehouse Platform highlights the data lake's role within this platform. Zones (Bronze, Silver, Gold), which we will cover in a later section, can be designed to capture the various stages of data storage and processing in Delta format as your enterprise data gets ingested, transformed, and served downstream to a variety of consumers through workspaces and reporting tools. In addition to consumption-layer tools, most cloud technologies can also connect to the Data Lake.

Figure: Azure Modern Data Lakehouse Platform highlighting the ADLS Gen2 components
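As a quick illustration of writing data in Delta format with Spark, the sketch below reads raw Parquet from a bronze folder and appends it to a Delta table. It assumes a Spark session with Delta Lake support (as in a Databricks or Synapse Spark pool), and the paths are illustrative.

Example (Python):

# A minimal sketch, assuming a Spark session ("spark") with Delta Lake support.
bronze_path = "abfss://medallion@stglakedatadeu2.dfs.core.windows.net/BRONZE/CUSTOMER/ABC_ELECTRONIC/Landing/2022/05/06"
silver_path = "abfss://medallion@stglakedatadeu2.dfs.core.windows.net/SILVER/CUSTOMER/ABC_ELECTRONIC"

raw_df = spark.read.parquet(bronze_path)  # snappy-compressed Parquet is Spark's default

# Writing as Delta layers a transaction log over the Parquet files, adding
# ACID guarantees, time travel, and schema evolution.
raw_df.write.format("delta").mode("append").save(silver_path)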

Data Lake Design and Management

When designing and managing data within the Data Lake, it is important to focus on security implications early and to design data partitions together with authorization. We will explore security in a later section, but authorization is typically defined by Active Directory (AD) groups, Role-Based Access Control (RBAC), and Access Control Lists (ACL). When designing a data lake, keep in mind that some data redundancy may be an acceptable trade-off for security. Multiple levels of folder nesting are acceptable as long as they are applied consistently. It is also good practice to collocate similar file formats and datasets within a single folder structure. A good folder structure should not begin with date partitions; dates should reside at the lower folder levels. Defining and adhering to a naming convention is critical to a good data lake design, as is including time elements in the folder structure and file names where appropriate.

Zones typically define the root-level folder hierarchies in a data lake container. Zones do not always need to reside in the same physical data lake and could also reside in separate file systems, storage accounts, or even subscriptions. Multiple storage accounts in different subscriptions may be a good idea for large throughput requirements exceeding a rate of 20,000 requests per second. The figure below illustrates the typical zones within a Data Lake and their purpose. Modern Data Lakehouse platforms follow a medallion-style design in which the bronze zone is a raw storage layer containing data sourced from various on-premises and cloud-based systems, segregated by source system, dataset, and time-based (e.g., year, month, day) hierarchies; it serves as the enterprise's landing zone. The silver zone contains data that is sanitized, enhanced, and staged for further analysis; for example, this can include the de-sensitization of PII data and the deduplication of raw data from the bronze zone. Lastly, the gold zone contains transformed, aggregated, and modeled data processed from the silver zone; for example, this zone can include facts and dimensions and typically stores data ready for consumption by end users. Organizations introduce a variety of other zones as needed, dependent on the use case. We will explore the design of these and other zones in later sections.

Figure: Definition of the basic zones in a Data Lake

There are several options for storing and managing data within a Data Lake. Since the underlying storage of the lake is essentially object-based, folder and file hierarchies can be defined in many ways to meet the specific use cases of the organization, its customers, and their departmental or program-specific needs. The figure below illustrates some of these options. Zones can be defined by multiple folders in a single container, as shown in Option 1. Alternatively, zones can be scoped at the container level with multiple containers within a storage account, as shown in Option 2. Finally, zones can also be defined by multiple storage accounts, as shown in Option 3.

Figure: Lakehouse data management hierarchy options
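To see where the zone boundary lands in each layout, the hypothetical abfss URIs below restate the three options; account and container names are illustrative.

Example (Python):

# Hypothetical abfss:// URIs illustrating the three layout options above.
# Option 1: zones as root-level folders within a single container
option_1 = "abfss://medallion@stglakedatadeu2.dfs.core.windows.net/BRONZE/CUSTOMER"

# Option 2: one container per zone within a single storage account
option_2 = "abfss://bronze@stglakedatadeu2.dfs.core.windows.net/CUSTOMER"

# Option 3: one storage account per zone
option_3 = "abfss://data@stglakebronzedeu2.dfs.core.windows.net/CUSTOMER"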

Bronze Zone

The Bronze Zone serves the purpose of storing vast quantities, varieties, and velocities of raw data. This data can be batch or streaming and structured, semi-structured, or unstructured. The figure below shows a sample naming convention and hierarchical folder and file structure for designing a high-quality Bronze Zone. Notice that the naming conventions adhere to good design patterns: the dates reside within the lower levels of the folder structure, and the file name contains a timestamp.

Figure: Recommended hierarchy of the Bronze Zone

Here is an example of how this structure would appear after it is implemented.

Example:


Adventureworks-Dev>rg-lakedata-dev-eu2>stglakedatadeu2\medallion\BRONZE\CUSTOMER\ABC_ELECTRONIC\Landing\2022\05\06\ABC_ELECTRONIC_Full_12:45:34.parquet
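A small helper like the hypothetical sketch below can keep this convention consistent across ingestion pipelines; the function and parameter names are illustrative, not part of any Azure API.

Example (Python):

from datetime import datetime

def bronze_landing_path(domain: str, source: str, load_type: str, ts: datetime) -> str:
    # Dates sit at the lower folder levels and the file name carries a timestamp,
    # per the naming convention above. Note: ':' is valid in blob names but not
    # on local file systems, so swap it for '.' if files are staged locally.
    return (
        f"medallion/BRONZE/{domain}/{source}/Landing/"
        f"{ts:%Y}/{ts:%m}/{ts:%d}/{source}_{load_type}_{ts:%H:%M:%S}.parquet"
    )

print(bronze_landing_path("CUSTOMER", "ABC_ELECTRONIC", "Full", datetime(2022, 5, 6, 12, 45, 34)))
# medallion/BRONZE/CUSTOMER/ABC_ELECTRONIC/Landing/2022/05/06/ABC_ELECTRONIC_Full_12:45:34.parquet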

Silver Zone

The Silver Zone is the location where cleansed datasets are stored. A typical Silver Zone design structure might look like the figure below. In this scenario, if the data is stored in Delta format, it brings ACID compliance features such as time travel, logging, merging capabilities, and more. For that reason, the date-level folders are not necessary: any required inserts, updates, and deletes simply merge into the existing datasets.

Figure: Recommended hierarchy of the Silver Zone

Here is an example of how this structure would appear after it is implemented.

Example:

Adventureworks-Dev>rg-lakedata-dev-eu2>stglakedatadeu2\medallion\SILVER\CUSTOMER\ABC_ELECTRONIC\FileName001.delta
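A minimal merge sketch, assuming Spark with the delta-spark package, might look like the following; the paths and the customer_id join key are hypothetical.

Example (Python):

from delta.tables import DeltaTable

silver_path = "abfss://medallion@stglakedatadeu2.dfs.core.windows.net/SILVER/CUSTOMER/ABC_ELECTRONIC"
updates_df = spark.read.parquet(
    "abfss://medallion@stglakedatadeu2.dfs.core.windows.net/BRONZE/CUSTOMER/ABC_ELECTRONIC/Landing/2022/05/06"
)

# Upsert the cleansed batch into the existing Silver Delta table: matched rows
# are updated in place and new rows inserted, so no date-level folders are needed.
(
    DeltaTable.forPath(spark, silver_path).alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)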

Gold Zone

The Gold Zone is the location where curated datasets are stored. Like the Silver Zone, the Gold Zone also stores data in Delta format. A typical Gold Zone design structure might look like the figure below. Finalized, consumption-ready data, such as data products that are aggregated or further modeled into dimensions and facts, would typically reside in this zone. The data can then be queried through the federated querying capabilities of workspaces such as Databricks and Synapse. Furthermore, this Delta-formatted data can be connected to and consumed by modern cloud-based reporting tools like Power BI.

Figure: Recommended hierarchy of the Gold Zone

Here is an example of how this structure would appear after it is implemented.

Example:

Adventureworks-Dev>rg-lakedata-dev-eu2>stglakedatadeu2\medallion\GOLD\CUSTOMER\ABC_ELECTRONIC\FileName001.delta
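As a sketch of downstream consumption, a workspace such as Databricks or Synapse could query the Gold Delta table with standard SQL; the path and column names below are hypothetical.

Example (Python):

gold_path = "abfss://medallion@stglakedatadeu2.dfs.core.windows.net/GOLD/CUSTOMER/ABC_ELECTRONIC"

# Expose the curated Delta table to SQL and run an aggregate, as a reporting
# tool or federated query engine would.
spark.read.format("delta").load(gold_path).createOrReplaceTempView("gold_customer")
spark.sql("""
    SELECT region, SUM(lifetime_value) AS total_clv
    FROM gold_customer
    GROUP BY region
""").show()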

Masked Zone

In certain scenarios, organizations may need to mask, encrypt, and protect Personally Identifiable Information (PII) and then provide these masked datasets to either internal stakeholders or external vendors for analysis or further processing. In this case, a Masked Zone would need to be created. The structure of a Masked Zone may look like the figure shown below. In this scenario, the Masked Zone would only contain masked data to be shared with external vendors. Furthermore, fine-grained access controls can be implemented to granularly control which folders are read-only and which also allow write operations. For even more fine-grained access control, each unique project and vendor can have its own container with its own custom security policies.

Figure: Recommended hierarchy of the Masked Zone

Here is an example of how this structure would appear after it is implemented.

Example:


Adventureworks-Dev>rg-lakedata-dev-eu2>stglakedatadeu2\masked-vendorA\ToVendor-VendorA-CustomerLifeTimeValue\CustomerLifeTimeValue001.parquet
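One hedged sketch of populating the Masked Zone is shown below: a PySpark job hashes a direct identifier and drops raw PII columns before writing vendor-facing Parquet. The column names are hypothetical, and production masking should follow your organization's data protection standards.

Example (Python):

from pyspark.sql.functions import col, sha2

silver_path = "abfss://medallion@stglakedatadeu2.dfs.core.windows.net/SILVER/CUSTOMER/ABC_ELECTRONIC"

masked_df = (
    spark.read.format("delta").load(silver_path)
    .withColumn("customer_key", sha2(col("email"), 256))  # one-way hash replaces the identifier
    .drop("email", "phone", "date_of_birth")              # drop the raw PII columns
)

# Land the masked output in the vendor-facing container (lowercase in practice,
# since Azure container names do not allow uppercase characters).
masked_df.write.mode("overwrite").parquet(
    "abfss://masked-vendora@stglakedatadeu2.dfs.core.windows.net/ToVendor-VendorA-CustomerLifeTimeValue"
)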

Sensitive Zone

Unlike the Masked Zone, which holds masked data, a Sensitive Zone serves the purpose of storing unmasked, sensitive data that may need to be analyzed and processed by data scientists and machine learning models. This could include customer ages, demographics, income, and other PII that could be used to determine customer lifetime value, churn, or other advanced analytics outcomes. This data may have varying levels of sensitivity that could be further split out by red, yellow, and green tag indicators, and these color-coded folders could have their own granular access and security controls applied. The structure of a Sensitive Zone may look like the figure shown below.

Figure: Recommended hierarchy of the Sensitive Zone

Here is an example of how this structure would appear after it is implemented.

Example:

Adventureworks-Dev>rg-lakedata-dev-eu2>stglakedatadeu2\medallion\RED\CustomerLifeTimeValue001.parquet
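A hypothetical mapping like the one below can document how the color-coded tiers line up with folders and the AD groups that guard them; the names follow this tip's conventions but are illustrative only.

Example (Python):

# Hypothetical tier registry: each color-coded folder gets its own AD group so
# granular access and security controls can be applied per sensitivity level.
SENSITIVITY_TIERS = {
    "RED":    {"path": "medallion/RED",    "ad_group": "Az-lakedata-sensitive-red-dev"},
    "YELLOW": {"path": "medallion/YELLOW", "ad_group": "Az-lakedata-sensitive-yellow-dev"},
    "GREEN":  {"path": "medallion/GREEN",  "ad_group": "Az-lakedata-sensitive-green-dev"},
}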

Archive Zone

An Archive Zone can serve the purpose of cold storage for unused historical datasets. Organizations may choose to retain historical data in a separate storage account since this data can be quite voluminous. The structure would follow the same pattern as the original hot data and could be split out by program at the container level and by zone at the root hierarchical level. Since this is a separate storage account, archive-centric security groups could be created and assigned RBAC roles on the account. This way, only members of the archive AD groups would be able to access the data in the Archive Zone. In addition, granular ACLs could be applied when necessary. The structure of an Archive Zone may look like the figure shown below.

Figure: Recommended hierarchy of the Archive Zone

Here is an example of how this structure would appear after it is implemented.

Example:


Adventureworks-Dev>rg-lakedata-dev-eu2>stglakearchivedatadeu2\CLV\GOLD\CUSTOMER_LIFETIME_VALUE\CustomerLifeTimeValue001.delta
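Beyond a separate storage account, individual blobs can also be pushed to the Archive access tier. The sketch below uses the azure-storage-blob SDK; the account, container, and blob names echo the example above but are hypothetical.

Example (Python):

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient(
    account_url="https://stglakearchivedatadeu2.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Move a historical file to the Archive (cold) tier: it stays durable and cheap
# but must be rehydrated before it can be read again.
blob = blob_service.get_blob_client(
    "clv",  # container names must be lowercase in Azure
    "GOLD/CUSTOMER_LIFETIME_VALUE/CustomerLifeTimeValue001.delta",
)
blob.set_standard_blob_tier("Archive")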

Data Lake Security

Security and access management within a Lakehouse is similar to that of traditional Relational Database Management Systems (RDBMS) in that user and group permissions can be managed via Active Directory groups. Azure Storage has a management layer that controls access to subscriptions and storage accounts, while containers, folders, and files are accessed through the data layer. Service API endpoints are available for working with both the management and data layers. The following sections detail the capabilities of the management and data layers within the Data Lake. Data Lake authorization mechanisms include Shared Key authorization, Shared Access Signature (SAS) authorization, Role-Based Access Control (RBAC), and Access Control Lists (ACL).

Shared Key and SAS authorization grant access without requiring users to have an identity in Azure Active Directory (Azure AD). RBAC and ACLs, on the other hand, require users to have an Azure AD identity. RBAC grants access at the storage account data level, such as read or write access to all data in a storage account, while ACLs grant granular access, such as write access to a specific directory or file. It is important to note that with Shared Key and SAS authorization, Azure RBAC and ACLs have no effect, since those authorizations are evaluated independently. When both RBAC and ACLs are applied, RBAC is evaluated first and takes priority over any ACL assignments; the ACL will not be evaluated if the operation is fully authorized based on RBAC. Users holding the Storage Blob Data Owner built-in role are another exception, since they have super-user access. The following sections focus on RBAC and ACL.
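The toy model below (plain Python, not an Azure API) restates that evaluation order; the role-to-operation mapping is simplified for illustration.

Example (Python):

# Simplified illustration of the authorization evaluation order described above.
ROLE_OPS = {
    "Storage Blob Data Reader": {"read"},
    "Storage Blob Data Contributor": {"read", "write", "delete"},
}

def authorize(operation, has_shared_key=False, has_valid_sas=False,
              rbac_roles=(), acl_ops=frozenset()):
    # Shared Key / SAS are evaluated independently: RBAC and ACLs have no effect.
    if has_shared_key or has_valid_sas:
        return True
    # Storage Blob Data Owner is a super-user and short-circuits everything else.
    if "Storage Blob Data Owner" in rbac_roles:
        return True
    # RBAC is evaluated first; if it fully authorizes the operation, ACLs are skipped.
    if any(operation in ROLE_OPS.get(role, set()) for role in rbac_roles):
        return True
    # Otherwise fall back to the POSIX-style ACL on the directory or file.
    return operation in acl_ops

print(authorize("read", acl_ops={"read"}))   # True  (granted via ACL)
print(authorize("write", acl_ops={"read"}))  # False (no RBAC role, no ACL write)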

Role Based Access Controls

Role-Based Access Control (RBAC) roles can carry permissions for management-layer or data-layer access. The figure below lists the typical built-in roles available for storage accounts and briefly describes the permissions each role grants. For example, the Owner, Contributor, Storage Account Contributor, and Reader roles are management-level RBAC roles that grant specific access to the storage account but do not grant access to the underlying data within the account. In addition to these management-plane roles, a user needing access to the data will need a data-plane role assigned, such as Storage Blob Data Owner, Storage Blob Data Contributor, or Storage Blob Data Reader.

Figure: Role-based access control roles and descriptions

Access Control Lists

Access Control Lists (ACL) support granular-level access to directories and files. The following figure shows the available ACL permissions and how they correspond to accessing directories and files within the Data Lake. ACLs may be a good authorization candidate for granting granular access to files and folders. For example, when granting vendors access to data within folders of the same container, a ToVendor folder can have read-only permissions and a FromVendor folder read-write permissions, both managed by ACLs.

Figure: Access control list permissions
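The ToVendor/FromVendor pattern could be applied with the azure-storage-file-datalake SDK as sketched below; the container, folder names, and group object ID are hypothetical (and real container names must be lowercase).

Example (Python):

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://stglakedatadeu2.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
file_system = service.get_file_system_client("masked-vendora")
vendor_group = "00000000-0000-0000-0000-000000000000"  # object ID of the vendor's AD group

# Read + execute (list/traverse) on the outbound folder; the "default" entries
# make newly created children inherit the same ACL.
file_system.get_directory_client("ToVendor-VendorA").update_access_control_recursive(
    acl=f"group:{vendor_group}:r-x,default:group:{vendor_group}:r-x"
)

# Read, write, and execute on the inbound folder the vendor drops files into.
file_system.get_directory_client("FromVendor-VendorA").update_access_control_recursive(
    acl=f"group:{vendor_group}:rwx,default:group:{vendor_group}:rwx"
)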

Active Directory Groups and Permissions

As you tie together AD groups, RBAC (management and data layers), and ACL authorizations to devise an outline of your enterprise's AD groups and permissions, the result may look like the figure below. This illustration lists your organization's AD group naming convention, crisp descriptions of the AD groups, sample AD groups, and the RBAC- and ACL-level access that will be granted to each group. The membership approvers column is intended to begin capturing the automated approval workflows for granting members access to the groups.

Figure: Active Directory groups and their RBAC and ACL permissions

Enterprise Data Lake Structure and Security

As we continue to tie folder structures, RBAC, and ACL together, the figure below depicts which AD groups have RBAC or ACL assigned at the resource and directory level.

Figure: Data Lake structure and security

Workflows

Once a thoroughly designed and secured Data Lake is in place, with corresponding AD groups holding the relevant access, an automated approval workflow for granting membership to the AD groups, and thereby to the accompanying granular data, will save administrators the time of manually managing security and permissions within the lake. In the figure below, for example, a user creates a ticket via a Solution Portal with a membership request for the desired AD group and directs the request to the pre-defined administrating team. Membership in the AD group Az-lakedata-Dev, for instance, grants access to the Data Lake storage account via RBAC.

Figure: Typical workflow for requesting access to an AD group

When working with your workflow automation team, it is useful to provide the workflow details, as shown in the figure below, to support them as they build out the required automation and UI experience for users submitting access requests.

Figure: Sample Solution Portal workflow details

Summary

Designing and managing a Data Lake is a highly customized undertaking that requires deep planning and collaboration between technology, security, governance, infrastructure, and business stakeholders. Because there is so much room for design and management customization, the Data Lake can support different levels of governance and control over organizational data. The Azure ecosystem contains numerous resources that integrate with Active Directory (AD) and are constantly integrating more deeply with it; for example, Databricks recently released the capability to provision users and groups from AD using SCIM directly within the Azure portal. The Azure Data Lake uses AD for its RBAC authorization, meaning that membership in a specific AD group can grant access to multiple Azure resources, and with ACLs, security can be fine-grained to protect data access within the internal organization and across external vendors. A good data lake design accounts for zones, the hierarchical folder structures within those zones, and the security authorizations granted to the data within them. Once the secured Data Lake has been designed and implemented, membership to AD-governed resources secured by RBAC can be granted through automated workflows containing approval gates. These are the key recommended steps for building a secure, performant, well-governed Modern Data Lakehouse Platform.




About the author
Ron L'Esteve is a trusted information technology thought leader and professional author residing in Illinois. He brings over 20 years of IT experience and is well-known for his impactful books and article publications on Data & AI architecture, engineering, and cloud leadership. Ron completed his Master's in Business Administration and Finance from Loyola University in Chicago.

This author pledges the content of this article is based on professional experience and not AI generated.


Article Last Updated: 2022-09-13
